String operator in Pascal

A string is a data type whose values are sequences of characters. A string variable can occupy a fixed number of bytes or have a variable length.

How a type is represented in memory

Some programming languages impose a maximum string length, while others do not. With Unicode, a single character can occupy several bytes: 1 to 4 in UTF-8, 2 or 4 in UTF-16, and 4 in UTF-32.

This data type has several shortcomings:

  1. A single string can take up a considerable amount of memory (up to tens of megabytes);
  2. The size is not known in advance, which complicates editing the text.

How strings are laid out in memory

There are two different approaches to representing a string in memory.

Array

In this approach, the string is stored as an array of characters, and its size is stored separately, in a dedicated field. This approach was first used in the Pascal language, which is why such strings are called Pascal strings.

A somewhat optimized variant of this method is the c-addr u format (a character-aligned address plus an unsigned length), used in Forth. It differs from Pascal strings in that the size is not stored together with the character data but travels as part of the pointer to the string.
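The length-prefixed layout described above can be sketched in a few lines of Python. This is only an illustration, not how any particular compiler implements it: the helper names `to_pascal_string` and `from_pascal_string` are hypothetical, and the one-byte length field is an assumption modeled on classic short strings limited to 255 characters.

```python
def to_pascal_string(text: str) -> bytes:
    """Encode text as one length byte followed by the character bytes."""
    data = text.encode("ascii")
    if len(data) > 255:
        # Assumption: the length field is a single byte.
        raise ValueError("a one-byte length field holds at most 255")
    return bytes([len(data)]) + data


def from_pascal_string(buf: bytes) -> str:
    """Read the length byte, then take exactly that many characters."""
    length = buf[0]
    return buf[1:1 + length].decode("ascii")


packed = to_pascal_string("Pascal")
print(packed)                      # b'\x06Pascal'
print(from_pascal_string(packed))  # Pascal
```

Because the length is read first, the decoder never has to scan the data to find where the string ends.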

Advantages:

  • The program knows the size of the string at any time, so operations such as appending, copying, and getting the length are fast;
  • The string can contain any data, including zero bytes;
  • Accesses outside the bounds of the string can be detected while it is processed;
  • Operations like "take the Nth character from the end" are performed quickly.

Disadvantages:

  • Storing and processing characters of variable width is problematic;
  • Extra memory is needed for the length field, which adds up when there are many short strings;
  • Strings cannot have unlimited size. In more modern languages this limit matters little, because the maximum can reach 4 gigabytes;
  • With variable-width encodings (such as UTF-8), the stored size is the length of the string in bytes rather than the number of characters, so the character count has to be computed separately.
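The last point is easy to observe in practice. The snippet below is a small sketch in Python (the variable names are illustrative) showing that the byte length a size field would hold differs from the number of characters once a non-ASCII character appears.

```python
text = "na\u00efve"                 # 5 characters, one of them non-ASCII
encoded = text.encode("utf-8")      # U+00EF takes 2 bytes in UTF-8

byte_length = len(encoded)                  # what a size field would store
char_count = len(encoded.decode("utf-8"))   # requires decoding the bytes

print(byte_length, char_count)      # 6 5
```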

"Final Byte"

The second method uses a final (terminating) byte. One reserved byte value (usually 0) marks the end of the string, and the string itself is stored as a sequence of bytes up to that marker. Some systems use the value 255 instead of zero as the terminator.

This method has several names:

  1. ASCIIZ (ASCII characters followed by a zero byte);
  2. C-strings (the method is most widespread in the C language);
  3. Zero-terminated (null-terminated) strings.
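A zero-terminated string can be modeled with Python byte strings. This is a sketch under stated assumptions: `to_c_string` and `c_strlen` are hypothetical helper names, and the text is assumed to be ASCII. Note that finding the length requires walking the bytes one by one.

```python
def to_c_string(text: str) -> bytes:
    """Encode text as its bytes followed by a terminating zero."""
    data = text.encode("ascii")
    if 0 in data:
        raise ValueError("the terminator value cannot appear in the text")
    return data + b"\x00"


def c_strlen(buf: bytes) -> int:
    """Count bytes up to the first zero with a linear scan."""
    n = 0
    # Raises IndexError if the buffer has no terminator at all.
    while buf[n] != 0:
        n += 1
    return n


stored = to_c_string("Pascal")
print(stored)            # b'Pascal\x00'
print(c_strlen(stored))  # 6
```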

Advantages:

  • No service information is stored besides the final byte;
  • A string can be represented without creating a new data type;
  • There is no limit on the size of the string;
  • Economical use of memory;
  • Passing a string to a function is simple: only a pointer to its first character is passed.

Disadvantages:

  • Operations that need the length of the string, such as getting the length or concatenation, take longer because the whole string has to be scanned;
  • There is no built-in check against running past the end of the string;
  • If the final byte is damaged, a significant area of memory can be corrupted, which can lead to data loss or a program crash;
  • The value of the final byte cannot be used as a character inside the string;
  • Some encodings in which a character spans several bytes (such as UTF-16) cannot be used, because one of the bytes of many characters is zero.
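The last two disadvantages can be demonstrated with a short sketch. The function `read_until_zero` is a hypothetical stand-in for any routine that trusts the terminator; it is not taken from a real library.

```python
def read_until_zero(buf: bytes) -> bytes:
    """Return the bytes before the first zero, as a C-style reader would."""
    end = buf.index(0)
    return buf[:end]


# A zero inside the data cuts the string short.
payload = b"AB\x00CD" + b"\x00"
print(read_until_zero(payload))              # b'AB'

# UTF-16 text contains zero bytes even for plain Latin letters.
utf16 = "Hi".encode("utf-16-le")
print(utf16)                                 # b'H\x00i\x00'
print(read_until_zero(utf16 + b"\x00\x00"))  # b'H'
```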

Using two methods at the same time

In some languages, such as Oberon, a string is placed in a character array of fixed length, and its end is marked with a null character. This method combines many advantages of both approaches and avoids most of their disadvantages.
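The combined scheme can be sketched as a fixed-capacity, zero-filled array. The class `FixedString` below is a hypothetical illustration of the idea, not a reproduction of how Oberon implements it; ASCII text is assumed.

```python
class FixedString:
    """Fixed-capacity character array whose end is marked by a zero."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)   # every byte starts as zero

    def assign(self, text: str) -> None:
        data = text.encode("ascii")
        # One slot is kept for the terminating zero.
        if len(data) > len(self.buf) - 1:
            raise ValueError("text does not fit in the array")
        # Overwrite the whole array, padding the tail with zeros.
        self.buf[:] = data + bytes(len(self.buf) - len(data))

    def value(self) -> str:
        end = self.buf.index(0)
        return self.buf[:end].decode("ascii")


s = FixedString(8)
s.assign("Oberon")
print(s.value())     # Oberon
print(bytes(s.buf))  # b'Oberon\x00\x00'
```

The known capacity makes bounds checking possible, while the terminator keeps the data readable by code that expects zero-terminated strings.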

String operations

Simple operations

These include:

  • Accessing a character by its index;
  • Concatenating one string with another.
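Both simple operations look like this in Python. The variable names are illustrative; note that Python counts characters from 0, whereas classic Pascal strings are indexed from 1.

```python
s = "Pascal"

first = s[0]               # access by index: 'P'
joined = s + " strings"    # concatenation

print(first)   # P
print(joined)  # Pascal strings
```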

Derived operations

Include:

  • Extracting a substring by the indexes of its first and last characters;
  • Finding and replacing a substring;
  • Checking whether strings coincide (equality comparison);
  • Getting the length of the string;
  • Folding (reducing) the string when it is treated as a list of characters;
  • Filtering the characters by a criterion, again treating the string as a list.
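The derived operations listed above map onto standard Python facilities. The snippet is a sketch with illustrative variable names; the fold and the filter treat the string as a list of characters.

```python
from functools import reduce

s = "string operations"

sub = s[0:6]                             # substring by start and end index
replaced = s.replace("string", "text")   # find and replace a substring
same = (s == "string operations")        # equality comparison
length = len(s)                          # length of the string

# Fold: reduce the characters to a single value (here, the number of spaces).
spaces = reduce(lambda acc, c: acc + (1 if c == " " else 0), s, 0)

# Filter: keep only the characters that satisfy a criterion (here, vowels).
vowels = "".join(filter(lambda c: c in "aeiou", s))

print(sub, replaced, same, length, spaces, vowels)
```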

More complex operations

Imply:

  • Finding the shortest superstring that contains all the given strings;
  • Finding matching sequences in two different arrays of strings;
  • Natural-language processing tasks;
  • Measuring the similarity of strings according to a given criterion;
  • Detecting the encoding and the language of a text.
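The text does not name a specific similarity criterion. One common choice is the edit (Levenshtein) distance, the minimum number of single-character insertions, deletions, and substitutions that turn one string into another. The function `edit_distance` below is a minimal sketch of that technique, chosen here only as an example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

A smaller distance means more similar strings; identical strings have distance 0.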

Representation of string characters

In the past, characters were encoded as 1 character = 1 byte, that is, 8 bits (7-bit encodings also existed). This gave 256 possible values (128 with a 7-bit encoding), which is not enough to represent the characters of all languages. The following methods were used to solve this problem:

Switching languages with control codes. With this method, a sequence of characters loses its meaning when the control code at its beginning is missing, but the method still found use on the ZX Spectrum and BK computers.

Using several bytes for every character (UTF-16 and UTF-32). This method breaks compatibility with older libraries for working with text: a zero byte can occur anywhere in the string, which interferes with libraries that treat it as the end of the string.

Using a variable-width encoding (UTF-8). This method makes it impossible to address a character directly by its index, but it keeps compatibility with legacy libraries.
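The trade-off of a variable-width encoding can be sketched as follows. The helper `utf8_char_at` is hypothetical: it finds the Nth character by locating the bytes that start a character (every byte that is not a UTF-8 continuation byte of the form 10xxxxxx), which requires a pass over the data. Valid UTF-8 input is assumed.

```python
def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) by walking the UTF-8 bytes."""
    # Bytes of the form 10xxxxxx continue a character; all others start one.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    if not 0 <= n < len(starts):
        raise IndexError("character index out of range")
    begin = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")


data = "d\u00e9j\u00e0 vu".encode("utf-8")
print(len(data))               # 9 bytes for 7 characters
print(utf8_char_at(data, 3))   # the 4th character, U+00E0

# Plain ASCII text is byte-for-byte identical in UTF-8,
# which is what keeps legacy libraries working.
print("abc".encode("utf-8") == "abc".encode("ascii"))  # True
```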