Last Updated:

Perl : Collection of recipes for professionals

1.0. Introduction

Many programming languages force us to think at an inconvenient low level. You need a string, and the language wants you to work with pointers or bytes. Such trifles only distract the programmer from the main task. However, do not despair - Perl does not belong to low-level languages, and it is convenient to work with strings.

Perl was designed for word processing. In fact, there are so many textual operations in Perl that they cannot be described in one chapter. Text processing recipes are found in other chapters as well. In particular, refer to Chapter 6 "Pattern Search" and Chapter 8 "File Contents" for interesting techniques not covered in this chapter.

The basic unit for working with data in Perl is a scalar, that is, a separate value stored in a separate (scalar) variable. Scalar variables store strings, numbers, and references. Arrays and hashes are lists or associative arrays of scalars, respectively. References are used for indirect references to other quantities; they are somewhat similar to pointers in low-level languages. Numbers are usually stored in a real format with double precision. Strings in Perl can be arbitrarily long (limited only by your computer's amount of virtual memory) and contain arbitrary data—even binary sequences with zero bytes.

The Perl string is not an array of characters or bytes. A single character cannot be accessed by index as an array element by using the substr function. Rows, like all Perl data types, grow and decrease in size as needed. Unused data is destroyed by the Perl garbage collection system (usually when a variable leaves scope or after evaluating an expression that contains a string). In other words, you can not worry about memory management - this has already been taken care of before you.

The scalar value can be definite or indeterminate. A certain value contains a string, a number, or a reference. The only undefined value is undef, all other values are considered defined - even 0 and an empty string. However, certainty should not be confused with logical truth; to check whether a value is defined, use the defined functions. Logical truth has a special meaning, which is checked by the logical operators && and ||, as well as in the condition of the while block.

Two defined strings are considered false: an empty string ("") and a single-length string containing the number "zero" ("0"). All other defined values (such as "false", 15, or \$x) are true. It may surprise you somewhat that the string value "0" is considered false, but that's because Perl performs conversions between numbers and strings as needed. The values 0., 0.00 and 0.0000000 are numbers and therefore without quotation marks are considered false, since the number "zero" in any guise is false. However, these same three string values ("0.", "0.00", and "0.0000000") become true when they are used in the form of string literals enclosed in quotation marks, or read from a command line, an environment variable, or an input file.

Typically, these differences are not significant because when you use a value in a numeric context, the conversion is automatically performed. But if a value has not already been used in a numerical context, then checking its truth or falsity sometimes leads to an unexpected result - logical checks never force any transformations. Adding 0 to a variable causes Perl to convert the string to a number:

 

print "Gimme a number: ";
0.00000
chomp($n = ); # $n now contains "0.00000";
print "The value $n is ", $n ? "TRUE" : "FALSE", "\n";
The value 0.00000 is TRUE

$n += 0;
print "The value $n is now ", $n ? "TRUE" : "FALSE", "\n";
The value 0 is now FALSE

In a string context, the undef value is interpreted as an empty string (""). In the numerical context, undef is interpreted as 0, and in the reference context, it is interpreted as a null reference. In all cases, it is considered false. Using an indeterminate value where Perl expects to receive a specific value results in a run-time warning in STDERR (unless warnings have been suppressed). A simple check of truth or falsity does not require a specific value, so a warning is not issued in this case. Some operations do not generate warnings when you use variables that contain undefined values. These include the automatic increment and decrease operators, ++ and --, as well as addition and concatenation with assignment, += and .=.

In programs, strings are written in apostrophes or quotation marks, in the form of q// or qq// or "embedded documents" (here-documents). Regardless of the record form you choose, string literals are divided into interpolated and non-interpolated literals. Interpolation refers to the replacement of references to variables and special sequences of characters. In most cases, by default, interpolation is performed - in particular, in templates (/regex/) and when executing commands ($x = 'cmd').

In some situations, individual symbols have a special interpretation. With the \ prefix, any special character becomes a regular character; in other words, it becomes a simple literal. This conversion is commonly referred to as escaping.

In the canonical version of creating non-interpolated string literals, the string is enclosed in an apostrophe. In such lines, only three special sequences are recognized: ' completes the string, \' inserts an apostrophe into it, and \\ inserts a backslash:

$string = '\n'; # Two characters, \ and n $string = 'Jon \'Maddog\' Orwant'; # Internal apostrophe literals

Strings enclosed in quotation marks can interpolate variable names (but not function calls - see recipe 1.15 for how to do this). They also support various service sequences: "\n" (line feed), "\033" (octal character 33), "\cJ" - Ctrl+J, "\x1B" (hexadecimal character 0x1B), etc. For a complete list, see the perlop(1) manual page.

$string = "\n"; # Newline character $string = "Jon \"Maddog\" Orwant"; # Internal quotes

If the string does not contain extensible service sequences or variables, you can use any record you want. When choosing between 'this' and 'this', some Perl programmers prefer the second option to make the lines stand out better. In addition, quotation marks prevent even the slightest probability that the reader of the program will confuse a simple apostrophe with the reverse. For Perl, this is insignificant, but reading the program is simplified.

The operators q// and qq// allow the use of arbitrary bounders with interpolated and non-interpolated literals; they are analogous to strings enclosed in apostrophes and quotation marks, respectively. For example, to write a non-interactive string with internal apostrophes, it is easier to use the q// operator instead of using the escaped \' characters:

$string = 'Jon \'Maddog\' Orwant'; # Inner single quotes $string = q/Jon 'Maddog' Orwant/; # Same thing, but more visual

The separators may be identical characters, as in / in the example given, or any of four combinations of paired bounders (different types of brackets):

$string = q[Jon 'Maddog' Orwant]; # Internal apostrophes $string = q{Jon 'Maddog' Orwant}; # Internal apostrophes $string = q(Jon 'Maddog' Orwant); # Internal apostrophes $string = q<jon 'maddog'="" orwant="">; # Internal apostrophes

The concept of "embedded documents" is borrowed from shells and is intended to define strings that contain large amounts of text. Text can be interpreted according to the rules for strings enclosed in apostrophes or quotation marks, and even as a list of executable commands - depending on how the final identifier is set. Non-interactive embedded documents do not expand the three main service sequences, which expand in literals enclosed in apostrophes. For example, the following inline document will be interpreted according to the rules for lines enclosed in quotation marks:

$a = <<"EOF";
This is a multiline here document
terminated by EOF on a line by itself
EOF

Note that there is no semicolon after the final EOF limiter. The embedded documents are discussed in more detail in Recipe 1.16.

Universal character encoding

From a computer's point of view, any data (even a chain of bits) is a sequence of individual numbers. Even a text string is just a sequence of numeric codes that are interpreted as characters by browsers, email programs, editors, and printing systems.

In those days when memory was scarce, and it cost much more, programmers did real miracles for the sake of saving memory. In practice, techniques such as packing six characters into a single 32-bit word or three characters into a single 16-bit word were often used. Even today, the length of numeric codes used to denote individual characters usually does not exceed 7 or 8 bits (as in ASCII and Latin1 encodings, respectively).

A small number of bits per character means that the number of characters represented is also small. As you know, the palette of a graphic file with an 8-bit color is limited to only 256 colors. Similarly, when storing characters as separate octets (that is, bytes consisting of 8 bits), a document can contain no more than 256 different letters, punctuation marks, and characters.

AsCII (American Standard Code for Information Interchange) did not solve all the problems outside the United States, as it included only the characters of the slightly truncated American dialect of English. Because of this, many countries have developed their own, incompatible 8-bit encodings based on 7-bit ASCII encoding. Conflicting schemes for assigning numeric codes from one limited interval to characters have appeared. This meant that in different systems, the same number could denote different symbols, and different codes could be associated with the same symbol.

One of the first attempts to solve these and other problems due to national and linguistic specificity was local contexts( locales). They still do a good job of not related to character encoding, such as setting up regional settings (currency format, date and time) and even processing merge sequences. But in the realm of using 8-bit space for different encodings, local contexts are much less useful.

If you need to create a document that contains Latin and Greek characters, as well as Cyrillic, you will have big problems because the same numeric code can represent different characters in each of these encodings. For example, the code 196 represents the character D in the ISO 8859-1 (Latin 1) encoding, and in the ISO 8859-7 encoding, this code corresponds to the Greek letter ?. It turns out that the program interpreting the symbol code in ISO 8859-1 will see one character, and in 8859-7 the symbol will be completely different.

Differences in interpretation make it difficult to use different encodings in the same document at the same time. However, even if you somehow manage to combine them, only a few programs will be able to work with the received text. To correctly identify symbols, you need to know from which system they were taken, and this does not allow you to easily move from one system to another. If the guess turns out to be incorrect, instead of a meaningful text, an abracadabra will appear on the screen (and then at best).

Unicode support in Perl

Unicode comes to the aid of the programmer.

This encoding attempts to bring together all the character sets of the entire world, including numerous non-alphabetic characters and even fictional character sets. Unicode allows you to use tens of thousands (and even more) of different characters in a document without any confusion.

All problems with D and ? instantly disappear. The first character, which is formally called the "Latin letter A uppercase with three", is assigned the code U+00C4 (the recommended unicode format). The second character, "Greek uppercase delta", now corresponds to the code U+0394. Different characters always correspond to different codes, which eliminates any conflicts.

Unicode support in Perl has been around since version 5.6, but it's only since version 5.8 that it's become really reliable and practical. At the same time, Perl introduced I/O levels and their programming support; this coincidence is no coincidence. This topic is discussed in more detail in Chapter 8.

All string functions and Perl operators, including those used in template search, now work not with octets, but with symbols. For example, when the length function is called to a string, Perl returns the size of the string in characters instead of bytes. When the substr function extracts the first three characters from a string, the length of the result may be different from three bytes: or it may be equal to three bytes. You don't know that, and it doesn't matter. The basic low-level view shouldn't be looked at too closely at all – if you have to think about it, you're probably looking at it from too close a distance. The choice of view should not affect the programmer's work - and if it does, it may mean that the Implementation of Perl is not yet perfect. We're working on that.

Support for characters with codes greater than 256 means that the chr function argument is no longer limited to 256, and the ord function can return numbers larger than that value. For example, if you query chr(0x394), the uppercase Greek letter "delta" will be returned:

$char = chr(0x394);
$code = ord($char);
printf "char %s is code %d, %#04x\n", $char, $code, $code;
char ? is code 916, 0x394

When you check the length of such a string, you will get 1 because the string contains only one character. Note: we are talking about characters, not the length of the string in bytes. Of course, in the internal representation, such a large numeric code cannot be represented by only 8 bits. But the programmer should work with symbols as abstractions, not as physical octets. All low-level details of this kind are best left to Perl.

Do not consider characters and bytes to be equivalent concepts. By mixing bytes with symbols, you fall into the same sin as C programmers, who flippantly mix integers with pointers. On some platforms, internal representations may coincide, but this is just a random coincidence, and mixing abstract interfaces with physical implementations will sooner or later hit the programmer himself.

There are several ways to include Unicode characters in Perl literals. If your text editor allows you to enter Unicode directly into Perl programs, you can report this using the use utf8 directive. Another method is based on using the \x service sequences in interpolated Perl strings and specifying a hexadecimal character code (for example, \xC4). If the character code is larger than 0xFF, it will take more than two hexadecimal digits to represent it, so these codes must be enclosed in curly braces:

print "\xC4 and \x look different\n";
char D and ? look different

Recipe 1.5 describes how to use character names to include \N constructs in string literals. can be specified as \N{GREEK CAPITAL LETTER DELTA}, \N and even just \N.

To work with Unicode in Perl, this is enough, but for Perl to interact with other programs, you will need something else.

In older single-byte character sets (such as ASCII or ISO 8859-n), the numeric character NN output was one byte with the numeric value NN. The specific output depended on the fonts available, the local context chosen, and a number of other factors. But in Unicode, there is no longer a one-to-one correspondence between logical character codes (code points) and output physical bytes. Logic codes can now be represented in any of several available output formats.

Internally, Perl uses the UTF-8 format, but there are many other output encoding formats for Unicode; Perl can work with these formats as well. The use encoding directive tells Perl what encoding the script itself is written in and what encoding should be used for standard file manipulators. The use open directive specifies the default output encoding for all manipulators. The encoding format for a particular file manipulator is specified using special arguments of the open and binmode functions. The -C command-line switch specifies the encoding for all (or only for standard) manipulators, as well as for the program arguments themselves. The PERLIO environment variables, PERL_ENCODING, and PERL_UNICODE provide Perl with additional information related to this topic.

1.1. Working with substrings

Problem

It is not necessary to obtain or modify an entire line, but only a part of it. For example, you have read a record with a fixed structure and now want to extract individual fields from it.

Decision

The substr function is designed to read and write individual parts of a string:

$value = substr($string, $offset, $count);
$value = substr($string, $offset);

substr($string, $offset, $count) = $newstring;
substr($string, $offset, $count, $newstring); # Equivalent to previous line substr($string, $offset) = $newtail;

The unpack function is restricted to read-only access, but it is faster when extracting multiple substrings:

# Get a 5-byte string, skip 3 bytes,
# then extract two 8-byte strings, then everything else
# (Note: only works with ASCII data, not Unicode)
($leading, $s1, $s2, $trailing) =
     unpack("A5 x3 A8 A8 A*", $data); # Division into groups of five bytes @fivers = unpack("A5" x (length($string)/5), $string); # Divide a string into single-byte characters @chars = unpack("A1" x length($string), $string);

Comment

Perl strings are among the underlying data types; they are not arrays that contain elements of base types. This means that To work with individual characters or substrings instead of indexing, as in other programming languages, Perl uses functions such as unpack or substr.

The second substr argument determines the beginning of the substring you are interested in; positive values are counted from the beginning of the line, and negative values are counted from the end. If the offset is 0, the substring starts at the beginning. The third argument specifies the length of the substring.


$string = "This is what you have"; # +012345678901234567890 Forward indexing (left to right) # 109876543210987654321- Reverse indexing (left to right) 0 corresponds to 10, 20, etc.
$first = substr($string, 0, 1); # "T"
$start = substr($string, 5, 2); # "is"
$rest = substr($string, 13); # "you have"
$last = substr($string, -1); #"e"
$end = substr($string, -4); # "have"
$piece = substr($string, -8, 3); # "you"

However, the substr function allows you not only to view parts of a string, but also to modify them. The fact is that substr belongs to the exotic category of left-handed functions, that is, those to which a value can be assigned when called. The same family includes the vec, pos, and keys functions (and with some imagination, the local, my, and our functions can also be thought of as left-handed).

$string = "This is what you have";
print $string;
This is what you have
substr($string, 5, 2) = "wasn't"; # replace "is" with "wasn't"
This wasn't what you have
substr($string, -12) = "ondrous"; # "This wasn't wondrous"
This wasn't wondrous
substr($string, 0, 1) = ""; # Remove first character
he wasn't wondrous
substr($string, -10) = ""; # Delete the last 10 characters
his wasn't

By using the =~ operator in conjunction with the s///, m//, or tr/// operators, you can make them work only with a specific part of the string:

# Checking for substrings against a pattern
if (substr($string, -10) =~ /pattern/) {
     print "Pattern matches in last 10 characters\n";
}

# Substitute "at" for "is" in the first five characters of the string
substr($string, 0, 5) =~ s/is/at/g;

What's more, substrings can even be swapped using multiple substrings on each side of the assignment:

# Swap the first and last characters of a string
$a = "make a hat";
(substr($a,0,1), substr($a,-1)) = (substr($a,-1), substr($a,0,1));
print $a;
take a ham

Although the unpack function is not left-handed, it is much faster than substr, especially when extracting multiple quantities at the same time. The structure of the retrieved record is determined by a special format string in which the lowercase character "x" with a number passes a specified number of bytes in the forward direction, and the uppercase "X" character in the opposite direction. The "@" symbol moves to the specified offset in bytes within the record. If you are working with Unicode string data, be careful when using these three specifiers: they only work at the byte level, and performing byte operations in multibyte encodings is risky at best.

# Extracting a substring with the unpack function
$a = "To be or not to be";
$b = unpack("x6 A6", $a); # Skip 6 characters, read 6 characters
print $b;
or not

($b, $c) = unpack("x6 A2 X5 A2", $a); # Forward 6, read 2;
                                       # back 5, read 2
print "$b\n$c\n";
or
be

Sometimes the line is "cut" into pieces in certain positions. Suppose you want to set the separation positions before the characters 8, 14, 20, 26 and 30 - in each of the listed columns a new field begins. In principle, the format string unpack is calculated simply - "A7 A6 A6 A4 A *", but the Perl programmer is lazy by nature and does not want to waste time. Let Perl work for him. Use the following cut2fmt function:

sub cut2fmt {
    my(@positions) = @_;
    my $template   = '';
    my $lastpos    = 1;
    foreach $place(positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

$fmt = cut2fmt(8, 14, 20, 26, 30);
print "$fmt\n";
A7 A6 A6 A6 A4 A*

The capabilities of the unpack function go far beyond conventional word processing. It can also be used to convert between text and binary data.

This recipe assumes that all characters are represented in 7- or 8-bit encoding, otherwise byte unpack operations will not work as expected.

Cm. likewise

Describe the unpack and substr functions in perlfunc(1); cut2fmt procedure from recipe 1.24. The use of unpack for binary data is demonstrated in recipe 8.24.

1.2. Choosing a Default Value

Problem

You want to assign a default value to a scalar variable, but only if it has not been set explicitly in the program. Quite often, the standard value of a variable is required to be hard-coded in a program, but it can be overridden from the command line or environment variable.

Decision

Use the || operator or || =, which works with both strings and numbers:

# Use $b if $b is true, $c otherwise
$a = $b || $c;

# Set $x to $y, but only if
# if $x is not true
$x ||= $y;

If the variable can take values of 0, "0", and "", use the defined function:

# Use $b if $b is defined, $c otherwise
$a = defined($b) ? $b : $c;

# "New" defined-or operator from a future version of Perl
use 5.9;
$a = $b // $c;

Comment

The main difference between these two solutions (defined and ||) is, first of all, in what exactly is checked - certainty or truth. In the world of Perl, three specific values are false: 0, "0", and "". If your variable contains one of these values, but you don't want to change it, || will not work - you have to perform clumsy checks with defined. It is often convenient to organize the program so that the truth or falsity of the variables, rather than their certainty, is taken into account.

Unlike other languages, where return values are limited to 0 and 1, in Perl the operator || has a more interesting property: it returns the first (left) operand if it has a true value; otherwise, the second operand returns. The && operator behaves similarly (for the second expression), but this fact is used less frequently. For these operators, it is immaterial whether their operands are strings, numbers, or references; any scalar value will do. They simply return the first operand, which makes the entire expression true or false. Perhaps this is at odds with the return value in the sense of Boolean algebra, but such operators are more convenient to use.

 

This allows you to set a default value for a variable, function, or longer expression if the first operand does not fit. The following is an example of using a || in which the $foo is assigned either $bar or, if the value of the $bar false, the string "DEFAULT VALUE":

$foo = $bar || "DEFAULT VALUE"

In another example, the variable $dir is assigned either the first argument of the program command line or "/tmp" if no argument is specified:

$dir = shift(@ARGV) || "/tmp"

The same can be done without changing the @ARGV:

$dir = $ARGV[0] || "/tmp"

If 0 is a valid value $ARGV[0], use || you can't, because a perfectly normal value would be interpreted as false. You have to turn to the ternary operator of the choice:

$dir = defined($ARGV[0]) ? shift(@ARGV) : "/tmp";

The same can be written in another way, with slightly modified semantics:

$dir = @ARGV ? $ARGV[0] : "/tmp"

We check the number of items in the @ARGV. In the condition of the choice operator (?:) @ARGV is interpreted in a scalar context. The value will be false only if there are zero elements, in which case "/tmp" will be used. In other situations (that is, with the specified arguments), the first command-line argument will be assigned to the variable.

The following line increments the %count value, using the $shell value as the key, and if it is false, using the string "/bin/sh":

$count{ $shell || "/bin/sh" }++;

Several alternatives can be combined in a single condition, as the following example shows. The result of an expression is determined by the first operand that has a true meaning.

# Determine the username on a Unix system
$user = $ENV
      || $ENV
      || getlogin()
      || (getwuid($<))[0]
      || "Unknown uid number $<";

Operator && works similarly; he returns his first operand if that operand is false. Otherwise, the second operand returns. Because false values are of interest much less frequently than true values, this property is not used as often. Some possible uses are demonstrated in recipes 13.12 and 14.19.

|| assignment operator = looks weird, but works exactly like the rest of the assignment operators. For almost all Perl binary operators, $VAR OP= VALUE stands for $VAR = $VAR OP VALUE; for example, $a += $b is the same as $a = $a + $b. Therefore, the operator || = can be used to assign a variable whose value is interpreted as false. Since || performs a simple logical check (true or false), it has no problem with undefined values, even if warnings are enabled. The following example || = assigns the variable $starting_point to the string "Greenwich" if the value of the variable has not been set before. It is assumed that the $starting_point does not accept values of 0 or "0", and if it does, such values should be replaced.

$starting_point ||= "Greenwich"

In || assignment operators cannot be replaced with the or operator because or has too low a priority. The expression $a = $b or $c is equivalent ($a = $b) or $c. In this case, the $b variable is always assigned $a, and this is not at all what you wanted.

Don't try to spread this curious application || and || = from scalar values to arrays and hashes. You won't succeed because the left operand is interpreted in a scalar context. Instead, you have to do something like this:

@a = @b unless @a; # Copy if array is empty
@a = @b ? @b : @c; # Assign @b if it's not empty, otherwise @c

It is expected that someday New //, //= and err operators will be supported in Perl. This may have already happened by the time you read this book. The new operators will work similarly to the || operator, but instead of being true, they will test the certainty of variables, so the following pairs will become equivalent:


$a = defined($b) ? $b : $c;
$a = $b // $c;

$x = defined($x) ? $x : $y;
$x //= $y;

defined(read(FH, $buf, $count)  or  die "read failed: $!";
read(FH, $buf, $count)          err die "read failed: $!";

These three operators are already included in Perl version 5.9. Like all odd-numbered versions, version 5.9 is experimental, so it is not recommended for use in a real-world environment. Apparently, operators will remain in the stable version 5.10 and will certainly be supported in version 6, the release date of which remains uncertain.

Cm. likewise

description of the || operator in perlop(1); describes the defined and exists functions in perlfunc(1).

1.3. Rearranging values without using temporary variables

Problem

You want to change the values of two scalar variables, but you don't want to use a temporary variable.

Decision

Use the assignment from the list:

 ($VAR1, $VAR2) = ($VAR2, $VAR1);

Comment

In most programming languages, permutation of the values of two variables requires intermediate assignment:


$temp = $a;
$a    = $b;
$b    = $temp;

That's not the case at Perl. The language monitors both sides of the assignment and ensures that no value is accidentally erased. This allows you to get rid of temporary variables:

$a = "alpha";
$b = "omega";
($a, $b) = ($b, $a); # First becomes last - and vice versa

In a similar way, you can swap several variables at once:

($alpha, $beta, $production) = qw(January March August);
# beta is moved to alpha,
# production - in beta,
# alpha - in production
($alpha, $beta, $production) = ($beta, $production, $alpha);

The values of the variables $alpha, $beta, and $production after this fragment is completed will be "March", "August", and "January", respectively.

Cm. likewise

"List value constructors" section perlop(1).

1.4. Conversion between characters and ASCII codes

Problem

You want to output a code that corresponds to a certain character in ASCII encoding, or vice versa - a character by ASCII code.

Decision

Use the ord function to convert a symbol to a numeric code, or use a function to convert a numeric code to a symbol:chr

$num  = ord($char);
$char = chr($num);

The %c format in the printf and sprintf functions also converts a number to a character:

$char = sprintf("%c", $num); # Slower than chr($num)
printf("Number %d is character %c\n", $num, $num);
Number 101 is character e

The C* pattern used in the pack and unpack functions allows you to quickly convert multiple 8-bit characters; for Unicode characters, use the U* pattern:


@bytes  = unpack("C*", $string);
@string = pack("C*", @bytes);

$unistr = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
@unichars = unpack("U*", $unistr);

Comment

Unlike low-level, untyped languages like assembler, Perl does not consider characters and numbers to be equivalent; strings and numbers are considered equivalent. This means that you cannot arbitrarily assign a numeric representation of a symbol to it, or vice versa. To convert between characters and their numeric values in Perl, there are chr and ord functions taken from Pascal:

$value = ord("e"); # Now 101
$character = chr(101); # Now "e"

The character is actually a single-length string, so you can simply output it by the print function or by using the %s format of the printf and sprintf functions. The %c format causes printf or sprintf to convert a number to a character, but it does not output a character that is already stored in a character format (that is, as a string).

printf("Number %d is character %c\n", 101, 101);

The pack, unpack, chr, and ord functions are faster than sprintf. The following are examples of practical applications of pack and unpack:

@ascii_character_numbers = unpack("C*", "sample");
print "@ascii_character_numbers\n";
115 97 109 112 108 101

$word = pack("C*", ascii_character_numbers);
$word = pack("C*", 115, 97, 109, 112, 108, 101); # Same
print "$word\n"
sample

Here's how to turn HAL into IBM:

$hal = "hal";
@byte = unpack("C*", $hal);
foreach $val (@byte) {
     $val++; # Increments each ASCII code by 1
}
$ibm = pack("C*", @byte);
print "$ibm\n"; # Displays "IBM"

For single-byte character data (such as the good old ASCII encoding or any sets of the ISO 8859 family), the ord function returns numbers from 0 to 255. This range corresponds to the C unsigned char data type.

However, Perl is not limited to this: it has integrated support for universal Unicode encoding. If you call chr, sprintf "%c", or pack "U*" and pass values greater than 255, the result will be a Unicode string.

The following is an analogue of the previous Unicode snippet:


@unicode_points = unpack("U*", "fac\xade");
print "@unicode_points\n";
102 97 99 807 97 100 101

$word = pack("U*", @unicode_points);
print "$word\n";
faзade

If you only want to output character codes, you might not even have to use unpack. The Perl printf and sprintf functions have a modifier v that works as follows:

printf "%vd\n", "fac\xade";
102.97.99.807.97.100.101

printf "%vx\n", "fac\xade";
66.61.63.327.61.64.65

Functions output numeric codes of all characters in a string (in Unicode terminology , "code points"), separated by periods.

Cm. likewise

Describes the chr, ord, printf, sprintf, pack, and unpack functions in perlfunc(1).

1.5. Using Unicode Named Characters

Problem

It is required to denote non-standard characters in the program by name, without fussing with their numeric codes.

Decision

Include the use charnames directive at the beginning of the file, and then freely include the "\N" service sequences in the string literals.

Comment

The use charnames directive allows you to use symbolic names for Unicode characters. Names are compile-time constants that are accessed using service sequences of the form \N. A number of subdirects are supported for the use charnames directive. The :full subdirective gives access to the entire range of character names, but you will have to write them down completely and exactly as they are stored in the Unicode character database (in particular, names must be written in uppercase). The :short subdirective allows you to use convenient abbreviations. An imported name without the ":" prefix is treated as an alphabet name, which makes it possible to use case-sensitive abbreviated character names for the specified alphabet.

use charnames ':full';
print "\N{GREEK CAPITAL LETTER DELTA} is called delta.\n";

? is called delta.

use charnames ':short';
print "\N is an upper-case delta.\n";

? is an upper-case delta.

use charnames qw(cyrillic greek);
print "\N and \N are Greek sigmas.\n";
print "\N and \N are Cyrillic bes.\n";

? and ? are greek sigmas.
Б and б are Cyrillic bes.

The charnames::viacode and charnames::vianame functions perform conversions between numeric code points and long names. In the Unicode documentation, the character with the code point XXXX is denoted as U+XXXX, so we'll also use this symbol when displaying data in the following example:


use charnames qw(:full);
for $code (0xC4, 0x394) {
    printf "Character U+%04X (%s) is named %s\n",
        $code, chr($code), charnames::viacode($code);
}

Character U+00C4 (Д) is named LATIN CAPITAL LETTER A WITH DIAERESIS
Character U+0394 (?) is named GREEK CAPITAL LETTER DELTA

use charnames qw(:full);
$name = "MUSIC SHARP SIGN";
$code = charnames::vianame($name);
printf "%s is character U+%04X (%s)\n",
    $name, $code, chr($code);

MUSIC SHARP SIGN is character U+266F ( # )

The name of the Unicode character database copy in Perl is defined as follows:

% perl -MConfig -le 'print "$Config/unicore/NamesList.txt"'
/usr/local/lib/perl5/5.8.1/unicore/NamesList.txt

From this file, you can find out the available character names.

Cm. likewise

Charnames(3); Unicode character database at http://www.unicode.org.

1.6. Character-by-character processing of strings

Problem

You want to process a string one character at a time.

Decision

Use the split function with a blank template to split a string into separate characters, or the unpack function if you only need character codes:

@array = split(//, $string); # List contains individual characters
@array = unpack("U*", $string); # List contains code points (numbers)

You can also sequentially highlight the next character in the loop:

while (/(.)/g) { # Character . does not match the newline character
    # Variable $1 contains the symbol, ord($1) - the numeric code of the symbol.
}

Comment

As mentioned above, the basic unit of text in Perl is a string, not a character. The need for character-by-character processing of strings arises quite rarely. Typically, such tasks are easier to solve with the help of high-level Perl operations (for example, template search). For an example, see recipe 7.14, which uses lookups to find command-line arguments.

If you call split with a pattern that matches an empty string, the function returns a list of the individual characters in the string. With intentional use, this feature is convenient, but you can also encounter it by chance. For example, /X*/ matches any string, including an empty string. It is possible that you will meet other unintentional coincidences.

The following example displays the characters in the string "an apple a day", sorted in ascending order:

%seen = ();
$string = "an apple a day";
foreach $char (split //, $string) {
    $seen++;
}
print "unique chars are: ", sort(keys %seen), "\n";
unique chars are: adelnpy

Solutions with split and unpack functions provide an array of characters with which you can perform further operations. If you don't need an array, use a pattern search in a while loop with the /g flag that extracts one character from the string:

%seen = ();
$string = "an apple a day";
while ($string =~ /(.)/g) {
    $seen++;
}
print "unique chars are: ", sort(keys %seen), "\n";
unique chars are: adelnpy

Typically, character-by-character string processing is not the optimal solution. Sometimes instead of using index/substr or split/unpack, it's easier to use a template. In the following example, the 32-bit checksum is calculated manually, but it is better to entrust the work of the unpack function - it will do the same much more efficiently.

The following example calculates the checksum of $string symbols in a foreach loop. The above algorithm is not optimal; it's just that we use a traditional and relatively easy to calculate amount. If you need more advanced checksum calculations, use the standard Digest::MD5 module.

$sum = 0;
foreach $byteval (unpack("C*", $string)) {
     $sum += $byteval;
}
print "sum is $sum\n";
# For the string "an apple a day" the sum is 1248

The following option does the same thing, but much faster:

$sum = unpack("%32C*", $string);

This allows you to emulate the SysV checksum calculation program:

#!/usr/bin/perl
# sum - calculate 16-bit checksum of all input files
$checksum = 0;
while (<>) { $checksum += unpack("%16C*", $_) }
$checksum %= (2 ** 16) - 1;
print "$checksum\n";

In fact, it looks like this:

% perl sum /etc/termcap
1510

If you have the GNU version of sum installed, you must call it with the -sysv option to get an identical answer for the same file:

% sum -sysv /etc/termcap
1510 851 /etc/termcap

Example 1.1 shows another tiny program that also implements character-by-character processing of input data. The idea is that the output of each character is accompanied by a small pause – the text is displayed in front of readers in slow motion to make it easier to read.

Example 1.1. slowcat
#!/usr/bin/perl
# slowcat - slow output
# usage: slowcat [-DELAY] [files...],
# where DELAY - delay
$DELAY = ($ARGV[0] =~ /^-([.\d]+)/) ? (shift, $1) : 1;
$| = 1;
while (<>) {
     for (split(//)) {
         print;
         select(undef,undef,undef, 0.005 * $DELAY);
     }
}

Cm. likewise

Describes the split and unpack functions in perlfunc(1); the use of select for delay is explained in recipe 3.10.

1.7. Reverse permutation of words or symbols

Problem

You want to reverse the order of the characters or words in the string.

Decision

To rearrange bytes, use the reverse function in the scalar context:

$revchars = reverse($string); 

To rearrange words, use reverse in a list context with split and join functions:

$revwords = join(" ", reverse split(" ", $string);

Comment

The reverse function has two uses: In a scalar context, the function combines arguments and returns the resulting string in reverse order. In a list context, the function returns its arguments in reverse order. If you are using the reverse function to rearrange characters in a non-obvious situation, use the scalar function to force the scalar context.

$gnirts = reverse($string); # Permutation of characters $string

$sdrow = reverse(@words); # Swapping @words elements

$confused = reverse(@words); # Rearrange letters in join("", @words)

Consider an example of reverse permutation of words in a string. Calling the split function with the " " pattern is a special case: it forces split to use adjacent whitespaces as a separator and to discard the initial empty fields (similar to awk). Typically, split drops only the final empty fields.


# Reverse permutation of words
$string = 'Yoda said, "can you see this?"'; @allwords = split(" ", $string); @revwords = join(" ", reverse @allwords); print $revwords, "\n"; this?" see you "can said, Yoda

The temporary array of @allwords can be removed and everything can be done in one line:


$revwords = join(" ", reverse split(" ", $string);

Adjacent $string gaps become one gap in the $revwords. To save your existing gaps, do the following:

$revwords = join("", reverse split (/(\s+)/, $string));

Using the reverse function, you can check whether the word is a palindrome (that is, whether it is read equally in both directions):


$word = "reviver";
$is_palindrome = ($word eq reverse($word));

The program for searching for long palindromes in the /usr/dict/words file is written in one line:


% perl -nle 'print if $_ eq reverse && length >5' /usr/dict/words
 deedeed
 degged
 deified
 denned
 hallah
 kakkak
 murdrum
 redder
 repaper
 retter
 reviver
 rotator
 sooloos
 tebbet
 terret
 tut-tut

Cm. likewise

Describes the split, reverse, and scalar functions in perlfunc(1); Recipe 1.8.

1.8. Interpreting Combined Unicode Characters as Single Characters

Problem

The Unicode string contains combined characters. You want to interpret each of the combined sequences as a single logical symbol.

Decision

Process the string using a regular expression with the metacharacter \X:

$string = "fac\xade"; # facade
$string =~ "fa.ade"; # failure
$string =~ "fa\Xade"; # Coincidence

@chars = split(//, $string); # 7 letters in @chars
@chars = $string =~ /(.)/g; # Same
@chars = $string =~ /(\X)/g; # 6 "letters" in @chars

Comment

In Unicode, the base characters can be combined with one or more zero-width characters following it (usually all sorts of diacritics: accents, saddles, tildes, etc.). Mainly to support older character systems, there are two options for writing characters.

For example, the word "fazade" can be written so that between the two letters "a" there is one character "\x" from the Latin1 encoding (ISO 8859-1). Perhaps in the UTF-8 encoding used in the inner workings of Perl, these two characters are represented by a double-byte sequence, but these two bytes are still interpreted as a separate character.

However, there is another way to record. The symbol U+00E7 can be represented by two code points: the usual letter "c" followed by "\x". The code point U+0327 corresponds to the combination symbol of zero width, which means that there must be a cedile below the previous base symbol.

Sometimes it is necessary for Perl to interpret each combined symbol as one logical symbol. But because the combined character is represented by multiple code points, Perl character operations (including the substr and length functions, as well as the regular expression metacharacters /./ and /[^abc]/) interpret the zero-width combination characters as stand-alone characters.

In a regular expression, the \X metacharacter coincides with the sequence that defines the combined Unicode character. It is exactly equivalent to the construction (?:\ PM\pM*), or in an extended entry:

(?x: # Start of non-preserving group
       \PM # Single character without property M (sign)
                        # (for example, a letter);
            \pM # single character with property M (sign)
                        # (for example, accent),
            * # which can be repeated any number of times
)

Without the \X metacharacter, the presence of these nasty combinations in a string thoroughly confuses even the simplest operations. Consider an example of reverse permutation of word characters from the previous recipe. In the combined notation, the words "annje" and "niсo" are represented in Perl as ."anne\xe" и "nin\xo"


for $word ("anne\xe", "nin\xo") {
    printf "%s simple reversed to %s\n", $word,
        scalar reverse $word;
    printf "%s better reversed to %s\n", $word,
        join("", reverse $word =~ /\X/g);
}

The result looks like this:

anne simple reversed to enna
anne better reversed to enna
none simple reversed to xnin
nico better reversed to ocin

In primitive permutations of the first and third lines, the diacritic jumped from one base character to another. The fact is that the Combination character always follows its base symbol, and we have rearranged all the characters in the string. By capturing the entire sequence from the base symbol and all the combination characters following it, we get rid of this problem when the list items are rearranged later.

Cm. likewise

Perlre(1) and perluniintro(1); Recipe 1.9.

1.9. Casting strings with combined Unicode characters to canonical form

Problem

Two strings look the same when inferred, but when equality is checked, they are considered different, and sometimes even have different lengths. How do you get Perl to think these lines are the same?

Decision

If at least some of the strings being compared contain combined Unicode characters, then the comparison should use the results of the NFD()module function processing these strings: Unicode::Normalize


use Unicode::Normalize;
$s1 = "fa\xade";
$s2 = "fac\xade";
if (NFD($s1) eq NFD($s2)) { print "Yup!\n" }

Comment

The same characters can be defined in different ways in some cases. Sometimes this happens when using older encodings - for example, letters with diacritics from the Latin1 encoding. Such letters are specified either directly as a separate symbol (for example, U+00E7, the lowercase Latin letter "c" with a saddle) or indirectly, as a combination of the base symbol (U+0063, lowercase Latin letter "c") with a combination character (U+0327, cedile).

Another option is if the base symbol is followed by two or more characters, which can follow in different order. Let's say you want to use the symbol "c" with a saddle and a crown to display the symbol . Such a symbol can be defined in several ways:

$string = v231.780;
# Latin lowercase letter C with cedilla
# Combination crown
$string = v99.807.780;
# Latin lowercase letter C
# Combination crown
# Combination cedilla

$string = v99.780.807;
# Latin lowercase letter C
# Combination cedilla
# Combination crown

Normalization functions bring these options into a single order. There are several such functions, including the NFD() function to perform canonical decomposition and the NFC() function to perform canonical decomposition followed by canonical composition. Whichever of the three options is chosen to define a character, NFD always returns v99.807.780 and NFC returns v321.780.

Sometimes it is more convenient to use the functions NFKD() and NFKC(), similar to the previous functions, but unlike them performing a compatible decomposition, followed by a canonical composition in the case of NFKC(). For example, \x defines the ligature !!!?!!!. The NFD and NFC forms return the same string "\x", but the NFKD and NFKC forms return a string of two characters "\x\x".

Cm. likewise

The "Universal Character Encoding" section at the beginning of this chapter; Unicode documentation Unicode module::Normalize; Recipe 8.20.

1.10. Interpreting a Unicode String as a Sequence of Octets

Problem

You want to interpret a Unicode string in Perl as a sequence of octets (for example, to calculate its length or in the I/O context).

Decision

The use bytes directive causes all Perl operations in its lexical scope to interpret a string as a group of octets. Use it when calling Perl character functions:

$ff = "\x"; # Ligature ff
$chars = length($ff); # Length is one character
{
   use bytes; # Enforce byte semantics
   $octets = length($ff); # Length is two octets
}
$chars = length($ff); # Return to symbolic semantics

There is another solution: the Encode module allows you to convert a Unicode string to an octet string and vice versa. Use it if the code with symbolic semantics is not in the lexical scope:

use Encode qw(encode_utf8);

subsomefunc; # Defined elsewhere

$ff = "\x"; # Ligature ff
$ff_oct = encode_utf8($ff); # Convert to octets

$chars = somefunc($ff); # The function works with a character string
$octets = somefunc($ff_oct); # The function works with a string of octets

Comment

As explained in the introduction of this chapter, Perl distinguishes between two varieties of strings: strings consisting of simple uninterpreted octets, and strings consisting of Unicode characters, in which the UTF-8 representation may require more than one octet. A flag is associated with each specific string, identifying it as an octet string or a UTF-8 string. Perl string functions (such as length) and I/O check flag status and apply symbolic or octet semantics based on the validation result.

Sometimes you have to work with bytes instead of characters. For example, in many protocols, there is a Content-Length header that defines the size of the message body in octets. Simply calculating the size of length is not suitable - if the string for which length is called is marked as a UTF-8 string, you will get the size in characters.

The use bytes directive forces all Perl functions in the lexical scope to use octet semantics instead of character semantics in string operations. Under the influence of this directive, length always returns the length of a string in octets, and the read function always returns the number of octets read. However, the use bytes directive has lexical visibility, so it cannot be used to affect the operation of code in another scope (for example, a function written by someone else).

In this case, you will have to create a copy of the UTF-8 string transcoded into octets. Of course, in memory, both rows will contain the same sequence of bytes. The only difference is that the UTF-8 flag is dropped in the octet copy of the string. Functions that work with an octet copy will always use object semantics, regardless of where they are.

The no bytes directive enforces the use of character semantics, and the decode_utf8 function converts an octet string to a UTF-8 string. However, in practice they are used less frequently, because not every octet string is a valid UTF-8 string, whereas all UTF-8 strings are valid octet strings.

Cm. likewise

Documentation on the use bytes directive; Documentation for the standard Encode module.

1.11. Expanding and Compressing Tab Characters

Problem

You want to replace the tab characters in the string with the appropriate number of spaces, or vice versa. Replacing spaces with tabs reduces the size of files that contain many contiguous spaces. You might want to convert tab characters to spaces when displaying to devices that do not accept tab characters or assume that they are at different positions.

Decision

Apply a very strange looking substitution:

while ($string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e) {
     # Run an empty loop until
     # as long as the substitution condition is met
}

You can also use the standard Text::Tabs module:


use Text::Tabs;
@expanded_lines  = expand(@lines_with_tabs);
@tabulated_lines = unexpand(@lines_without_tabs);

Comment

If tab stops are followed every N characters (where N is usually 8), they are easy to convert to spaces. The standard, "book" method does not use the Text::Tabs module, but it is not easy to understand it. In addition, it uses the variable $', the mere mention of which slows down the search for a pattern in the program. The reason is explained in the "Special Variables" section of Chapter 6. The following algorithm replaces each tab character in the input with eight spaces:


while (<>) {
    1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    print;
}

To avoid $', you can use a more complex solution in which parts of the match are stored in named variables. The following loop replaces one tab character with four spaces instead of eight:


1 while s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e;

Another technique is based on the direct use of offsets from the @+ and @- arrays. The following example also expands the tab character to four spaces:


1 while s/\t+/' ' x (($+[0] - $-[0]) * 4 - $-[0] % 4)/e;

You look at all these 1 while loops and can't understand why they couldn't be written as a simple s///g construct? Because we have to recalculate the length from the beginning of the string each time, not from the last match.

The construct 1 while CONDITION is equivalent to while {}, but is more compact. It appeared in those days when the first design worked in Perl incomparably faster than the second. Although now the second option is almost not inferior in speed, the convenient first option has become a habit.

The standard Text::Tabs module contains transformation functions in both directions and exports a variable $tabstop that specifies the number of spaces per tab character. In addition, the use of the module does not lead to a decrease in performance, because instead of $& and $' the variables $1 and $2 are used:


use Text::Tabs;
$tabstop = 4;
while (<>) { print expand($_) }

The Text::Tabs module can also be used to "roll up" tabs. The following example uses a default $tabstop value of 8:


use Text::Tabs;
while (<>) { print unexpand($_) }

Cm. likewise

Text::Tabs module manual page; description of the s/// operator in perlre(1) and perlop(1).

1.12. Extending variables in the input data

Problem

The program reads a line inside which there is a reference to the variable:

You owe $debt to me.

You want to replace the name of the variable $debt in the string with its current value.

Decision

If all variables are global, use wildcard with symbolic links:

$text =~ s/\$(\w+)/$/g;

But if lexical (my) variables can occur among the variables, you should use /ee:

$text =~ s/(\$\w+)/$1/gee;

Comment

The first method actually boils down to this: we look for something similar to the name of a variable, and then interpolate its value through symbolic dereferencing. If $1 contains the string somevar, then $ will be equal to the contents of the $somevar. This option would not work with the current use strict 'refs' directive because it prohibits symbolic dereferencing.

Here's an example:

our($rows $cols);
no strict 'refs'; # for the $/g below
my $text;

($rows, $cols) = (24, 80);
$text = q(I am $rows high and $cols long); # As a single-quoted string!
$text =~ s/\$(\w+)/$/g;
print $text;
I am 24 high and 80 long

You may have seen the /e lookup modifier used to evaluate a replacement expression instead of a string. Let's say you need to double each integer in a line:

$text = "I am 17 years old";
$text =~ s/(\d+)/2 * $1/eg;

Before running the program, when Perl encounters /e when it is substituted, it compiles the replacement expression code along with the rest of the program, long before the actual substitution. When you perform a lookup, $1 is replaced with the found string. In our example, the following expression will be calculated:

2 * 17

But if you try to run the following snippet:

$text = 'I am $AGE years old';  # Pay attention to the apostrophes!
$text =~ s/(\$\w+)/$1/eg; # НЕВЕРНО

provided that the $text contains the name of the $AGE variable, Perl will obediently replace $1 with $AGE and calculate the following expression:

'$AGE'

This brings us back to the original line. To get the value of a variable, you must calculate the result again. To do this, another modifier /e is added to the line:

$text =~ s/(\$\w+)/$1/eeg;      # Finds variables my()

Yes, the number of /e modifiers can be anything. Only the first modifier is compiled with the program and checked for correct syntax. As a result, it works similarly to the eval construct, although it does not catch exceptions. Perhaps it is better to draw an analogy with do .

The rest of the /e modifiers behave differently and are more like the eval "STRING" construct. They are not compiled until the program is executed. The small advantage of this scheme is that you do not have to insert the no strict 'refs' directive into the block. There is another, huge advantage: this mechanism allows you to find lexical variables created with my - symbolic references are not capable of this.

In the following example, the /x modifier allows skips and comments in the lookup template, and the /e modifier evaluates the right-handed expression programmatically. The /e modifier allows you to better control the handling of errors or other emergency situations:

# Expand variables in $text. If the variable is not defined,
# insert an error message.
$text =~ s{
      \$ # Find dollar sign
     (\w+) # Find "word" and store it in $1
}{
     no strict 'refs'; # For $$1
     if (defined $) {
         $; # Expand only global variables
     } else {
         "[NO VARIABLE: \$$1]"; # Error message
     }
}egx;

In time immemorial, the expression $$1 in strings denoted $1, which is the $$ variable followed by 1. This interpretation was accepted for the convenience of extending the $$ variable as a process identifier (PID) when naming temporary files. Right now, $$1 always means $, which is dereferencing the contents of the $1 variable. In the above program, the specified entry is used only for clarity, since the program already works correctly.

Cm. likewise

Description of the s/// operator in perlre(1) and perlop(1); description of the eval function in perlfunc(1). A similar use of substitutions is found in recipe 20.9.

1.13. Register Conversion

Problem

A string with uppercase characters must be converted to lowercase or vice versa.

Decision

Use the lc and uc functions or the \L and \U modifiers:


$big = uc($little);         # "bo peep" -> "BO PEEP"
$little = lc($big);         # "JOHN"    -> "john"
$big = "\U$little";         # "bo peep" -> "BO PEEP"
$little = "\L$big";         # "JOHN"    -> "john"

To replace an individual character, use the lcfirst and ucfirst functions or the \l and \u modifiers:


$big = "\u$little";         # "bo"      -> "Bo"
$little = "\l$big";         # "BoPeep"  -> "boPeep"

Comment

Functions and modifiers look different, but do the same thing. You can change the case of both the first character and the whole string. You can even combine both solutions and convert the first character to uppercase (or rather, make it uppercase - see Comment) and all other characters to lowercase.

$beast = "dromedary";
# Change the case of different characters $beast
$capit = ucfirst($beast); # Dromedary
$capit = "\u\L$beast"; # (too)
$capall = uc($beast); # DROMEDARY
$capall = "\U$beast"; # (too)
$caprest = lcfirst(uc($beast)); # dROMEDARY
$caprest = "\l\U$beast"; # (too)

Typically, modifiers provide a uniform style for applying the case in the string:


# Make the first character of each word capitalized,
# and the rest of the characters to lowercase
$text = "tHIS is a loNG liNE"; $text =~ s/(w+)/\u\L$1/g; print $text; This Is A Long Line

They can also be used to compare case-insensitive strings:

if (uc($a) eq uc($b)) { print "a and b are the same\n"; }

The randcap program in Example 1.2 randomly converts approximately 20 percent of the characters entered to uppercase. Using it, you will be able to freely communicate with 14-year-old kRutym HaTskeri:

Example 1.2. randcap
#!/usr/bin/perl -p
# randcap: a filter that randomly
# uppercase 20% of characters
# As of version 5.4, calling srand() is optional.
BEGIN { srand(time() ^ ($$ + ($$ << 15))) }
sub randcase { rand(100) < 20 ? "\u$_[0]" : "\l$_[0]" }
s/(\w)/randcase($1)/ge;

% randcap < genesis | head -9
boOk 01 genesis

001:001 in the BEginning goD created the heaven and tHe earTH.

001:002 and the earth wAS without ForM, aND void; And darkneSS was
         upon The Face of the dEEp. an the spIrit of GOd movEd upOn
         tHe face of the Waters.

001:003 and god Said, let there be ligHt: and therE wAs LigHt.

Some languages distinguish between uppercase characters and titlecase characters. In such cases, the ucfirst() function (and its modifier counterpart\u) converts characters to capital characters. For example, in Hungarian there is a sequence "dz". In the upper case, it is written as "DZ", in the uppercase - "Dz", and in the lower - "dz". Accordingly, Unicode provides three different characters for these three cases:

Code pointRecordName
01F1DZLATIN CAPITAL LETTER DZ
01F2DzLATIN CAPITAL LETTER D WITH SMALL LETTER Z
01F3dzLATIN SMALL LETTER DZ

Register transformations with constructs like tr[a-z][A-Z] or something like that are tempting, but this is not recommended. This decision is erroneous, since all characters with umlauts, saddles and other diacritic elements found in many languages (including English) fall out of it. However, the task of correctly displaying the case in character data with diacritics is generally much more difficult than it seems at first glance. There is no simple solution, but if all the data is stored in Unicode, things are not so bad, because the register functions of Perl work flawlessly with Unicode. For more information, see the "Universal Character Encoding" section in the introduction to this chapter.

Cm. likewise

Describes the uc, lc, ucfirst, and lcfirst functions in perlfunc(1); description of modifiers \L, \U, \l and \u in the section "Quote and Quote-like Operators" perlop(1).

1.14. Capitalization of headings

Problem

There is a line with the title of the article, the title of the book, etc. It is required to correctly place the capital letters in it.

Decision

Use a variation of the tc() function:


INIT {
    our %nocap;
    for(qw(
            a an the
            and but or
            as at but by for from in into of off on onto per to with
        ))
    {
        $nocap++;
    }
}

sub tc {
    local $_ = shift;

    # Start with a lowercase letter if the word is in the list,
    # otherwise capitalize.
    s/(\pL[\pL']*)/$nocap ? lc($1) : ucfirst(lc($1))/ge;

    s/^(\pL[\pL']*) /\u\L$1/x; # Always last word
                               # starts with a capital letter
    s/ (\pL[\pL']*)$/\u\L$1/x; # Always last word
                               # starts with a capital letter

    # The part in parentheses is interpreted as the full name
    s/\( (\pL[\pL']*) /(\u\L$1/x;
    s/(\pL[\pL']*) \) /\u\L$1)/x;

    # First word after colon or semicolon
    # starts with a capital letter
    s/ ( [:;] \s+ ) (\pL[\pL']* ) /$1\u\L$2/x;

    return $_;
}

Comment

The rules for placing capital letters in English titles and titles are more complicated than it may seem at first glance. If they were to simply replace the first letter of each word, then the problem would be solved by substituting the form

s/(\w+\S*\w*)/\u\L$1/g;

Most style guides recommend starting with uppercase letters for the first and last word in the title, as well as all other words except articles, the "to" particle in the infinitive, conjunctive conjunctions, and prepositions.

The following example demonstrates the distinctive features of capital symbols (it uses the tc() function from the above Solution):

@data = (
            "the enchantress of \xur mountain",
    "meeting the enchantress of \xur mountain",
    "the lord of the rings: the fellowship of the ring",
);

$mask = "%-20s: %s\n";

sub tc_lame {
    local $_ = shift;
    s/(\w+\S*\w*)/\u\L$1/g;
    return $_;
}

for $datum (@data) {
    printf $mask, "ALL CAPITALS", uc($datum);
    printf $mask, "no capitals", lc($datum);
    printf $mask, "simple titlecase", tc_lame($datum);
    printf $mask, "better titlecase", tc($datum);
    print "\n";
}

ALL CAPITALS : THE ENCHANTRESS OF DZUR MOUNTAIN
no capitals : the enchantress of dzur mountain
simple titlecase : The Enchantress Of Dzur Mountain
better titlecase : The Enchantress of Dzur Mountain

ALL CAPITALS : MEETING THE ENCHANTRESS OF DZUR MOUNTAIN
no capitals : meeting the enchantress of dzur mountain
simple titlecase : Meeting The Enchantress Of Dzur Mountain
better titlecase : Meeting the Enchantress of Dzur Mountain

ALL CAPITALS : THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING
no capitals : the lord of the rings: the fellowship of the ring
simple titlecase : The Lord Of The Rings: The Fellowship Of The Ring
better titlecase : The Lord of the Rings: The Fellowship of the Ring

It is also worth considering that some style guides recommend starting with capital letters of prepositions longer than three, four, and in some cases - five characters. So, according to the rules of the publishing house "O'Reilly & Associates", prepositions of four or fewer characters are written with a lowercase letter. Below is an extended list of prepositions; change it as you see fit:

@all_prepositions = qw{
    about above absent across after against along amid amidst
    among amongst around as at athwart before behind below
    beneath beside besides between betwixt beyond but by circa
    down during ere except for from in into near of off on onto
    out over past per since than through till to toward towards
    under until unto up upon versus via with within without
};

But this solution is not ideal, because it does not distinguish words related to several parts of speech. Some prepositions on this list are no different from words that are always capitalized – subordinate conjunctions, adverbs, and even adjectives. For example, "Down by the Riverside", but "Getting By on Just $30 a Day"; "A Ringing in My Ears", but "Bringing In the Sheaves".

Another circumstance that should also be considered is the possible use of u and ucfirst without forcing the string to be converted into lowercase. In this case, words already written in capital letters (for example, acronyms) will not change their spelling. Probably the abbreviations "FBI" and "LBJ" should not be converted to "Fbi" and "Lbj".

Cm. likewise

Describes the uc, lc, ucfirst, and lcfirst functions in perlfunc(1); description of modifiers \L, \U, \l and \u in the section "Quote and Quote-like Operators" perlop(1).

1.15. Interpolation of functions and expressions in strings

Problem

You want to interpolate a function call or expression contained in a string. Compared to interpolating simple scalar variables, this will allow you to construct more complex patterns.

Decision

The expression can be split into separate fragments and concatenate:

$answer = $var1 . func(). $var2;  # Только для скалярных величин

You can also use a few non-obvious extensions @{[LIST EXPR]} or ${\(SCALAR EXPR)}:

$answer = "STRING @{[ LIST EXPR ]} MORE STRING";
$answer = "STRING ${\( SCALAR EXPR )} MORE STRING";

Comment

The following excerpt demonstrates both options. In the first line, concatenation is performed, and in the second - focus with extension:

$phrase = "I have " . ($n + 1) . " guanacos.";
$phrase = "I have ${\($n + 1)} guanacos.";

In the first variant, the result string is constructed by concatenation of smaller strings; thus, we achieve the desired result without interpolation. The print function actually performs concatenation for the entire list of arguments, and if you're going to call print $phrase, you could just write:

print "I have ", $n + 1, " guanacos.\n";

If interpolation is absolutely inevitable, you will have to use the second option, replete with punctuation marks. Only the @, $, and \ characters have special meaning in quotation marks and in reverse apostrophes. As in the case of m// and s///, the synonym qx() does not obey the rules of extension for quotation marks if apostrophes are used as a limiter! In the expression $home = qx'echo home is $HOME'; the variable $HOME will be taken from the command interpreter, not from Perl! So, the only way to achieve the expansion of arbitrary expressions is to extend ${} or @{}, in whose blocks there are references.

In the example

$phrase = "I have ${\( count_em() )} guanacos.";

the call to the function in parentheses is not in a scalar context, but in a list context. The following construct overrides the context:

$phrase = "I have ${\( scalar count_em() )} guanacos.";

However, you can do more than just assign an interpolation value to a variable. We are dealing with a universal mechanism that can be used with any strings enclosed in quotation marks. So, in the following example, we construct a string with an interpolated expression and pass the result of the function:

some_func("What you want is @{[ split /:/, $rec ]} items");

Interpolation can also be performed in embedded documents:


die "Couldn't send mail" unless send_mail(<<"EOTEXT", $target);
To: $naughty
From: Your Bank
Cc: @{ get_manager_list($naughty) }
Date: @{[ do { my $now = `date`; chomp $now; $now} ]} (today)

Dear $naughty,

Today, you bounced check number @{[ 500 + int rand(100) ]} to us.
Your account is now closed.

Sincerely,
the management

EOTEXT

Extending strings in reverse apostrophes ('') turns out to be a particularly creative task, as it is often accompanied by the appearance of false line feed characters. By creating a block in curly braces for @ in the delimiting of an anonymous array @, as you did in the last example, you can create private variables.

All of these techniques work, but simply dividing a task into several steps or storing all the data in temporary variables is almost always more understandable to the reader.

The Interpolation module from the CPAN archive allows you to solve this problem using a more pleasant syntax. For example, the following snippet calculates the key and returns the value from the hash %E:


use Interpolation E => 'eval';
print "You bounced check number $E{500 + int rand(100)}\n";

In another example, the function you specified is called for the hash %money:


use Interpolation money => \&currency_commify;
print "That will be $money{ 4 * $payment }, right now.\n";

The result will look something like this:


That will be $3,232.421.04, right now.

Cm. likewise

perlref(1); Interpolation module from CPAN.

1.16. Indentation in Embedded Documents

Problem

When using the mechanism for defining long lines (embedded documents), the text should be aligned along the left margin; it's awkward in the program. You want to indent the text of the document in the program, but exclude indentation from the final content of the document.

Decision

Use the s/// operator to cut off the initial passes:

# All at once
($var = << HERE_TARGET) =~ s/^\s+//gm;
     followed by
     your text
HERE_TARGET

# Or two steps
$var = << HERE_TARGET;
     followed by
     your text
HERE_TARGET
$var =~ s/^\s+//gm;

Comment

Substitution is very straightforward: it removes the initial omissions from the text of the embedded document. The /m modifier allows the ^ metacharacter to match at the beginning of each logical line in the document, and the /g modifier causes the lookup engine to repeat the lookup at the maximum frequency (that is, for each line in the embedded document).


($definition = <<'FINIS') =~s/^\s+//gm;
    The five variations of camelids
    are the familiar camel, his frieds
    the llama and the alpaca, and the
    rather less well-known guanaco
    and vicuсa.
FINIS

Note that all templates in this recipe use the \s metacharacter, which denotes a single skip character, which can also represent a line feed character. As a result, all empty lines will be removed from the embedded document. If you do not want this, replace in the \s templates with [^\S\n].

The substitution uses the fact that the assignment result can be used on the left side =~. It becomes possible to do everything in one line, but it works only when assigning a variable. When used directly, the embedded document is interpreted as an immutable object, and you will not be able to modify it. Furthermore, the contents of an embedded document cannot be changed without first saving it to a variable.

However, there is no cause for concern. There is a simple workaround, especially useful if you perform this operation frequently. It is enough to write a small procedure:


sub-fix {
     my $string = shift;
     $string =~ s/^\s+//gm;
     return $string;
}

printfix(<<"END");
     Our Document
END

# If the function was declared beforehand, the parentheses can be omitted:
printfix<<"END";
     Our Document
END

As with all inline documents, the document end marker (END in our example) must be left-aligned. If you want to indent it as well, you will have to add the appropriate number of gaps to the document:


($quote = <<'    FINIS') =~s/^\s+//gm;
        ...we will have peace, when you and all you works have
        perished--and the works of your dark master to whom you would
        deliver us. You are a liar, Saruman, and a corrupter of men's
        hearts.  --Theoden in /usr/src/perl/taint.c

    FINIS
$quote =~ s/\s+--/\n--;  # Move to a separate line

If you perform this operation on documents that contain eval code or just output text, you do not want to bulk delete all initial omissions because it will destroy indentation in the text. Of course, it doesn't matter to eval, but not to readers.

We come to the next improvement - special prefixes for lines that must be indented. For example, in the following example, each line begins with @@@ and the desired indentation:


if ($REMEMBER_THE_MAIN) {
    $perl_main_C = dequote<<'    MAIN_INTERPRETER_LOOP';
        @@@ int
        @@@ runops() {
        @@@     SAVEI32(runlevel);
        @@@     runlevel++;
        @@@     while ( op = (*op->op_ppaddr)() ) ;
        @@@     TAINT_NOT;
        @@@     return 0;
        @@@ }
    MAIN_INTERPRETER_LOOP
    # Add additional code if desired
}

When destroying indentations, problems with poems also arise.


sub dequote;
$poem = dequote<<ever_on_and_on; now="" far="" ahead="" the="" road="" has="" gone,="" and="" i="" must="" follow,="" if="" can,="" pursuing="" it="" with="" eager="" feet,="" until="" joins="" some="" larger="" way="" where="" may="" paths="" errands="" meet.="" whither="" then?="" cannot="" say.="" --bilbo="" in="" usr="" src="" perl="" pp_ctl.c="" ever_on_and_on="" print="" "here's="" your="" poem:\n\n$poem\n";="" <p="">Результат будет выглядеть так:
Here's your poem:

Now far ahead the Road has gone,
   And I must follow, if I can,
Pursuing it with eager feet,
   Until it joins some larger way
Where may paths and errands meet.
   And whither then? I cannot say.
         --Bilbo in /usr/src/perl/pp_ctl.c
</ever_on_and_on;>

The following dequote function copes with all the problems described. When called, an embedded document is passed to it as an argument. The function checks whether each line begins with a common substring (prefix), and if so, removes this substring. Otherwise, it takes the initial gap from the first line and removes it from all subsequent lines.

 sub dequote {
     local $_ = shift;
     my($white, $leader); # space and prefix common to all strings
     if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
         ($white, $leader) = ($2, quotemeta($1));
     } else {
         ($white, $leader_ = (/^(\s+)/, '');
     }
     s/^\s*?$leader(?:$white)?//gm;
     return $_;
}

If at the sight of such templates your head goes around, you can always break them into several lines and add comments using the /x modifier:

if (m{
        ^ # start of line
        \s * # 0 or more whitespace characters
        (?: # start of first non-preserving group
            ( # start saving $1
               [^\w\s] # one byte - not a space or a word character
               + # 1 or more
            ) # finish saving $1
            ( \s* ) # put 0 or more spaces into buffer $2
            .* \n # search until the end of the first line
        ) # end of first grouping
        (?: # start of second non-save group
           \s * # 0 or more whitespace characters
           \1 # string destined for $1
           \2 ? # what will be in $2, but additionally
           .* \n # search until the end of the line
        ) + # repeat idea with groups 1 or more times
        $ # until the end of the line
     }x
   }
{
    ($white, $leader) = ($2, quotemeta($1));
} else {
    ($white, $leader) = (/^(\s+)/, '');
}
s{
     ^ # start of each line (due to /m)
     \s * # any number of leading spaces
        ? # with minimal match
     $leader # saved prefix
     (?: # start a non-preserving group
        $white # same amount
     ) ? # if the prefix is ​​followed by the end of the line
}{}xgm;

Well, hasn't it become clearer? Probably not. There is no point in enriching the program with banal comments that simply duplicate the code. Perhaps you have one of these cases.

Cm. likewise

"Scalar Value Constructors" perldata(1); description of the s/// operator in perlre(1) and perlop(1).

1.17. Reformatting paragraphs

Problem

The length of the text does not allow it to be placed on one line. You want to split it into several lines without wrapping words. For example, a style check script reads a text file one paragraph at a time and replaces failed turns more accurately. Replacing "applies functionality" with "uses" changes the number of characters in lines, so you'll have to reformat it before a paragraph can be displayed.

Decision

Use the standard Text::Wrap module to place line breaks in the right places:


use Text::Wrap;
@output = wrap($leadtab, $nexttab, @para);

You can use the more intelligent Text::Autoformat module from CPAN:


use Text::Autoformat;
$formatted = autoformat $rawtext;

Comment

The Text::Wrap module has a wrap function (see example 1.3) that retrieves a list of strings and reformats them into a paragraph with a line length of no more than $Text::Wrap::columns characters. We set the variable $columns to 20; this ensures that no line is longer than 20 characters. Two arguments are passed before the string list of the wrap function; one specifies the indentation of the first line of the paragraph, and the second specifies the indentation of all subsequent lines.

Example 1.3. wrapdemo


#!/usr/bin/perl -w
# wrapdemo - demonstration of work Text::Wrap

@input = ("Folding and splicing is the work of an editor,",
          "not a mere collection of silicon",
          "and",
          "mobile electrons!");

use Text::Wrap qw($columns &wrap);
$columns = 20;
print "0123456789" x 2, "\n";
print wrap("    ", "  ", @input), "\n";

The result looks like this: 01234567890123456789 Folding and splicing is the work of an editor, not a mere collection of silicon and mobile electrons!

As a result, we get one paragraph in which every line except the last one ends with a line feed character:

# Concatenate multiple lines with text wrapping
use Text::Wrap;
undef $/;
print wrap('', '', split(/\s*\n\s*/, <>);

If the Term::ReadKey module from CPAN is installed on your system, you can use it to determine the size of the window so that the length of the lines corresponds to the current screen size. If this module does not exist, the screen size can sometimes be taken from the $ENV or determined from the output of the stty command.

The following program reformats both paragraph lines that are too short and too long, similar to fmt. To do this, the input record separator $/ is set to an empty string (so <> reads entire paragraphs), and the $\ output separator is two line feeds. The paragraph is then converted to a single long line, replacing all line feeds (along with the surrounding gaps) with single spaces. Finally, we call the wrap function with empty indentation of the first and all subsequent lines:

use Text::Wrap qw(&wrap $columns);
use Term::ReadKey qw(GetTerminalSize);
($columns) = GetTerminalSize();
($/, $\) = ('', "\n\n"); # Read paragraph by paragraph, output two newlines
while (<>) { # Read entire paragraph
     s/\s*\n\s*/ /g; # Replace intermediate newlines with spaces
     print wrap('', '', $_); # and format
}

The CPAN Text::Autoformat module is much smarter. First of all, he tries to avoid "hanging lines", that is, very short ending lines. But the great thing is that it correctly reformats paragraphs with multiple levels of citation. An example from the documentation for this module shows how a simple call to print autoformat($badparagraph) converts text:


In comp.lang.perl.misc you wrote:
: >  writes:
: > CN> PERL sux because:
: > CN>    * It doesn't have a switch statement and you have to put $
: > CN>signs in front of everything
: > CN>    * There are too many OR operators: having |, || and 'or'
: > CN>operators is confusing
: > CN>    * VB rools, yeah!!!!!!!!!
: > CN> So anyway, how can I stop reloads on a web page?
: > CN> Email replies only, thanks - I don't read this newsgroup.
: >
: > Begone, sirrah! You are a pathetic, Bill-loving, microcephalic
: > script-infant.
: Sheesh, what's with this group - ask a question, get toasted! And how
: *dare* you accuse me of Ianuphilia!

to the following view:


In comp.lang.perl.misc you wrote:
: >  writes:
: > CN> PERL sux because:
: > CN>    * It doesn't have a switch statement and you
: > CN>      have to put $ signs in front of everything
: > CN>    * There are too many OR operators: having |, ||
: > CN>      and 'or' operators is confusing
: > CN>    * VB rools, yeah!!!!!!!!! So anyway, how can I
: > CN>      stop reloads on a web page? Email replies
: > CN>      only, thanks - I don't read this newsgroup.
: >
: > Begone, sirrah! You are a pathetic, Bill-loving,
: > microcephalic script-infant.
: Sheesh, what's with this group - ask a question, get toasted!
: And how *dare* you accuse me of Ianuphilia!

Spectacular, isn't it?

The following mini-program uses this module to reformat each paragraph in the input stream:


use Text::Autoformat;
$/ = '';
while (<>) {
    print autoformat($_, {squeeze => 0, all => 1}), "\n";
}

Cm. likewise

Description of split and join functions in perlfunc(1), manual page of the standard Text::Wrap module; the Term::ReadKey module from CPAN and an example of its use in recipe 15.6, as well as the CPAN Text::Autoformat module.

1.18. Escaping Characters

Problem

Some characters of the output string (apostrophes, commas, etc.) need to be escaped, that is, converted to a special form. Suppose you construct a format string for sprintf and want to replace % characters with %% sequences.

Decision

Use a wildcard that prefixes \ or doubles each character to be converted:

# Backslash
$var =~ s/([CHARLIST])/\\$1/g;

# Doubling
$var =~ s/([CHARLIST])/$1$1/g;

Comment

In the above solutions, $var is a modifiable variable, and CHARLIST is a list of characters to be converted, which can include combinations of type \t or \n. If only one character is converted, you can do without parentheses:

$string =~ s/%/%%/g; 

The transformations performed in the following example allow you to prepare a string to pass to the command interpreter. In practice, converting the characters ' and " will not yet make the arbitrary string completely safe for the command interpreter. Properly assembling the entire list of characters is so difficult, and the risk is so great, that it is better to use the list forms system and exec to run programs (see recipe 16.2) - in this case, you avoid interacting with the interpreter altogether.


$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/\\$1/g;

Two backslashes in the substitute section were used because this section is interpreted according to the rules for quoted strings. Therefore, to get one backslash, you have to write two. Here is a similar example for VMS DCL, where all apostrophes and quotation marks are duplicated:


$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/$1$1/g;

With Microsoft's command interpreters, things are even more complicated. In Windows, COMMAND.COM works with quotation marks, but not with apostrophes; has no idea how to use inverse apostrophes to run commands, and a backslash is used to turn quotation marks into literals. However, in almost all free and commercial Unix-like command interpreters for Windows, this drawback is corrected.

Regular expressions support character classes, so you can also define an interval using -, and then invert it with the ^ metacharacter. The following command escapes all characters that are not in the interval A through Z:


$string =~ s/([^A-Z])/\\$1/g;

To convert all non-alphabetic characters, use the \Q and \E metacharacters or the quotemeta function. For example, the following commands are equivalent:


$string = "this \Qis a test!\E";
$string = "this is\\ a\\ test!";
$string = "this " . quotemeta("is a test!");

Cm. likewise

Description of the s/// operator in perlre(1) and perlop(1); the description of the quotemeta function is discussed in perlfunc(1). Recipe 19.1 covers the escaping of service characters in HTML, and recipe 19.5 talks about how to avoid escaping when passing lines to a command interpreter.

1.19. Removing Gaps at Both Ends of a Line

Problem

The resulting string can include start or end omissions. You want to remove them.

Decision

Use a couple of substitutions:


$string =~ s/^\s+//;
$string =~ s/\s+$//;

Or write a special function that returns the desired value:

$string = trim($string);
@many = trim(@many);

sub trim {
         my @out = @_;
         for (@out) {
             s/^\s+//; # Remove gaps on the left
             s/\s+$//; # Remove gaps on the right
     }
     return @out == 1
               ? $out[0] # One row returned
               : @out[0]; # Many rows returned
}

Comment

There are various solutions to this problem, but in most cases this option is the most effective. The function returns new versions of the passed strings, from which the leading and ending spaces are removed. The function works with both individual rows and lists.

To remove the last character from a string, use the chop function. Be careful not to confuse it with the similar chomp function, which removes the last part of a string in and only if it is contained in the $/ variable (the default is "\n"). It is most commonly used to remove the final line feed character from the entered text:

 


# Output the resulting text enclosed in ><
while() {
    chomp;
    print ">$_<\n";
}

The feature can be improved in several ways.

First of all, what if multiple lines are passed to the function and the context of the return value requires a single scalar value? In the form in which it is given in the Solution, the function acts rather stupidly: it returns a scalar value that represents the number of rows passed. Other options are also possible - for example, to issue a warning, or to throw an exception, or to combine the list of returned rows into one line.

If extra gaps can be not only at the ends, but also in the middle, the function can also replace the internal series of gaps with single spaces. To do this, an additional final command is placed in the loop:


s/\s+/ /g;            # Convolution of internal gaps 

A string of the form "but\t\tnot here\n" becomes "but not here". Three consecutive substitutions


s/^\s+//;
s/\s+$//;
s/\s+/ /g;

more effective to replace with a team


$_ = join(' ', split(' '));

If the function is called with no arguments at all, you can follow the example of chop and chomp and use $_ by default. After all these improvements, we get the following function:

# 1. Trim leading and trailing gaps
# 2. Collapse inner spaces into single spaces
# 3. If there are no arguments, the input data is taken from $_
# 4. When returning in a scalar context
# the list is concatenated into a scalar with spaces in between.
sub trim {
     my @out = @_ ? @_ : $_;
     $_ = join(' ', split(' ')) for @out;
     return wantarray ? @out : "@out";
}

Cm. likewise

Description of the s/// operator in perlre(1) and perlop(1); describes the chop and chomp functions in perlfunc(1). The initial omissions are removed in the getnum function from recipe 2.1.

1.20. Analyzing Comma-Separated Data

Problem

There is a data file whose fields are separated by commas. You want to read the data from the file. However, the fields may have their own commas (inside the strings or escaped). In many spreadsheets and DBMSs, comma-separated field lists are supported as the standard format for importing/exporting data.

Decision

If the contents of the data file follow the standard Unix shielding rules (that is, the internal quotation marks in the fields are escaped by a backslash: "like \"this\"", use the standard Text::P arseWords module and a simple program:


use Text::ParseWords;
sub parse_csv0 {
    return quotewords("," => 0, $_[0]);
}

If the quotation marks in the fields are doubled ("like "this"" you can use the standard procedure from Jeffrey Friedl's book Regular Expressions: Programmer's Library, 2nd Edition (Peter Publishing House, 2003):


sub parse_csv1 {
    my $text = shift; # Write with comma-separated values
    my @fields = ( );
    while ($text =~ m{
        # An arbitrary sequence of characters, except for commas and quotes:
         ( [^"',] + )
        # ...or...
          |
        # ... quoted field (double quotes are allowed within a field)

         " # Opening quote of the field (do not save)
          ( # Now the field contains either
           (?: [^"] # characters other than quotes, or
             |
               "" # adjacent quotes
            ) * # Repeat any number of times
          )
         " # Closing quote of the field (don't save)

      }gx)
      {
        if (defined $1) {
            $field = $1;
        } else {
            ($field = $2) =~ s/""/"/g;
        }
        push @fields, $field;
      }
      return @fields;
}

You can also use the CPAN Text::CSV module:


use Text::CSV;
sub parse_csv1 {
    my $line = shift;
    my $csv = Text::CSV->new( );
    return $csv->parse($line) && $csv->fields( );
}

Or the CPAN Tie::CSV_File module:


tie @data, "Tie::CSV_File", "data.csv";

for ($i = 0; $i < @data; $i++) {
    printf "Row %d (Line %d) is %s\n", $i, $i+1, "@";
    for ($j = 0; $j < @; $j++) {
        print "Column $j is <$data[$i][$j]>\n";
    }
}

Comment

Entering data separated by commas is an insidious and difficult task. At first glance, everything is simple, but in reality you have to take into account the rather complex possibilities of shielding, since the fields themselves can contain internal commas. As a result, the search by pattern is very complicated, and it is better not to think about a simple call to the split /,/ function. To make matters worse, Unix standard files and legacy systems use different shielding rules. Because of this, it is impossible to develop a single algorithm for all CSV data files.

The standard Text::P arseWords module is designed to process data according to the standards used in most Unix data files. Due to this, it is extremely convenient for parsing all kinds of Unix system files, in which fields are separated by colons - disktab(5), gettytab(5), printcap(5) and termcap(5). The quotewords function of this module is passed two arguments and a row of separated data. The first argument defines a delimiter character (in our case, a comma, but a colon is often used), and the second is a Boolean flag that indicates whether strings should be returned along with the quotation marks in which they are enclosed.

In such data files, quotation marks inside the fields are escaped by a backslash: "like \"this\". Quotation marks, apostrophes, and a backslash are the only characters for which this prefix has a special meaning. All other instances of \ remain in the resulting row. To work with such data, the quotewords function of the standard Text::P arseWords module is enough.

However, this solution is not suitable for data files from legacy systems where internal quotation marks are doubled: "like "this"". In such cases, it is necessary to resort to other solutions. The first is based on a regular expression in the second edition of Jeffrey Friedl's regular expressions: The Programmer's Library. Its advantage should be considered that the solution works in any system without installing additional modules that are not included in the standard delivery. In fact, it does not require any modules at all. Nevertheless, despite the abundance of comments, this decision causes a slight shock to the unprepared reader.

The object-oriented CPAN Text::CSV module involved in the following solution hides the complexities of parsing in more convenient "wrappers". The Tie::CSV module from CPAN offers an even more elegant solution: you are working with an object that looks like a two-dimensional array. The first index represents the rows of the file, and the second index represents its columns.

Let's look at examples of practical use of our functions parse_csv. Here, q() is just a clever substitute for quotation marks so that we don't have to put \ characters everywhere.


$line = q(XYZZY,"","O'Reilly, Inc","Wall, Larry","a \"glug\" bit,",5,"Error, Core Dumped");
@fields = parse_csv0($line);
for ($i = 0;$i < @fields; $i++) {
    print "$i : $fields[$i]\n";
}
0 : XYZZY
1 :
2 : O'Reilly, Inc
3 : Wall, Larry
4 : a "glug" bit,
5 : 5
6 : Error, Core Dumped

If the second quotewords argument were 1 instead of 0, the quotation marks would be retained and the result would be:


0 : XYZZY
1 : ""
2 : "O'Reilly, Inc"
3 : "Wall, Larry"
4 : "a \"glug\" bit,"
5 : 5
6 : "Error, Core Dumped"

Another kind of data files is handled in exactly the same way, but parse_csv1 is used instead of parse_csv0. Note the doubling of quotation marks instead of escaping with a prefix:


$line = q(Ten Thousand,10000, 2710 ,,"10,000",
"It's ""10 Grand"", baby",10K);
@fields = parse_csv1($line);
for ($i = 0; $i < @fields; $i++) {
    print "$i : $fields[$i]\n";
}
0 : Ten Thousand
1 : 10000
2 : 2710
3 :
4 : 10,000
5 : It's "10 Grand", baby
6 : 10K

Cm. likewise

Describes the syntax of regular expressions in perlre(1); Documentation for the standard Text::P arseWords module See "Parsing Comma-Separated Data" in Chapter 5 of Regular Expressions: Programmer's Library, 2nd Edition.

1.21. Constants

Problem

You want to create a variable whose value cannot change after the initial assignment.

Decision

If the value does not have to be a scalar variable that can be interpolated, you can get by with the use constant directive:


use constant AVOGADRO => 6.02252e23;

printf "You need %g of those for guac\n", AVOGADRO;

If you need a variable, assign the glob type a reference to a literal string or number, and then use a scalar variable:


*AVOGADRO = \6.02252e23;
print "You need $AVOGADRO of those for guac\n";

But the most reliable method is based on the use of a small tie class with an exception in the STORE method:


package Tie::Constvar;
use Carp;
sub TIESCALAR {
    my ($class, $initval) = @_;
    my $var = $initval;
    return bless \$var => $class;
}
sub FETCH {
    my $selfref = shift;
    return $$selfref;
}
sub STORE {
    confess "Meddle not with the constants of the universe";
}

Comment

The use constant directive is easiest to use, but it has a number of disadvantages, the biggest of which is that it does not create a normal variable to interpolate in quoted strings. Another drawback is the lack of scope; the directive includes the procedure with the given name in the package's namespace.

In fact, the use constant directive creates a procedure with the given name, which is called without arguments and always returns the same value (or several values ​​in the form of a list). This means that the procedure is in the namespace of the current package and is not scoped. The same procedure can be defined independently in the program:


sub AVOGADRO() { 6.02252e23 }

To make the scope of a constant limited to the current block, you can create a temporary procedure, for which the anonymous procedure is assigned to a glob type with the desired name:


use subs qw(AVOGADRO);
local *AVOGADRO = sub () { 6.0225e23 };

Admittedly, such tricks look rather mysterious. If you refuse to use the directive, comment your code.

If instead of a procedure reference, you give the glob type a reference to a const scalar, then you can use a variable with the appropriate name. This is the basis of the second technique given in the Decision. It has its drawbacks: typeglobs are only available for package variables, not for lexical ones created with the my keyword. There will be problems with undeclared package variables under the recommended use strict directive, but a variable can be declared using our:


our $AVOGADRO;
local *AVOGADRO = \6.02252e23; 

The third solution - creating a small tie class - may seem the most confusing, but it provides the most flexibility. In addition, if desired, a constant can be declared as a lexical variable:


tie my $AVOGARO, Tie::Constvar, 6.02252e23; 

After that, you can safely use constructions of the form


print "You need $AVOGADRO of those for guac\n";

Any attempt to modify the constant will be rejected:


$AVOGADRO = 6.6256e-34;   # Ничего не выйдет

See also

Recipes 1.15 and 5.3; some ideas can also be gleaned from the CPAN Tie::Scalar::RestrictUpdates module.

1.22. Comparison of words with similar sound

 

Problem

There are two English surnames. You want to know if they sound similar (regardless of spelling). This will allow you to perform an "informal search" in the phone book, which will show other similar names along with Smith, such as Smythe, Smite and Smote.

Solution

Use the standard Text::Soundex module:


use Text::Soundex;
$CODE = soundex($STRING);
@CODES = soundex(@LIST); 

You can also use the CPAN Text::Metaphone module:


use Text::Metaphone;
$phoned_words = Metaphone('Schwern'); 

 

Comment



The Soundex algorithm hashes words (especially English surnames) in a small space using a simple model that mimics English pronunciation. Roughly speaking, each word is reduced to a four-character string. The first character is an uppercase letter and the rest are numbers. By comparing the values ​​for two strings, you can determine if they sound similar.

The following program prompts you for a name and searches the password file for names that sound similar. A similar approach can be used for name databases, so the database can be indexed by Soundex keys if desired. Of course, such an index will not be unique.


use Text::Soundex;
use User::pwent;
print "Lookup user: ";
chomp($user = );
exit unless defined $user;
$name_code = soundex($user);

while($uent = getpwent()) {
    ($firstname, $lastname) = $uent->gecos =~ /(\w+)[^,]*\b(\w+)/;

    if ($name_code eq soundex($uent->name) ||
        $name_code eq soundex($lastname)   ||
        $name_code eq soundex($firstname)  )
    {
        printf "%s: %s %s\n", $uent->name, $firstname, $lastname;
    }
}

The Text::Metaphone module from the CPAN archive solves the same problem in a different, more reasonable way. The soundex function returns a code of a letter and three digits to start the input string, and the Metaphone function returns the code as a sequence of variable-length letters. Example:

 soundexmetaphone
ChristiansenC623KRSXNSN
Kris JensonK625KRSJNSN
 
Kyrie EleisonK642KRLSN
Curious LiaisonC624KRSLSN

To fully implement the features of Metaphone, you should also use the String::Approx module from CPAN, described in more detail in recipe 6.13. This module allows you to find a successful match even if there are individual errors in the strings. The number of changes required to move from one line to another is called the spacing between those lines. The following command checks for strings separated by a distance of 2 or less:

if (amatch("string1", [2], "string2") { }

There is also an adist function that returns the spacing between strings. For example, the distance between "Kris Jenson" and "Christiansen" is 6, whereas the distance between their Metaphone codes is only 1. The distance between the components of the other pair in the original version is 8, and when comparing the Metaphone codes, it is again reduced to 1:


use Text::Metaphone qw(Metaphone);
use String::Approx qw(amatch);

if (amatch(Metaphone($s1), [1], Metaphone($s1)) {
    print "Close enough!\n";
}

This snippet will find successful matches for both pairs in the example.

Cm. likewise

Documentation for the standard Text::Soundex and User::p went modules; CPAN Modules Text::Metaphone and String::Approx; your system's passwd(5) manual page; volume 3, chapter 6 "The Art of Programming".

1.23. Program: fixstyle

Imagine a table with pairs of obsolete and new words: bonnet

Old wordsNew words
Hood
rubberEraser
lorryTruck
trousersPants

The program in Example 1.4 is a filter that replaces all the words in the text from the first column with the corresponding elements of the second column.

When called without file arguments, the program performs the functions of a simple filter. If file names are passed on the command line, the program writes changes to them, and previous versions are saved in files with *.orig extensions (see recipe 7.16). If you have the -v command-line option, messages about all changes are written to STDERR.

The table of source/substitute pairs is stored in the main program, starting with the marker __END__ (see recipe 7.12). The pairs are converted to substitutions (with character escapes) and accumulated in a variable $code just as it is done in the popgrep2 program from recipe 6.10.

The -t option displays a message that indicates that keyboard input is pending when no other arguments are present. If the user forgot to enter the file name, he will immediately understand what the program expects.


Example 1.4. fix style
#!/usr/bin/perl -w
# fixstyle - replacing section lines with paired lines
# usage: $0 [-v] [files...]
use strict;
my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);
if (@ARGV) {
     $^I = ".orig"; # Keep old files
} else {
     warn "$0: Reading from stdin\n" if -t STDIN;
}
my $code = "while (<>) {\n";
# Read data and build code for eval
while() {
     chomp;
     my ($in, $out) = split /\s*=>\s*/;
     next unless $in && $out;
     $code .= "sg";
     $code .= "&& printf STDERR qq($in => $out at \$ARGV line \$.\\n)"
                                                         if $verbose;
     $code .= ";\n";
} $code .= "print;\n}\n"; eval "{ $code } 1" || die; __END__ analysed => analyzed built-in => builtin chastized => chastised commandline => command-line de-allocate => deallocate dropin => drop-in hardcode => hard-code meta-data => metadata multicharacter => multi-character multiway => multi-way non-empty => nonempty non-profit => nonprofit non-trappable => nontrappable pre-define => predefine preextend => pre-extend re-compiling => recompiling reenter => re-enter turnkey => turn-key

A small warning: the program works quickly, but not in cases where the number of replacements is measured in hundreds. The larger the DATA section, the longer it will take. A few dozen substitutions will not cause a significant slowdown. Moreover, for a small number of replacements, this version works faster than the next one. But if you run a program with several hundred replacements, it will begin to lag noticeably behind.

Example 1.5 shows a different version of the program. With a small number of replacements, it works slower, and with a large number - faster.

Example 1.5. fixstyle2
#!/usr/bin/perl -w
# fixstyle2 = similar to fixstyle for a large number of replacements
use strict;
my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);
my %change = ();
while() {
     chomp;
     my ($in, $out) = split /\s*=>\s*/;
     next unless $in && $out;
     $change = $out;
}
if (@ARGV) {
     $^I = ".orig";
} else {
     warn "$0: Reading from stdin\n" if -t STDIN;
}
while (<>) {
     my$i=0;
     s/^(\s+)// && print $1; # Issue initial pass
     for (split /(\s+)/, $_, -1) { # Preserve trailing spaces
         print( ($i++ & 1) ? $_ : ($change || $_));
     } } _ _END_ _ analysed => analyzed built-in => builtin chastized => chastised commandline => command-line de-allocate => deallocate dropin => drop-in hardcode => hard-code meta-data => metadata multicharacter => multi-character multiway => multi-way non-empty => nonempty non-profit => nonprofit non-trappable => nontrappable pre-define => predefine preextend => pre-extend re-compiling => recompiling reenter => re-enter turnkey => turn-key

In the new version of the program, each line is divided into gaps and words (a relatively slow operation). The words are then used to find a replacement in the hash, which is much faster than substitution. Consequently, the first part is slower and the second part is faster. The gain in speed depends on the number of matches.

If we didn't try to keep the number of gaps separating words, it would not be difficult to make sure that the second version is not inferior in speed to the first even with a small number of substitutions. If you are well aware of the specifics of the input data, you can replace the gaps with single spaces. To do this, the following cycle is used:


# Works very fast, but with skip rollup while (<>) 
{ for (split) { print $change || $_, " "; } print "\n"; }

An extra space appears at the end of each line. If this is not desirable, use recipe 16.5 and create an output filter. Insert the following fragment before the while loop that compresses the gaps:


my $pid = open(STDOUT, "|=");
die "cannot fork: $!" unless defined $pid;
unless ($pid) {
        while () {
        s/ $//;
        print;
    }
    exit;
}

1.24. Program: psgrep

Many programs (including ps, netstat, ls -l, find -ls, and tcpdump) often produce large amounts of data. Log files also grow rapidly in size, making them difficult to view. Such data can be processed by a grep-type filter program and only a fraction of the strings can be selected from them, but regular expressions do not agree well with complex logic - just look at the tricks that you have to indulge in in recipe 6.18.

In fact, we would like to be able to handle full queries on the output of the program or log file. Let's say you ask ps: "Show me all unprivileged processes larger than 10 KB" or "What commands work on pseudo-consoles?"

The psgrep program is able to do all of these things - and infinitely more, because in it the selection criteria are not regular expressions; they consist of full-fledged Perl code. Each criterion is applied sequentially to each output line. As a result, only data that satisfies all arguments are displayed. The following are examples of search criteria and their corresponding command lines:

    • Lines with words ending in sh:
% psgrep '/sh\b/'
    • Processes with command names ending in sh:
% psgrep 'command =~ /sh$/'
    • Processes with a user ID smaller than 10:
% psgrep 'uid < 10'
    • Interpreters with active consoles:
% psgrep 'command =~ '/^-/' 'tty ne "?"'
    • Processes running on pseudoconsols:
% psgrep 'tty =~ /^[p-t]'
    • Disconnected unprivileged processes:
% psgrep 'uid && tty eq "?"'
    • Large unprivileged processes:
% psgrep 'size > 10 * 2**10' 'uid != 0'

Below is the data retrieved from the last psgrep call on our computer. As you might expect, only netscape and its auxiliary process got into them:

FLAGS

UID

PID

PPID

PRI

NI

SIZE

RSS

WCHAN

STA

TTY

TIME

COMMAND

0

101

9751

1

0

0

14932

9652

do_select

S

p1

0:25

netscape

100000

101

9752

9751

0

0

10636

812

do_select

S

p1

0:00

(dns helper)

Example 1.6 shows the source code for psgrep.


Example 1.6. psgrep
#!/usr/bin/perl -w
# psgrep - filter ps output
# with compilation of user requests into program code
#
use strict;
# All fields from PS header
my @fieldnames = qw(FLAGS UID PID PPID PRI NICE SIZE
                    RSS WCHAN STAT TTY TIME COMMAND);
# Defining the unpacking format (in the example
# hardcoded ps format for Linux)
my $fmt = cut2fmt(8, 14, 20, 26, 30, 34, 41, 47, 59, 63, 67, 72);
my %fields; # To store data
die << Thanatos unless @ARGV;
usage: $0 criterion ...
    Each criterion is a Perl expression involving:
    @fieldnames
    All criteria must be met for a line to be printed.
Thanatos
# Create synonyms for uid, size, UID, SIZE, etc.
# Empty parentheses are required to create a prototype with no arguments
for my $name (@fieldnames) {
    no strict 'refs';
    *name = *{lc $name} = sub () { $fields };
}
my $code = "sub is_desirable { " . join(" and ", @ARGV) . "}";
unless (eval $code.1) {
    die "Error in code: $@\n\t$code\n";
}
open (PS, "ps wwaxl |") || die "cannot fork: $!";
print scaler ; # Header line
while ( {
    @fields = trim(unpack($fmt, $_));
    print if is_desirable(); # Rows matching the criteria
}
close(PS) || die "ps failed!";

# Convert cut positions to unpack format
sub cut2fmt {
    my(@positions) = @_;
    my $template = '';
    my $lastpos = 1;
    foreach $place(positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}
sub trim {
    my @strings = @_;
    for (@strings) {
        s/^\s+//;
        s/\s+$//;
    }
    return wantarray ? @strings : $strings[0];
}
# The following template was used to define cut positions.
# The following is an example input
#123456789012345678901234567890123456789012345678901234567890123456789012345
# 1 2 3 4 5 6 7
# Positions:
#8 14 20 26 30 34 41 47 59 63 67 72
# | | | | | | | | | | | |
_ _END_ _
 FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STA TTY TIME COMMAND
   100 0 1 0 0 0 760 432 do_select S ? 0:02 init
   140 0 187 1 0 0 784 452 do_select S ? 0:02 syslogd
100100 101 428 1 0 0 1436 944 do_exit S 1 0:00 /bin/login
100140 99 30217 402 0 0 1552 1008 posix_lock S ? 0:00 httpd
     0 101 593 428 0 0 1780 1260 copy_thread S 1 0:00 -tcsh
100000 101 30639 9562 17 0 924 496 R p1 0:00 ps axl
     0 101 25145 9563 0 0 2964 2360 idetape_rea S p2 0:06 trn
100100 0 10116 9564 0 0 1412 928 setup_frame T p3 0:00 ssh -C www
100100 0 26560 26554 0 0 1076 572 setup_frame T p2 0:00 less
100000 101 19058 9562 0 0 1396 900 setup_frame T p1 0:02 nvi /tmp/a

The psgrep program combines many of the techniques presented in the book. The removal of start and end gaps is described in recipe 1.19. Converting slice items to the unpack format to extract fixed-position fields is discussed in Recipe 1.1. Regular expression search is the subject of the entire chapter 6.

The multiline text transmitted by die is an embedded document (see recipes 1.15 and 1.16). Assigning @fields enters several values into the hash %fields at once. Hash slices are discussed in recipes 4.8 and 5.11.

The sample program inputs located under the __END__ are described in recipe 7.12. At the development stage, "canned" data obtained through the DATA file manipulator was used for testing. When the program worked, we transferred it to receive data from the attached ps command, but the original data was left for future porting to other platforms and maintenance. Pipeline running of other programs is discussed in Chapter 16, "Process Control and Interprocess Communications," in more detail in recipes 16.10 and 16.13.

The real power and expressiveness of psgrep is due to the fact that in Perl string arguments can be not just strings, but Perl program code. A similar technique is used in recipe 9.9, except that in psgrep the user's arguments are "packaged" in the is_desirable procedure. At the same time, strings are compiled into Perl code only once - even before starting the program whose output we are processing. For example, if you request a UID below 10, the following line will be generated:

eval "sub is_desirable { uid < 10 } " . 1; 

The mysterious .1 at the end is present so that when the user code is compiled, the eval command returns the true value. In this case, we won't even have to check $@ for compilation errors like we do in recipe 10.12.

Using arbitrary Perl code in filters to select records is an incredibly powerful feature, but it's not completely original. Perl owes much to the awk programming language, which was often used for such filtering. One of the drawbacks of awk was that it could not easily interpret the input data as fixed-length fields (instead of fields separated by special characters). Another drawback is the lack of mnemonic field names; awk used the names $1, $2, etc. In addition, Perl can do many of the things that awk is not capable of.

User criteria don't even have to be simple expressions. For example, the following call initializes a variable $id the user number nobody and then uses it in an expression:


% psgrep 'no strict "vars";
         BEGIN { $id = getpwnam("nobody") }
         uid == $id '

But how do you use the words uid, command, and size without even putting them with a $ character to represent the corresponding input fields? We directly manipulate the symbolic name table by assigning closures to implicit typeglobs that create functions with corresponding names. Function names are created using both uppercase and lowercase entries, allowing for both "UID < 10" and "uid > 10". Closures are described in recipe 11.4, and their assignment to type-globes to create synonyms of functions is described in recipe 10.14.

However, there is a nuance in psgrep that is absent in these recipes - we are talking about empty parentheses in the closure. Thanks to parentheses, the function can be used in expressions wherever a single value is allowed (for example, a string or a numeric constant). As a result, an empty prototype is created, and the function of accessing the field (for example, uid) is called without arguments, by analogy with the built-in time function. If you do not create prototypes for functions with empty lists of arguments, the expressions "uid < 10" or "size/2 > rss" will confuse the lexical analyzer - it will see in them an incomplete globe (wildcard glob) or a search pattern, respectively. Prototypes are discussed in recipe 10.11.

The shown version of psgrep receives input from the ps command in Red Hat Linux format. To adapt it for a different system, look at which columns the headers start in. This approach is not limited to the specifics of the ps or Unix system. This is a common technique for filtering input records using Perl expressions that easily adapts to a different record structure. Fields can be arranged in columns, separated by commas, or retrieved by a pattern search using retaining parentheses.

After a small change in the selection functions, the program is even suitable for working with a user database. If you have an array of records (see recipe 11.9), the user can specify an arbitrary selection criterion:

sub id() { $_-> }
subtitle() { $_-> }
sub executive { title =~/(?:vice-)?president/i }

# Selection criteria are specified when calling grep
@slowburners = grep { id < 10 && !executive } @employees;

 

For reasons related to security and performance, this approach is rarely found in the actual mechanisms described in Chapter 14, "Accessing Databases." In particular, it's not supported in SQL, but with Perl at its disposal and a bit of ingenuity, it's not hard to create your own version.