[Chapter 1] 1.7 Regular Expressions

Regular expressions (aka regexps, regexes or REs) are used by many UNIX programs, such as grep, sed and awk,[24] editors like vi and emacs, and even some of the shells. A regular expression is a way of describing a set of strings without having to list all the strings in your set.

[24] A good source of information on regular expression concepts is the Nutshell Handbook sed & awk by Dale Dougherty (O'Reilly & Associates). You might also keep an eye out for Jeffrey Friedl's forthcoming book, Mastering Regular Expressions (O'Reilly & Associates).

Regular expressions are used in several ways in Perl. First and foremost, they're used in conditionals to determine whether a string matches a particular pattern. So when you see something that looks like /foo/, you know you're looking at an ordinary pattern-matching operator.

Second, if you can locate patterns within a string, you can replace them with something else. So when you see something that looks like s/foo/bar/, you know it's asking Perl to substitute "bar" for "foo", if possible. We call that the substitution operator.

Finally, patterns can specify not only where something is, but also where it isn't. So the split operator uses a regular expression to specify where the data isn't. That is, the regular expression defines the delimiters that separate the fields of data. Our grade example has a couple of trivial examples of this. Lines 5 and 12 each split strings on the space character in order to return a list of words. But you can split on any delimiter you can specify with a regular expression.

(There are various modifiers you can use in each of these situations to do exotic things like ignore case when matching alphabetic characters, but these are the sorts of gory details that we'll cover in Chapter 2, The Gory Details.)
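To make the three uses concrete, here is a minimal sketch that applies a match, a substitution, and a split to the same string. The sample string and the variable names are invented for illustration; they don't come from the grade example.

```perl
use strict;
use warnings;

# Hypothetical sample data, invented for illustration.
my $line = "foo fighters play foo songs";

# 1. Pattern match in a conditional: is "foo" anywhere in the string?
my $matched = ($line =~ /foo/) ? 1 : 0;

# 2. Substitution on a copy: only the first "foo" found is replaced.
(my $changed = $line) =~ s/foo/bar/;

# 3. split: the pattern describes the delimiters, not the data.
my @words = split / /, $line;

print "matched: $matched\n";             # matched: 1
print "changed: $changed\n";             # changed: bar fighters play foo songs
print "words:   ", scalar(@words), "\n"; # words:   5
```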

The simplest use of regular expressions is to match a literal expression. In the case of the splits we just mentioned, we matched on a single space. But if you match on several characters in a row, they all have to match sequentially. That is, the pattern looks for a substring, much as you'd expect. Let's say we want to show all the lines of an HTML file that are links to other HTML files (as opposed to FTP links). Let's imagine we're working with HTML for the first time, and we're still a little naive. We know that these links will always have "http:" in them somewhere. We could loop through our file with this:[25]

[25] This is very similar to what the UNIX command grep 'http:' file would do. On MS-DOS you could use the find command, but it doesn't know how to do more complicated regular expressions. (However, the misnamed findstr program of Windows NT does know about regular expressions.)

while ($line = <FILE>) {
    if ($line =~ /http:/) {
        print $line;
    }
}

Here, the =~ (pattern binding operator) is telling Perl to look for a match of the regular expression http: in the variable $line. If it finds the expression, the operator returns a true value and the block (a print command) is executed. By the way, if you don't use the =~ binding operator, then Perl will search a default variable instead of $line. This default space is really just a special variable that goes by the odd name of $_. In fact, many of the operators in Perl default to using the $_ variable, so an expert Perl programmer might write the above as:

while (<FILE>) {
    print if /http:/;
}

(Hmm, another one of those statement modifiers seems to have snuck in there. Insidious little beasties.)

This stuff is pretty handy, but what if we wanted to find all the links, not just the HTTP links? We could give a list of links, like "http:", "ftp:", "mailto:", and so on. But that list could get long, and what would we do when a new kind of link was added?

while (<FILE>) {
    print if /http:/;
    print if /ftp:/;
    print if /mailto:/;
    # What next?
}

Since regular expressions are descriptive of a set of strings, we can just describe what we are looking for: a number of alphabetic characters followed by a colon. In regular expression talk (Regexpese?), that would be /[a-zA-Z]+:/, where the brackets define a character class. The a-z and A-Z represent all alphabetic characters (the dash means the range of all characters between the starting and ending character, inclusive). And the + is a special character which says "one or more of whatever was before me". It's what we call a quantifier, meaning a gizmo that says how many times something is allowed to repeat. (The slashes aren't really part of the regular expression, but rather part of the pattern match operator. The slashes are acting like quotes that just happen to contain a regular expression.)
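As a rough sketch of that pattern at work, the following loop keeps every line containing one or more letters followed by a colon. The sample lines are made up for the occasion, and an array stands in for the file so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical lines of HTML, invented for illustration.
my @lines = (
    '<a href="http://www.example.com/">Home</a>',
    '<a href="ftp://ftp.example.com/pub/">Files</a>',
    '<a href="mailto:someone@example.com">Mail</a>',
    '<p>No link on this line</p>',
);

# Keep each line that has one or more ASCII letters followed by a colon.
my @links;
for my $line (@lines) {
    push @links, $line if $line =~ /[a-zA-Z]+:/;
}

print "$_\n" for @links;    # prints the first three lines only
```

Bear in mind that this is still a naive test: the pattern is equally happy with any other run of letters before a colon, such as "Note:" in ordinary text.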

Because certain classes like the alphabetics are so commonly used, Perl defines special cases for them. See Table 1.7 for these special cases.

Table 1.7: Regular Expression Character Classes
Name	Definition	Code
Whitespace	`[ \t\n\r\f]`	`\s`
Word character	`[a-zA-Z_0-9]`	`\w`
Digit	`[0-9]`	`\d`

Note that these match single characters. A \w will match any single word character, not an entire word. (Remember that + quantifier? You can say \w+ to match a word.) Perl also provides the negation of these classes by using the uppercased character, such as \D for a non-digit character.

(We should note that \w is not always equivalent to [a-zA-Z_0-9]. Some locales define additional alphabetic characters outside the ASCII sequence, and \w respects them.)
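Here is a small sketch of these shorthand classes in use. The sample string is invented, and the parentheses around each pattern are there only to hand back the matched text (they are explained under Backreferences later in this section).

```perl
use strict;
use warnings;

my $text = "Room 42, building B_7";    # hypothetical sample string

my ($digits)    = $text =~ /(\d+)/;    # first run of digits
my ($word)      = $text =~ /(\w+)/;    # first run of word characters
my ($nondigits) = $text =~ /(\D+)/;    # first run of non-digits
my @fields      = split /\s+/, $text;  # split on runs of whitespace

print "digits:    '$digits'\n";        # digits:    '42'
print "word:      '$word'\n";          # word:      'Room'
print "nondigits: '$nondigits'\n";     # nondigits: 'Room '
print "last:      '$fields[-1]'\n";    # last:      'B_7'
```

Note that the comma stays attached to "42," in the split result, since a comma is not whitespace.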

There is one other very special character class, written with a ".", that will match any character whatsoever.[26] For example, /a./ will match any string containing an "a" that is not the last character in the string. Thus it will match "at" or "am" or even "a+", but not "a", since there's nothing after the "a" for the dot to match. Since it's searching for the pattern anywhere in the string, it'll match "oasis" and "camel", but not "sheba". It matches "caravan" on the first "a". It could match on the second "a", but it stops after it finds the first suitable match, searching from left to right.

[26] Except that it won't normally match a newline. When you think about it, a "." doesn't normally match a newline in grep(1) either.
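The examples above can be checked with a short sketch like the following. It relies on Perl's built-in $& variable, which holds the text of the last successful match, to show where "caravan" matched; the variable names are our own.

```perl
use strict;
use warnings;

# Try /a./ against each of the strings discussed above.
my %matched;
for my $s ("at", "am", "a+", "a", "oasis", "camel", "sheba") {
    $matched{$s} = ($s =~ /a./) ? "yes" : "no";
    print "$s: $matched{$s}\n";
}

# $& holds the text matched by the last successful pattern match.
my $first = "";
if ("caravan" =~ /a./) {
    $first = $&;
}
print "caravan matched on: $first\n";    # caravan matched on: ar

# By default the dot refuses to match a newline.
my $newline = ("a\n" =~ /a./) ? "yes" : "no";
print "a plus newline: $newline\n";      # a plus newline: no
```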

Quantifiers

The characters and character classes we've talked about all match single characters. We mentioned that you could match multiple "word" characters with \w+ in order to match an entire word. The + is one kind of quantifier, but there are others. (All of them are placed after the item being quantified.)

The most general form of quantifier specifies both the minimum and maximum number of times an item can match. You put the two numbers in braces, separated by a comma. For example, if you were trying to match North American phone numbers, /\d{7,11}/ would match at least seven digits, but no more than eleven digits. If you put a single number in the braces, the number specifies both the minimum and the maximum; that is, the number specifies the exact number of times the item can match. (If you think about it, all unquantified items have an implicit {1} quantifier.)

If you put the minimum and the comma but omit the maximum, then the maximum is taken to be infinity. In other words, it will match at least the minimum number of times, plus as many as it can get after that. For example, /\d{7}/ will only match a local (North American) phone number (seven digits), while /\d{7,}/ will match any phone number, even an international one (unless it happens to be shorter than seven digits). There is no special way of saying "at most" a certain number of times. Just say /.{0,5}/, for example, to find at most five arbitrary characters.

Certain combinations of minimum and maximum occur frequently, so Perl defines special quantifiers for them. We've already seen +, which is the same as {1,}, or "at least one of the preceding item". There is also *, which is the same as {0,}, or "zero or more of the preceding item", and ?, which is the same as {0,1}, or "zero or one of the preceding item" (that is, the preceding item is optional).
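The following sketch exercises these quantifiers on a few invented strings. One assumption worth stating: none of the patterns is anchored, so each one merely asks whether a suitable run occurs somewhere in the string. An unanchored /\d{7}/ is therefore also satisfied by the first seven digits of a longer number.

```perl
use strict;
use warnings;

# Hypothetical phone-like strings, invented for illustration.
my $local = "5551212";         # seven digits
my $intl  = "441632960123";    # twelve digits
my $short = "12345";           # five digits

my $local_has_7  = ($local =~ /\d{7}/)  ? 1 : 0;   # 1
my $short_has_7  = ($short =~ /\d{7}/)  ? 1 : 0;   # 0: too few digits
my $intl_has_7up = ($intl  =~ /\d{7,}/) ? 1 : 0;   # 1

# {7,11} takes as many digits as it can, but never more than eleven.
my $run = "";
$run = $& if "1234567890123" =~ /\d{7,11}/;
print length($run), "\n";                          # 11

# + is {1,}    * is {0,}    ? is {0,1}
my $plus = ("" =~ /x+/) ? 1 : 0;                   # 0: needs at least one x
my $star = ("" =~ /x*/) ? 1 : 0;                   # 1: zero x's will do
my $opt  = ("color" =~ /colou?r/ && "colour" =~ /colou?r/) ? 1 : 0;   # 1
```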

There are a couple of things about quantification that you need to be careful of. First of all, Perl quantifiers are by default greedy. This means that they will attempt to match as much as they can as long as the entire expression still matches. For example, if you are matching /\d+/ against "1234567890", it will match the entire string. This is something to especially watch out for when you are using ".", which matches any character. Often, someone will have a string like:

spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:/bin/tcsh

and try to match "spp" with /.+:/. However, since the + quantifier is greedy, this pattern will match everything up to and including the last colon, the one that follows "/home/spp". Sometimes you can avoid this by using a negated character class, that is, by saying /[^:]+:/, which says to match one or more non-colon characters (as many as possible), up to the first colon. It's that little caret in there that negates the sense of the character class.[27]

[27] Sorry, we didn't pick that notation, so don't blame us. That's just how regular expressions are customarily written in UNIX culture.

The other point to be careful about is that regular expressions will try to match as early as possible. This even takes precedence over being greedy. Since scanning happens left-to-right, this means that the pattern will match as far left as possible, even if there is some other place where it could match longer. (Regular expressions are greedy, but they aren't into delayed gratification.) For example, suppose you're using the substitution command (s///) on the default variable space (variable $_, that is), and you want to remove a string of x's from the middle of the string. If you say:

$_ = "fred xxxxxxx barney";
s/x*//;

it will have absolutely no effect. This is because the x* (meaning zero or more "x" characters) will be able to match the "nothing" at the beginning of the string, since the null string happens to be zero characters wide and there's a null string just sitting there plain as day before the "f" of "fred".[28]

[28] Even the authors get caught by this from time to time.
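Both pitfalls can be seen in a short sketch. The parentheses are there only to capture what each pattern matched. The final substitution, which uses x+ in place of x*, is our own suggested workaround rather than something prescribed above.

```perl
use strict;
use warnings;

my $entry = "spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:/bin/tcsh";

# Greedy: .+ grabs as much as it can, so the match runs to the last colon.
my ($greedy) = $entry =~ /(.+:)/;

# A negated character class cannot cross a colon, so it stops at the first.
my ($negated) = $entry =~ /([^:]+:)/;

print "$greedy\n";     # spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:
print "$negated\n";    # spp:

# Leftmost beats longest: x* matches the empty string before the "f".
my $unchanged = "fred xxxxxxx barney";
$unchanged =~ s/x*//;

# Assumed workaround: require at least one x, so that the match
# cannot succeed until it reaches the run of x's.
my $changed = "fred xxxxxxx barney";
$changed =~ s/x+//;

print "$unchanged\n";  # fred xxxxxxx barney
print "$changed\n";    # fred  barney
```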

There's one other thing you need to know. By default quantifiers apply to a single preceding character, so /bam{2}/ will match "bamm" but not "bambam". To apply a quantifier to more than one character, use parentheses. So to match "bambam", use the pattern /(bam){2}/.
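A quick sketch of the difference, using variable names of our own choosing:

```perl
use strict;
use warnings;

# {2} applies only to the "m", so the pattern means "b", "a", "m", "m".
my $single  = ("bamm"   =~ /bam{2}/)   ? 1 : 0;    # 1
my $nogroup = ("bambam" =~ /bam{2}/)   ? 1 : 0;    # 0

# Parentheses make {2} apply to the whole group "bam".
my $grouped = ("bambam" =~ /(bam){2}/) ? 1 : 0;    # 1

print "$single $nogroup $grouped\n";               # 1 0 1
```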

Minimal Matching

If you were using an ancient version of Perl and you didn't want greedy matching, you had to use a negated character class. (And really, you were still getting greedy matching of a constrained variety.)

In modern versions of Perl, you can force nongreedy, minimal matching by use of a question mark after any quantifier. Our same username match would now be /.*?:/. That .*? will now try to match as few characters as possible, rather than as many as possible, so it stops at the first colon rather than the last.
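Here is a minimal sketch comparing the greedy and minimal forms on the password entry shown earlier. The HTML fragment in the second half is invented for illustration, and the parentheses simply capture what was matched.

```perl
use strict;
use warnings;

my $entry = "spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:/bin/tcsh";

my ($minimal) = $entry =~ /(.*?:)/;    # stops at the first colon
my ($greedy)  = $entry =~ /(.*:)/;     # runs on to the last colon

print "$minimal\n";    # spp:
print "$greedy\n";     # spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:

# Hypothetical HTML fragment with two bold elements.
my $html = "<b>one</b><b>two</b>";
my ($long)  = $html =~ /<b>(.*)<\/b>/;     # one</b><b>two
my ($short) = $html =~ /<b>(.*?)<\/b>/;    # one

print "$long\n$short\n";
```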

Nailing Things Down

Whenever you try to match a pattern, it's going to try to match in every location till it finds a match. An anchor allows you to restrict where the pattern can match. Essentially, an anchor is something that matches a "nothing", but a special kind of nothing that depends on its surroundings. You could also call it a rule, or a constraint, or an assertion. Whatever you care to call it, it tries to match something of zero width, and either succeeds or fails. (If it fails, it merely means that the pattern can't match that particular way. The pattern will go on trying to match some other way, if there are any other ways to try.)

The special character string \b matches at a word boundary, which is defined as the "nothing" between a word character (\w) and a non-word character (\W), in either order. (The characters that don't exist off the beginning and end of your string are considered to be non-word characters.) For example,

/\bFred\b/

would match both "The Great Fred" and "Fred the Great", but would not match "Frederick the Great" because the "de" in "Frederick" does not contain a word boundary.
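The three cases can be verified with a sketch along these lines (the hash name is our own):

```perl
use strict;
use warnings;

# Record whether /\bFred\b/ matches each of the strings discussed above.
my %hit;
for my $s ("The Great Fred", "Fred the Great", "Frederick the Great") {
    $hit{$s} = ($s =~ /\bFred\b/) ? 1 : 0;
    print "$s: $hit{$s}\n";
}
# The Great Fred: 1
# Fred the Great: 1
# Frederick the Great: 0
```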

In a similar vein, there are also anchors for the beginning of the string and the end of the string. If it is the first character of a pattern, the caret (^) matches the "nothing" at the beginning of the string. Therefore, the pattern /^Fred/ would match "Frederick the Great" and not "The Great Fred", whereas /Fred^/ wouldn't match either. (In fact, it doesn't even make much sense.) The dollar sign ($) works like the caret, except that it matches the "nothing" at the end of the string instead of the beginning.[29]

[29] This is a bit oversimplified, since we're assuming here that your string contains only one line. ^ and $ are actually anchors for the beginnings and endings of lines rather than strings. We'll try to straighten this all out in Chapter 2, The Gory Details (to the extent that it can be straightened out).
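As a brief sketch of these two anchors on single-line strings (the variable names are invented):

```perl
use strict;
use warnings;

my $starts_yes = ("Frederick the Great" =~ /^Fred/) ? 1 : 0;    # 1
my $starts_no  = ("The Great Fred"      =~ /^Fred/) ? 1 : 0;    # 0

my $ends_yes   = ("The Great Fred"      =~ /Fred$/) ? 1 : 0;    # 1
my $ends_no    = ("Frederick the Great" =~ /Fred$/) ? 1 : 0;    # 0

# A caret after "Fred" demands the beginning of the string at a
# point where the string has already begun, so it can never match.
my $misplaced  = ("Frederick the Great" =~ /Fred^/) ? 1 : 0;    # 0

print "$starts_yes $starts_no $ends_yes $ends_no $misplaced\n"; # 1 0 1 0 0
```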

So now you can probably figure out that when we said:

next LINE if $line =~ /^#/;

we meant "Go to the next iteration of the LINE loop if this line happens to begin with a # character."
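Here is a self-contained sketch of such a loop. An array of invented lines stands in for the input file, and @kept is a hypothetical name for whatever the loop would go on to process.

```perl
use strict;
use warnings;

# Hypothetical input lines, invented for illustration.
my @input = (
    "# a comment",
    "first",
    "  # indented, so the # is not at the beginning",
    "second",
);

my @kept;
LINE: for my $line (@input) {
    next LINE if $line =~ /^#/;    # skip lines that begin with #
    push @kept, $line;
}

print scalar(@kept), " lines kept\n";    # 3 lines kept
```

The indented line survives because the caret insists that the # be the very first character.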

Backreferences

We mentioned earlier that you can use parentheses to group things for quantifiers, but you can also use parentheses to remember bits and pieces of what you matched. A pair of parentheses around a part of a regular expression causes whatever was matched by that part to be remembered for later use. It doesn't change what the part matches, so /\d+/ and /(\d+)/ will still match as many digits as possible, but in the latter case they will be remembered in a special variable to be backreferenced later.

How you refer back to the remembered part of the string depends on where you want to do it from. Within the same regular expression, you use a backslash followed by an integer. The integer corresponding to a given pair of parentheses is determined by counting left parentheses from the beginning of the pattern, starting with one. So for example, to match something similar to an HTML tag (like "<B>Bold</B>"), you might use /<(.*?)>.*?<\/\1>/. This forces the two parts of the pattern to match the exact same string, such as the "B" above.
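A small sketch shows the backreference insisting that the closing tag repeat the opening one. The mismatched second string is invented for contrast.

```perl
use strict;
use warnings;

# Matching tags: \1 must repeat whatever the parentheses matched.
my $tag = "";
if ("<B>Bold</B>" =~ /<(.*?)>.*?<\/\1>/) {
    $tag = $1;    # after the match, $1 holds the remembered text
}
print "tag: $tag\n";    # tag: B

# Mismatched tags: there is no "</B>" to satisfy the backreference.
my $mismatch = ("<B>Bold</I>" =~ /<(.*?)>.*?<\/\1>/) ? 1 : 0;
print "mismatch matched: $mismatch\n";    # mismatch matched: 0
```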

Outside the regular expression itself, such as in the replacement part of a substitution, the special variable is used as if it were a normal scalar variable named by the integer. So, if you wanted to swap the first two words of a string, for example, you could use:

s/(\S+)\s+(\S+)/$2 $1/

The right side of the substitution is really just a funny kind of double-quoted string, which is why you can interpolate variables there, including backreference variables. This is a powerful concept: interpolation (under controlled circumstances) is one of the reasons Perl is a good text-processing language. The other reason is the pattern matching, of course. Regular expressions are good for picking things apart, and interpolation is good for putting things back together again. Perhaps there's hope for Humpty Dumpty after all.
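To see both kinds of interpolation in the replacement, consider this sketch. The strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Swap the first two words; without a /g modifier, only the first
# match is replaced.
my $phrase = "hello brave new world";
(my $swapped = $phrase) =~ s/(\S+)\s+(\S+)/$2 $1/;
print "$swapped\n";     # brave hello new world

# Ordinary variables interpolate into the replacement too.
my $name = "Humpty";
(my $greeting = "Hello, NAME!") =~ s/NAME/$name/;
print "$greeting\n";    # Hello, Humpty!
```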

