Slicing and dicing data with regular expressions

Repetition, Repetition

So far, I've shown literal, positional, and two kinds of alternation operators. With these operators alone, you can match almost any pattern of a predictable length. For example, you could ensure a username started with a letter and was followed by exactly seven letters or numbers with the regex ^[a-zA-Z][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$, but that approach is a little unwieldy. Moreover, it only matches usernames of exactly eight characters.

A regular expression can also include repetition operators. A repetition operator specifies amounts, such as zero or more; one or more; zero or one; 5 to 10; and exactly 3. A repetition modifier must be combined with other patterns; the modifier has no meaning by itself. As an example, the regex ^[a-zA-Z][a-zA-Z0-9]{2,7}$ implements the username filter desired earlier: a username is a string beginning with a letter, followed by at least two but not more than seven letters or numbers, followed by the end of the string.

The location anchors are essential here. Without the two positional operators, a username of arbitrary length would erroneously be accepted. Why? Consider the regex ^[a-zA-Z][a-zA-Z0-9]{2,7}. It asks the question: "Does the string begin with a letter, followed by two to seven letters or numbers?" But it makes no mention of a terminating condition. Thus, the string samuelclemens fits the criteria, even though at 13 characters it is obviously too long to be valid. If your match must be a specific length, don't forget to include delimiters for the beginning and end of the desired pattern.
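To see the effect of the trailing anchor, here is a minimal shell sketch. It assumes "letter" means [a-zA-Z] and "letter or number" means [a-zA-Z0-9]; the function names and the sample usernames are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the WHOLE string is a valid username
# (a letter followed by two to seven letters or digits).
is_valid_username() {
    printf '%s\n' "$1" | grep -Eq '^[a-zA-Z][a-zA-Z0-9]{2,7}$'
}

# Same pattern without the trailing $: only the start of the string is checked.
is_valid_prefix_only() {
    printf '%s\n' "$1" | grep -Eq '^[a-zA-Z][a-zA-Z0-9]{2,7}'
}

# Compare both checks on a few sample (made-up) usernames.
for name in sam twain42 samuelclemens 9lives ab; do
    if is_valid_username "$name"; then anchored=accept; else anchored=reject; fi
    if is_valid_prefix_only "$name"; then loose=accept; else loose=reject; fi
    printf '%-14s anchored=%s unanchored=%s\n' "$name" "$anchored" "$loose"
done
```

Only samuelclemens is treated differently by the two checks: the unanchored pattern is satisfied by its first eight characters and ignores the rest.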

Following are some other samples:

  • {2,} finds two or more repeats. The regex ^Go{2,}gle matches Google, Gooogle, Goooogle, and so on.
  • Repetition modifiers ?, +, and * find zero or one, one or more, and zero or more repeats, respectively (e.g., ? is shorthand for {0,1}). The regex boys? matches boy or boys. The regex Goo?gle matches Gogle or Google. The regex Goo+gle matches Google, Gooogle, Goooogle, and so on. The regex Goo*gle matches Gogle, Google, Gooogle, and on and on.
  • Repetition modifiers can be applied to individual literals, as shown immediately above, and can also be applied to other, more complex combinations. Use the parentheses just as you do in mathematics to apply a modifier to a subexpression.
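The modifiers in this list can be tried side by side with a short shell sketch. The helper name matching_words and the word lists are my own, for illustration only; grep -x requires the whole word to match, which has the same effect as wrapping the pattern in ^( )$.

```shell
#!/bin/sh
# Hypothetical helper: print, on one line, every word that matches the
# extended regex given as the first argument in its entirety (-x).
matching_words() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -E -x -e "$pattern" | paste -sd ' ' -
}

# One literal repeated: the modifier applies only to the preceding "o".
for pattern in 'Goo?gle' 'Goo+gle' 'Goo*gle' 'Go{2,}gle'; do
    printf '%-10s %s\n' "$pattern" \
        "$(matching_words "$pattern" Gogle Google Gooogle Goooogle)"
done

# Parentheses make the modifier apply to the whole group "ha",
# while without them it applies to the "a" alone.
for pattern in '(ha){2,}!' 'ha{2,}!'; do
    printf '%-10s %s\n' "$pattern" \
        "$(matching_words "$pattern" 'ha!' 'haha!' 'hahaha!' 'haa!')"
done
```

The last two patterns differ only in the parentheses: (ha){2,}! accepts haha! and hahaha!, whereas ha{2,}! accepts only haa!.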

Consider the file test.txt containing lines with typos:

The rain in Spain falls mainly
on the the plain.
It was the best of of times;
it was the worst of times.

Entering the following command:

grep -i -E '(\b(of|the)\ ){2,}' test.txt

produces:

on the the plain.
It was the best of of times;

The regex operator \b matches a word boundary, or (\W\w|\w\W). The regex reads: "Two or more repetitions of the whole word 'the' or 'of', each followed by a space." You might be asking why the space is necessary: \b is the empty string at the beginning or end of a word. You have to include the character(s) between the words; otherwise, the regex fails to find a match.
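As a sketch of how you might narrow the output further, the script below recreates the sample file in a temporary location and adds two options to the same search: -n to prefix line numbers and -o to print only the matched text. It assumes GNU grep (\b and -o are not in POSIX grep), and the function names are invented for this example. Inside the quotes a plain space does the same job as the escaped one.

```shell
#!/bin/sh
# Recreate the sample file from the text in a temporary location.
sample=$(mktemp)
cat > "$sample" <<'EOF'
The rain in Spain falls mainly
on the the plain.
It was the best of of times;
it was the worst of times.
EOF

# List the lines containing a doubled "the" or "of", with line numbers (-n).
doubled_lines() {
    grep -n -i -E '(\b(of|the) ){2,}' "$1"
}

# Print only the doubled words themselves (-o), not the whole line.
doubled_words() {
    grep -o -i -E '(\b(of|the) ){2,}' "$1"
}

doubled_lines "$sample"
doubled_words "$sample"
rm -f "$sample"
```

Because the alternation is evaluated afresh on each repeat, this pattern also matches a legitimate phrase such as "of the", so treat its hits as candidates to review rather than confirmed typos.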

Capture the Needle

Finding text is a common problem, but more often than not, you want to extract a particular snippet of text once it's found. In other words, you want to keep the needle and discard the haystack.

A regular expression extracts information via capture. To isolate the text you want, surround the pattern with parentheses. Indeed, you already used parentheses to group terms, and those parentheses captured text as well, because parentheses capture automatically unless capture is disabled.

To show a capture, I'll switch to Perl (grep does not support capture because its purpose is to print lines containing a pattern). grep's regex operators are a small subset of what Perl has to offer. If you type this command

perl -n -e '/^The\s+(.*)$/ && print "$1\n"' heroes.txt

the result should be Tick. The -e option lets you run a Perl program right from the command line, and -n runs that program once on every line of the file. The regex portion of the command, the text between the slashes, says: "Match the beginning of the string, then the literals 'T', 'h', 'e', followed by one or more white space characters, \s+; then capture every character to the end of the string." The rest of the Perl program prints what was captured.

Individual Perl captures are placed in special Perl variables named $1, $2, and so on, one variable per capture described in the regex. Each set of parentheses, nested or not, is numbered by the position of its opening parenthesis, counting from the left, and its text is placed in the next numbered variable. Consider the following:

$ perl -n -e '/^(\w+)-(\w+)$/ && print "$1 $2\n"'

which yields Spider Man, Ant Man, and Spider Woman.

Capturing text of interest just scratches the surface. Once you can pinpoint material, you can surgically replace it with other material.
