3. Regular expressions
Regular expressions are symbolic notations used to identify patterns in text. They are supported by many command-line tools and by most programming languages, where they help solve text manipulation problems.
-
We will test regular expressions with grep (the name stands for "global regular expression print"). It searches text files for text matching a specified regular expression and prints every line that contains a match to standard output.

    ls /usr/bin | grep zip

In order to explore grep, let's create some text files to search:

    ls /bin > dirlist-bin.txt
    ls /usr/bin > dirlist-usr-bin.txt
    ls /sbin > dirlist-sbin.txt
    ls /usr/sbin > dirlist-usr-sbin.txt
    ls dirlist*.txt

We can do a simple search on these files like this:

    grep bzip dirlist*.txt

If we are interested only in the list of files that contain matches, we can use the option -l:

    grep -l bzip dirlist*.txt

Conversely, if we want to see a list of files that do not contain a match, we can use -L:

    grep -L bzip dirlist*.txt
-
While it may not seem apparent, we have been using regular expressions in the searches we did so far, albeit very simple ones. The regular expression "bzip" means that a line will match if it contains the letters "b", "z", "i", "p" in this order and without other characters in between.
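To try these searches without depending on what happens to be installed on a particular system, here is a minimal sketch that builds two small listings by hand. The file names (sample-a.txt, sample-b.txt) and their contents are made up for illustration and are not part of the listings created earlier.

```shell
# Work in a throwaway directory so nothing in the current one is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hand-made listings (hypothetical contents, one name per line).
printf '%s\n' bzip2 gzip zip > sample-a.txt
printf '%s\n' cat ls > sample-b.txt

# Lines that contain the letters b, z, i, p in this order, with nothing in between.
matches=$(grep -h bzip sample-*.txt)

# Names of the files that contain at least one match, and of those that contain none.
with_match=$(grep -l bzip sample-*.txt)
without_match=$(grep -L bzip sample-*.txt)

echo "matching lines: $matches"
echo "files with a match: $with_match"
echo "files without a match: $without_match"
```

Only bzip2 contains the sequence "bzip", so sample-a.txt is reported by -l and sample-b.txt by -L.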
Besides the literal characters, which represent themselves, we can also use metacharacters in a pattern. For example, a dot (.) matches any single character:

    grep -h '.zip' dirlist*.txt

The option -h suppresses the output of filenames.

Notice that the zip program was not found: the pattern requires one character before "zip", so a match needs at least four characters, and the three-letter name zip does not qualify.
-
The caret (^) and dollar sign ($) are treated as anchors in regular expressions. This means that they cause the match to occur only if the regular expression is found at the beginning of the line (^) or at the end of the line ($):

    grep -h '^zip' dirlist*.txt
    grep -h 'zip$' dirlist*.txt
    grep -h '^zip$' dirlist*.txt

Note that the regular expression '^$' will match empty lines.
-
Using bracket expressions we can match a single character from a specified set of characters:

    grep -h '[bg]zip' dirlist*.txt

If the first character in a bracket expression is a caret (^), then any character will be matched, except for those listed:

    grep -h '[^bg]zip' dirlist*.txt

The caret character only invokes negation if it is the first character within the bracket expression; otherwise it loses its special meaning and becomes an ordinary character in the set:

    grep -h '[b^g]zip' dirlist*.txt
-
If we want to find all lines that start with an uppercase letter, we can do it like this:

    grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist*.txt

We can do less typing if we use a range:

    grep -h '^[A-Z]' dirlist*.txt

If we want to match any alphanumeric character (all the letters and digits), we can use several ranges, like this:

    grep -h '^[A-Za-z0-9]' dirlist*.txt

However, a dash (-) placed first in a bracket expression stands for itself and does not form a range. The following pattern matches lines that start with a dash, an "A", or a "Z":

    grep -h '^[-AZ]' dirlist*.txt

Besides ranges, another way to match groups of characters is using POSIX character classes:

    grep -h '^[[:alnum:]]' dirlist*.txt
    ls /usr/sbin/[[:upper:]]*

The second command shows that character classes also work in shell pathname expansion. Other character classes are: [:alpha:], [:lower:], [:digit:], [:space:], [:punct:] (for punctuation characters), etc.
-
With a vertical bar (|) we can define alternative matching patterns:

    echo "AAA" | grep AAA
    echo "BBB" | grep BBB
    echo "AAA" | grep 'AAA\|BBB'
    echo "BBB" | grep -E 'AAA|BBB'
    echo "CCC" | grep -E 'AAA|BBB'
    echo "CCC" | grep -E 'AAA|BBB|CCC'

The option -E tells grep to use extended regular expressions. With extended regular expressions the vertical bar (|) is a metacharacter (used for alternation), and we need to escape it (with \) to use it as a literal character. With basic regular expressions (without the option -E) the vertical bar is a literal character, and we need to escape it (with \) if we want to use it as a metacharacter.
-
Other metacharacters that are recognized by extended regular expressions, and which behave similarly to |, are: (, ), {, }, ?, +. For example:

    grep -Eh '^(bz|gz|zip)' dirlist*.txt

Note that this is different from:

    grep -Eh '^bz|gz|zip' dirlist*.txt

In the first example all the patterns are matched at the beginning of the line. In the second one only bz is matched at the beginning; gz and zip can match anywhere in the line.
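
The difference between the grouped and the ungrouped pattern is easier to see on input where gz and zip also occur in the middle of a name. A minimal sketch, using a made-up list of names (bzip2, gunzip, gzip, unzip, zip, tar) instead of a real directory listing:

```shell
# Hypothetical sample names, one per line.
names=$(printf '%s\n' bzip2 gunzip gzip unzip zip tar)

# Grouped: the ^ anchor applies to every alternative inside the parentheses.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Ungrouped: only "bz" is anchored; "gz" and "zip" may match anywhere in the line.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

echo "grouped:"
echo "$grouped"
echo "ungrouped:"
echo "$ungrouped"
```

With this input the grouped pattern selects bzip2, gzip and zip, while the ungrouped one also selects gunzip and unzip, because "zip" matches in the middle of those names.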