4. More regex examples
-
Suppose that we are solving a crossword puzzle and we need a five-letter word whose third letter is "j" and last letter is "r". Let's try to use grep and regex to solve this.
First of all, make sure that we have a dictionary of words installed:
sudo apt install wbritish
ls /usr/share/dict/
less /usr/share/dict/words
cat /usr/share/dict/words | wc -l
Now try this:
grep -i '^..j.r$' /usr/share/dict/words
The option -i is used to ignore the case (uppercase, lowercase).
The regex pattern '^..j.r$' will match lines that contain exactly 5 characters, where the third one is j and the last one is r.
-
Let's say that we want to check a phone number for validity and we consider a phone number to be valid if it is in the form (nnn) nnn-nnnn or in the form nnn nnn-nnnn, where n is a digit. We can do it like this:
echo "(555) 123-4567" | \
grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
echo "555 123-4567" | \
grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
echo "AAA 123-4567" | \
grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
Since we are using the option -E (for extended), we have to escape the parentheses \( and \) so that they are not interpreted as metacharacters.
If we use basic regular expressions (without -E), then we don't need to escape the parentheses, but in this case we will have to escape the question marks (\?) so that they are interpreted as metacharacters (in basic regular expressions \? is a GNU extension):
echo "(555) 123-4567" | \
grep '^(\?[0-9][0-9][0-9])\? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
The question mark as a metacharacter means that the parenthesis before it can occur zero times or one time.
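One thing to keep in mind is that the patterns above treat each parenthesis as optional on its own, so an input such as "(555 123-4567" is accepted as well. The following is a minimal sketch of a stricter check that uses alternation to require either both parentheses or none. The function name is_valid_phone is made up for this illustration.

```shell
# Hypothetical helper (not part of the original examples): succeeds only when
# the argument has the form "(nnn) nnn-nnnn" or "nnn nnn-nnnn".
is_valid_phone() {
    # The alternation ( ... | ... ) accepts either both parentheses or none.
    printf '%s\n' "$1" |
        grep -Eq '^(\([0-9][0-9][0-9]\)|[0-9][0-9][0-9]) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
}

# Try two well-formed numbers and two with a single, unbalanced parenthesis.
for number in "(555) 123-4567" "555 123-4567" "(555 123-4567" "555) 123-4567"; do
    if is_valid_phone "$number"; then
        echo "valid:   $number"
    else
        echo "invalid: $number"
    fi
done
```

The option -q makes grep print nothing and report the result only through its exit status, which is what the if statement tests.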
-
Using the metacharacters {} we can express the number of required matches. For example:
echo "(555) 123-4567" | \
grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
The expression {3} matches if the preceding element occurs exactly 3 times.
We could also replace ? by {0,1}, or {,1}:
echo "(555) 123-4567" | \
grep -E '^\({0,1}[0-9]{3}\){,1} [0-9]{3}-[0-9]{4}$'
echo "555 123-4567" | \
grep -E '^\({0,1}[0-9]{3}\){,1} [0-9]{3}-[0-9]{4}$'
In general, {n,m} matches if the preceding element occurs at least n times, but no more than m times. These are also valid: {n,} (at least n times), and {,m} (at most m times; this form is a GNU extension).
-
Similar to ? which is equivalent to {0,1}, there is also * which is equivalent to {0,} (zero or more occurrences), and + which is equivalent to {1,} (one or more occurrences).
Let's say that we want to check if a string is a sentence. This means that it starts with an uppercase letter, then contains any number of upper and lowercase letters and spaces, and finally ends with a period. We could do it like this:
echo "This works." | grep -E '[A-Z][A-Za-z ]*\.'
echo "This Works." | grep -E '[A-Z][A-Za-z ]*\.'
echo "this does not" | grep -E '[A-Z][A-Za-z ]*\.'
Or like this:
echo "This works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
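The patterns above are not anchored, so grep reports any line that merely contains a sentence-like substring. As a sketch of a stricter check, the anchors ^ and $ can be added so that the whole line has to be a sentence. The helper name is_sentence is an assumption made for this example.

```shell
# Hypothetical helper: succeeds only if the whole line looks like a sentence.
is_sentence() {
    printf '%s\n' "$1" |
        grep -Eq '^[[:upper:]][[:upper:][:lower:] ]*\.$'
}

is_sentence "This works." && echo "match"
is_sentence "this Does not." || echo "no match"

# Without the anchors the same input still matches, starting at "Does".
echo "this Does not." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
```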
Note: In all these cases we have to escape the period (\.) so that it matches itself instead of any character.
-
Here is a regular expression that will only match lines consisting of groups of one or more alphabetic characters separated by single spaces:
echo "This that" | grep -E '^([[:alpha:]]+ ?)+$'
echo "a b c" | grep -E '^([[:alpha:]]+ ?)+$'
echo "a b  c" | grep -E '^([[:alpha:]]+ ?)+$'
Does not match because there are two consecutive spaces.
echo "a b 9" | grep -E '^([[:alpha:]]+ ?)+$'
Does not match because there is a non-alphabetic character.
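Note that the pattern above still accepts a single trailing space, because the last group may consume its optional space right before the end of the line. If that is not wanted, a possible variant (a sketch, with the made-up helper name is_word_list) allows a space only between two words:

```shell
# The original pattern accepts a trailing space after the last word.
echo "a b c " | grep -E '^([[:alpha:]]+ ?)+$'

# Hypothetical helper: every space has to be followed by another word.
is_word_list() {
    printf '%s\n' "$1" | grep -Eq '^[[:alpha:]]+( [[:alpha:]]+)*$'
}

for line in "This that" "a b c" "a b  c" "a b c " "a b 9"; do
    if is_word_list "$line"; then
        echo "match:    '$line'"
    else
        echo "no match: '$line'"
    fi
done
```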
-
Let's create a list of random phone numbers for testing:
echo $RANDOM
echo $RANDOM
echo ${RANDOM:0:3}
for i in {1..10}; do \
echo "${RANDOM:0:3} ${RANDOM:0:3}-${RANDOM:0:4}" >> phonelist.txt; \
done
cat phonelist.txt
for i in {1..100}; do \
echo "${RANDOM:0:3} ${RANDOM:0:3}-${RANDOM:0:4}" >> phonelist.txt; \
done
less phonelist.txt
cat phonelist.txt | wc -l
You can see that some of the phone numbers are malformed. We can display those that are malformed like this:
grep -Ev '^[0-9]{3} [0-9]{3}-[0-9]{4}$' phonelist.txt
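As a side note, the malformed entries are a consequence of how the list was generated: in bash, $RANDOM expands to an integer between 0 and 32767, so ${RANDOM:0:3} gives fewer than three digits whenever the number is below 100 (and ${RANDOM:0:4} fewer than four when it is below 1000). If a list without malformed entries is wanted, one possible sketch is to pad the numbers with printf. The file name phonelist-ok.txt is only an assumption for this example.

```shell
# Sketch (bash): printf pads each group with leading zeros, so every
# generated line has the form nnn nnn-nnnn. The file name is hypothetical.
for i in {1..10}; do
    printf '%03d %03d-%04d\n' \
        "$((RANDOM % 1000))" "$((RANDOM % 1000))" "$((RANDOM % 10000))"
done > phonelist-ok.txt

# Count the lines that do NOT match. grep exits with status 1 when it
# selects no lines, hence the "|| true".
bad=$(grep -Evc '^[0-9]{3} [0-9]{3}-[0-9]{4}$' phonelist-ok.txt || true)
echo "malformed lines: $bad"
```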
The option -v makes an inverse match, which means that grep displays only the lines that do not match the given pattern.
-
Regular expressions can be used with many commands, not just with grep.
For example, let's use them with find to find the files that contain bad characters in their name (like spaces, punctuation marks, etc.):
touch "bad file name!"
ls -l
find . -regex '.*[^-_./0-9a-zA-Z].*'
Unlike grep, find expects the pattern to match the whole path, which is why we prepend and append .* to the pattern.
We can use regular expressions with locate like this:
locate --regex 'bin/(bz|gz|zip)'
We can also use them with less:
less phonelist.txt
We can press / and write a regular expression, and less will find and highlight the matching lines. For example:
/^[0-9]{3} [0-9]{3}-[0-9]{4}$
The invalid lines will not be highlighted and will be easy to spot.
Regular expressions can also be used with zgrep like this:
cd /usr/share/man/man1
zgrep -El 'regex|regular expression' *.gz
It will find man pages that contain either "regex" or "regular expression". As we can see, regular expressions show up in a lot of programs.
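Coming back to the find example above: by default GNU find interprets the pattern in its Emacs-style syntax, which differs in some details from the syntax of grep -E. As a sketch, assuming GNU find is installed, the option -regextype can select the extended syntax instead. The temporary directory and the file names below are placeholders.

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
touch "$workdir/good_name.txt" "$workdir/bad file name!"

# -regextype has to come before -regex; posix-extended selects the syntax
# used by grep -E. The pattern must match the whole path, hence the .* parts.
found=$(cd "$workdir" &&
    find . -regextype posix-extended -regex '.*[^-_./0-9a-zA-Z].*')
echo "$found"

rm -r "$workdir"
```

For this particular pattern the choice of syntax should make no difference, but it matters as soon as the pattern uses operators such as {n,m}, which the default syntax may treat differently.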