Regular Expressions

Anything but regular...

Monday, Feb 24, 2020

Regular Expressions

A regular expression, also known as a RegEx, can be used to locate information in a file. There are many flavors of regular expressions, and tools such as grep, egrep, and pcregrep each implement a different one. Depending on which you use, the syntax will be slightly different. I prefer pcregrep as it’s the closest to the regular expression engines used in languages like Python, Node.js, and Java.

Different Flavors, Different Syntax

These are some differences between various flavors.

# prints '123' (the contents of capture group 1)
echo "var=123" | pcregrep -o1 '=(\d{3})'
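For comparison, here is a rough sketch of the same extraction with grep in extended mode. It assumes a grep that supports the -o flag (GNU and BSD grep do, although POSIX does not require it). The variable names are only for illustration.

```shell
# Sample input, identical to the pcregrep example above.
input='var=123'

# Extended syntax has no \d, so the digit class is spelled out as [0-9].
# grep -o prints only the matched text, which here includes the '='.
match=$(printf '%s\n' "$input" | grep -oE '=[0-9]{3}')

# grep has no -o1 to select a capture group, so strip the leading '='
# with shell parameter expansion instead.
digits=${match#=}
printf '%s\n' "$digits"
```

The pattern needs one more step here because grep can only print the whole match, while pcregrep can print a single capture group.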

Meta-characters

. any character

^ beginning of line, or “not” when it is the first character inside []

[] set of characters

{} repetition of characters

() capture groups

$ end of line

&

* zero or more repetitions of the preceding item

To refer to any of these characters literally, you have to escape it with a backslash, e.g. \. matches a literal period.

All other characters match themselves, and some perform a special function when they are first escaped with a \ (see Special Characters below).
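A small sketch of the difference escaping makes, using plain grep on three made-up sample lines:

```shell
# Three sample lines: only the first has a literal period between a and c.
lines=$(printf '%s\n' 'a.c' 'abc' 'a-c')

# Unescaped . matches any single character, so all three lines match.
any_count=$(printf '%s\n' "$lines" | grep -c 'a.c')

# Escaped \. matches only a literal period, so only 'a.c' matches.
literal_count=$(printf '%s\n' "$lines" | grep -c 'a\.c')

printf 'unescaped: %s, escaped: %s\n' "$any_count" "$literal_count"
```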

Special Characters

RegEx  Meaning                     Does Match  Also Matches  Doesn’t Match
\d     digit                       1           2             a
\D     non-digit                   a           >             3
\w     alphanumeric or underscore  a           1             >
\W     anything \w does not match  >           #             1
\s     whitespace
\t     tab
\n     newline
\r     carriage return
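Not every tool understands these escapes. As a rough sketch, the POSIX bracket expressions below are close equivalents that work with grep -E. The classify function is a hypothetical helper written only for this illustration.

```shell
# classify is a hypothetical helper: it names the class that a single
# character falls into, using portable bracket expressions.
classify() {
    if printf '%s\n' "$1" | grep -qE '^[0-9]$'; then
        echo 'digit'            # [0-9] is the portable spelling of \d
    elif printf '%s\n' "$1" | grep -qE '^[[:alnum:]_]$'; then
        echo 'word character'   # [[:alnum:]_] is the portable spelling of \w
    elif printf '%s\n' "$1" | grep -qE '^[[:space:]]$'; then
        echo 'whitespace'       # [[:space:]] roughly corresponds to \s
    else
        echo 'other'
    fi
}

# Try a digit, a letter, a symbol, and a space.
for ch in 1 a '>' ' '; do
    printf '[%s] %s\n' "$ch" "$(classify "$ch")"
done
```

The digit test comes first because a digit is also a word character. The later branches therefore only see characters the earlier ones rejected.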

Changes

Compared with plain grep, egrep (extended regular expressions) adds some meta-characters, such as |, and lets others, such as () and {}, keep their special properties without being escaped. To search for the literal character, prepend it with a backslash.

# [Search for line containing 'this' or 'that']
egrep 'this|that' file.txt
# [Search for a line containing 'abcabc']
egrep '(abc){2}' file.txt
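To see the escaping difference side by side, here is a minimal sketch that runs the basic and extended spellings of the same pattern over made-up sample lines. It uses grep -E, which accepts the same syntax as egrep.

```shell
# Four made-up sample lines to search.
sample=$(printf '%s\n' 'abcabc' 'abc' 'this one' 'neither')

# Basic syntax (plain grep): grouping and repetition need backslashes.
bre=$(printf '%s\n' "$sample" | grep '\(abc\)\{2\}')

# Extended syntax (grep -E): the same pattern, written unescaped.
ere=$(printf '%s\n' "$sample" | grep -E '(abc){2}')

# Alternation with an unescaped | is available in the extended syntax.
alt=$(printf '%s\n' "$sample" | grep -E 'this|that')

printf 'basic: %s\nextended: %s\nalternation: %s\n' "$bre" "$ere" "$alt"
```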

grep

grep was written in 1974 by Ken Thompson as part of the original UNIX operating system. Before grep, people using the original file editing software, ed, would often type the command g/re/p inside the editor to perform a global regular expression print: g applies a command to every line matching the regular expression re, and p prints those lines. This is where grep gets its name.

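Luckily, sed supports the same -E flag to use extended regular expressions. In vim, -E starts Ex mode rather than changing the regex syntax, so the pattern itself has to begin with \v ("very magic") to get unescaped meta-characters such as |.

# [Use ex to replace 'this' or 'that' with 'world' on line 1]
vim -E -nsc '1s/\vthis|that/world/e' -cx file.txt
# [Use sed to replace 'this' or 'that' with 'world' on line 1]
gsed -E -i -e '1s/this|that/world/' file.txt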
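To try the substitution without touching a file, the same idea can be sketched on a stream. This assumes the system sed accepts -E, which current GNU and BSD versions do. The sample text is made up.

```shell
# Two sample lines; the address 1 limits the substitution to the first one.
text=$(printf '%s\n' 'say this' 'keep that')

# -E enables extended syntax, so | means alternation without a backslash.
result=$(printf '%s\n' "$text" | sed -E '1s/this|that/world/')

printf '%s\n' "$result"
```

Only the first line changes; 'that' on the second line is left alone because the address restricts the command to line 1.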

Metacharacters

Although lots of documentation will say the following meta-characters are unsupported in grep, they work fine in the standard grep for macOS.

RegEx History

DFA or NFA

Technically a regex is not a regular expression. It might as well be, as far as everyday use is concerned, but this is the history.

In the 1950s, mathematician Stephen Kleene, known today as the father of Regular Expressions, described regular languages and the operations needed to express them: concatenation, alternation, and repetition. These are exactly the languages that a Deterministic Finite Automaton (DFA) can recognize.

A DFA is capable of understanding expressions such as “the character A must be matched 0 or more times”. The syntax to declare this regular expression was A* (sound familiar?)

That star is now referred to as a Kleene star, named after the very mathematician who laid the groundwork for regular expressions.
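As a quick sketch of what the star means in practice, the hypothetical accepts helper below checks whether a whole line consists of zero or more A characters:

```shell
# accepts is a hypothetical helper: it reports whether the whole line
# consists of zero or more A characters and nothing else.
accepts() {
    if printf '%s\n' "$1" | grep -qE '^A*$'; then
        echo 'yes'
    else
        echo 'no'
    fi
}

# The empty string is accepted because the star allows zero repetitions.
for s in '' 'A' 'AAA' 'AAB'; do
    printf '"%s": %s\n' "$s" "$(accepts "$s")"
done
```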

Eventually the functionality of regular expressions was expanded, for example by the Perl language in the 1980s. As a result, many modern regular expression engines are backtracking matchers, often loosely classified as Non-Deterministic Finite Automaton (NFA) engines rather than DFAs.