Regular Expressions
A regular expression, also known as a RegEx, is a shorthand description of a pattern to match against text, such as the contents of a file. There are many flavors of regular expressions, which can frustrate those new to learning the language. By default, the `grep` command uses basic regular expression (BRE) syntax.
The origins of grep
Ken Thompson, author of `grep`, was using `ed` at the time (this was before Bill Joy et al. had written `ex` and later `vi`), and he kept having to make the same kind of bulk substitutions. He wanted to edit a bunch of lines, but he couldn't see them, because this was the era before text-user interfaces (TUIs) were a thing. So before editing those lines all at once, he needed to print them globally to know what he was matching. He needed to globally search for a regular expression and print the matches: `g/re/p`.
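The `g/re/p` idea can still be tried without `ed`. The sketch below assumes `sed`, which borrows its command syntax from `ed`: `-n` suppresses the default output and `/re/p` prints only the lines matching `re`. The sample words and variable names are made up for illustration.

```shell
# Made-up sample input, one word per line
text=$(printf '%s\n' 'alpha' 'bread' 'gamma' 'green')

# grep prints every line that matches the regular expression "re"
with_grep=$(printf '%s\n' "$text" | grep 're')

# sed -n '/re/p' does the same job using the ed-style /re/p command
with_sed=$(printf '%s\n' "$text" | sed -n '/re/p')

echo "$with_grep"
echo "$with_sed"
```

Both commands should print the same two lines, `bread` and `green`.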
Character Classes
Character classes are the `[ ... ]` set of characters.
- `[0-4]` will match the digits 0, 1, 2, 3, 4
- `[A-Z]` will not match the `a` in `apple`, but will match the `A` in `Apple`.
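A quick way to try these two classes is to pipe a few sample lines through `grep`. This is a minimal sketch: the sample words and variable names are made up, and `LC_ALL=C` is set on the second command on the assumption that some locales sort letters differently and could let `[A-Z]` match lowercase characters.

```shell
# Made-up sample input, one item per line
words=$(printf '%s\n' 'apple' 'Apple' '42' '7')

# [0-4] matches any line containing a digit from 0 through 4
low_digits=$(printf '%s\n' "$words" | grep '[0-4]')

# [A-Z] matches any line containing an uppercase ASCII letter
capitals=$(printf '%s\n' "$words" | LC_ALL=C grep '[A-Z]')

echo "$low_digits"   # 42
echo "$capitals"     # Apple
```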
Negation can be specified by prefixing the character class with the `^` character.
- `[^abc]` will match any single character that isn't `a`, `b`, or `c`
- `[^c]at` will match "hat" and "bat" but not "cat"
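The negated classes above can be checked the same way. In this sketch the inputs and variable names are made up, and `-o` (print only the matched text) is assumed to be available, as it is in GNU and BSD `grep`.

```shell
# [^c]at needs one character other than "c" directly before "at"
not_cat=$(printf '%s\n' 'hat' 'bat' 'cat' 'at' | grep '[^c]at')

# -o prints each match on its own line, so this lists every
# character in "cabbage" that is not a, b, or c
leftovers=$(printf '%s\n' 'cabbage' | grep -o '[^abc]')

echo "$not_cat"
echo "$leftovers"
```

Note that the bare line `at` does not match `[^c]at`: a negated class still has to consume one character.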
Different Flavors, Different Syntax
These are some differences between various flavors. For example, on macOS, `grep` supports `\d` to represent digits, but supposedly GNU's version of `grep` does not. `grep` supports using `\<` and `\>` to represent word boundaries, but Perl compatible regular expressions use `\b` to represent a word boundary. Furthermore, Perl compatible regular expressions support capture groups, but `grep` has no way to extract them; the most it offers is a backreference such as `\1` inside the pattern itself.
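As a rough illustration of the word-boundary syntax, the sketch below uses `\<` and `\>` to match `cat` only as a whole word. It assumes a `grep` that supports these anchors (GNU and BSD `grep` both do); the sample lines and the variable name are made up.

```shell
# Made-up sample input
lines=$(printf '%s\n' 'the cat sat' 'concatenate' 'cat')

# \< anchors at the start of a word, \> at the end of a word,
# so "cat" inside "concatenate" is not a match
whole_word=$(printf '%s\n' "$lines" | grep '\<cat\>')

echo "$whole_word"
```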
Special Characters
RegEx | Meaning |
---|---|
\d | digit |
\D | non-digit |
\w | alphanumeric |
\W | non-alphanumeric |
\s | whitespace |
\t | tab |
\n | newline |
\r | carriage return |
Changes
In extended regular expressions (the `-E` flag), the meta-characters `?`, `+`, `|`, `()`, and `{}` have either been added or no longer need to be escaped to use their special properties. To search for the literal character, prepend it with a backslash.
# [Search for line containing 'this' or 'that']
grep -E 'this|that' file.txt
# [Search for a line containing 'abcabc']
grep -E '(abc){2}' file.txt
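To see the escaping rules side by side, here is a small sketch that runs the same search in basic and extended syntax, then searches for literal parentheses. The input lines and variable names are made up for illustration.

```shell
# Made-up sample input
input=$(printf '%s\n' 'abcabc' 'abc' 'f(x)')

# Basic syntax: grouping and repetition counts need backslashes
bre=$(printf '%s\n' "$input" | grep '\(abc\)\{2\}')

# Extended syntax: the same pattern without the backslashes
ere=$(printf '%s\n' "$input" | grep -E '(abc){2}')

# Extended syntax: a backslash turns ( and ) back into literal characters
literal=$(printf '%s\n' "$input" | grep -E 'f\(x\)')

echo "$bre"       # abcabc
echo "$ere"       # abcabc
echo "$literal"   # f(x)
```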
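Luckily, `sed` and `grep` both support the same `-E` flag to use extended regular expressions. `vim` also has a `-E` flag, but there it starts improved Ex mode instead; the pattern itself needs `\v` ("very magic") before `|` works as alternation.
# [Use ex to replace this or that with world on line 1]
vim -E -nsc '1s/\vthis|that/world/e' -cx file.txt
# [Use sed to replace this or that with world on line 1]
# [BSD/macOS sed treats the argument after -i as a backup suffix, here "-e"]
sed -E -i -e '1s/this|that/world/' file.txt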
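A safer way to experiment with the `sed` substitution is to run it on a stream instead of editing a file in place. This sketch assumes a `sed` that accepts `-E` (current GNU and BSD versions do); the two input lines and the variable name are made up.

```shell
# The address "1" limits the substitution to the first line, and
# without a trailing "g" only the first match on that line is replaced
replaced=$(printf '%s\n' 'this and that' 'this stays' | sed -E '1s/this|that/world/')

echo "$replaced"
```

The first line becomes `world and that`, and the second line is left untouched.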
Metacharacters
Although lots of documentation will say the following meta-characters are unsupported in `grep`, they work fine in the standard `grep` for macOS.
- `\d` digit
- `\D` non-digit
- `\s` whitespace (non-newline)
- `\S` non-whitespace
- `\w` alphanumeric (all in `\a`, `\d` and also `_`)
- `\W` non-alphanumeric
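Where these shorthands are not available, POSIX bracket classes such as `[[:digit:]]` are a portable fallback. The sketch below is one way to express roughly the same searches; the sample lines and variable names are made up.

```shell
# Made-up sample input
sample=$(printf '%s\n' 'room 101' 'no digits here' 'snake_case')

# [[:digit:]] stands in for \d
with_digit=$(printf '%s\n' "$sample" | grep '[[:digit:]]')

# [[:space:]] stands in for \s
with_space=$(printf '%s\n' "$sample" | grep '[[:space:]]')

# [[:alnum:]_] stands in for \w; here the whole line must be word characters
wordlike=$(printf '%s\n' "$sample" | grep -E '^[[:alnum:]_]+$')

echo "$with_digit"   # room 101
echo "$with_space"
echo "$wordlike"     # snake_case
```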
RegEx History
DFA or NFA
Technically a regex is not a regular expression. It might as well be, as far as everyday use is concerned, but this is the history.
In the 1950s, famous mathematician Stephen Kleene, known today as the father of Regular Expressions, outlined the set of operations that a machine must support in order to be declared as a Deterministic Finite Automaton (DFA).
A DFA is capable of understanding expressions such as "the character A must be matched 0 or more times". The syntax to declare this regular expression was `A*` (sound familiar?).
That star is now referred to as a Kleene star, named after the very mathematician who laid the groundwork for regular expressions.
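The Kleene star behaves the same way in `grep` today. In this small sketch, `-x` requires the whole line to match, so `A*B` accepts zero or more `A` characters followed by a single `B`. The sample lines and the variable name are made up.

```shell
# "B" has zero As, "AB" has one, "AAAB" has three; "CB" starts with the wrong letter
starred=$(printf '%s\n' 'B' 'AB' 'AAAB' 'CB' | grep -x 'A*B')

echo "$starred"
```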
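Eventually the functionality of regular expressions was expanded, such as by the Perl language in the 1980s, leading to some modern regular expression engines being classified as Non-Deterministic Finite Automata (NFAs). Extensions such as backreferences go beyond what a formal regular expression can describe, which is why a regex is technically not a regular expression.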
RegEx Scrapbook
Here is where I'll be saving regular expressions I want to be able to refer back to at some point.
- Select all Markdown fenced code blocks that are indented using 2 spaces
/  ```.*$\n\(  .*$\n\)\+  ```$