[sf-lug] On learning regular expressions

Mon Dec 29 00:20:12 PST 2008

> Date: Sun, 14 Dec 2008 16:15:28 -0800
> From: "Jesse Zbikowski" <embeddedlinuxguy at gmail.com>
>
> On Sun, Dec 14, 2008 at 1:10 PM, Michael Paoli
> <Michael.Paoli at cal.berkeley.edu> wrote:
>> Although the "standard" Unix
>> regular expression stuff is quite powerful, for the really hairy stuff,
>> perl's regular expression - and other - capabilities, may be much more
>> suitable
>
> I agree with this and I would further suggest to learn one regular
> expression syntax (e.g. Perl's) and stick with it.  If you try to
> write command lines mixing RE syntax from grep, egrep, find, sed, etc,
> you are going to end up making mistakes.  You forget where . means
> "any character" and where it is a literal, when you glob with * and
> when you need .*, etc etc etc.  Get used to piping your commands to
> perl -ne, or whatever is the equivalent in Python / Ruby / your
> favorite glue language.  Forget about the regular expression
> capabilities in the standard Unix commands unless you are a masochist.

I disagree ... somewhat ;-)

Sure, all that "regular expression" and globbing/wildcard stuff, with
its many variants, can be quite confusing - at least at first.

I'd recommend a more familial approach.  The regular expression stuff
essentially breaks down into two or three families, depending how one
counts, and within each family, its members do look an awful lot like
each other.

First, start simple: the basic Bourne/POSIX shell wildcarding/globbing.
In many (most?) contexts it's not called or referred to as "regular
expressions", but in at least some (e.g. some vendors, such as
Hewlett-Packard's HP-UX) it is.  Put the emphasis mostly on what's
pretty much common to all of the shells - the more exciting
additions/extensions can be covered later.  E.g. know what these do:
?
*
[ (as start of a character class)
and what -, ! and ] do within a character class, depending upon
their placement ... and probably also cover when [ and : can do
something more exciting within a character class (if nothing else, to
avoid accidentally tripping over it, for starters).  Likewise ^ within
a character class - for at least some shells it's synonymous with !
(probably just to make it a bit more RE-like in syntax).
That is family zero.  Within it is not only Bourne and POSIX shells and
their brethren (and close cousins, etc.), but also find, and
perl's globbing, and with some slight variation, cpio's pattern
matching (and probably some other utilities that aren't immediately
jumping to mind).
The above is relatively simple, and probably a good place to start, if
it's not already been covered.  Once that's reasonably well
understood ...
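
To make family zero a bit more concrete, here's a minimal sketch using
Python's fnmatch module, which implements shell-style wildcards.  The
file names are made up, and fnmatch is not the shell (e.g. it gives no
special treatment to a leading dot), so take it as an illustration of
the common core only.  The last line shows the same test written as a
regular expression, where . must be escaped and "anything" is .* rather
than *.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical file names, just for illustration.
names = ["a.c", "ab.c", "a.h", "Makefile", "x1", "x-"]

def glob(pattern):
    """Return the names matched by a shell-style wildcard pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]

single = glob("?.c")        # ? matches exactly one character
anything = glob("*.c")      # * matches any run of characters, even none
in_class = glob("x[0-9]")   # [ starts a character class; - makes a range
negated = glob("x[!0-9]")   # a leading ! negates the class

# The same "ends in .c" test as a regular expression.
as_regex = [n for n in names if re.fullmatch(r".*\.c", n)]

print(single, anything, in_class, negated, as_regex)
```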

Secondly, we have the most common Unix/Linux regular expression stuff
("basic" and "extended" regular expressions, but not perl regular
expressions - "extended" only adds a modest bit beyond "basic").
Ah, but which/where, they do tend to vary a bit, don't they?  Well, yes,
but they don't vary by much, and where and how they vary is quite well
documented.  So, ... start with which one?  Well, at least historically,
the best place to start was ed(1).  :-)  The reason being, all the other
regular expression stuff used to be described either as using the same
regular expression matching as in ed(1) or was explained just in terms
of how it differed from the regular expression matching of ed(1).
Also, at least historically, the description of regular expressions in
ed(1) was quite concise - about one page.  Even in the GNU ed(1) at
my fingertips, that description, and with a fair number of GNU
extensions, still (barely) fits within about two pages.
So, once one has learned the regular expression syntax of ed(1),
a huge realm of Unix/Linux regular expression syntax is then opened
up - much of it being the same as, or only slightly different from,
that of ed(1), e.g. grep, egrep/grep -E, sed, awk, vi, ex, expr, etc.
Anyway, I'd strongly recommend covering the above regular expression
family well before diving into perl's regular expression family.
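
As a rough sketch of the core that this family shares, the patterns
below use only constructs that are spelled the same way in ed, grep and
sed.  Python's re module is used purely as a convenient way to run them;
it is a perl-style engine, not ed, and it differs once grouping and
intervals come in (basic regular expressions write those as \( \) and
\{ \}).  The word list and the grep helper are made up for the example.

```python
import re

# Hypothetical input lines, just for illustration.
lines = ["cat", "concat", "cart", "c.t", "ct", "caat"]

def grep(pattern):
    """Return the lines in which the pattern matches somewhere."""
    return [s for s in lines if re.search(pattern, s)]

dot = grep(r"^c.t$")         # . matches any single character
star = grep(r"^ca*t$")       # * means zero or more of the preceding item
at_start = grep(r"^cat")     # ^ anchors the match at the start of the line
at_end = grep(r"cat$")       # $ anchors the match at the end of the line
bracket = grep(r"^c[a.]t$")  # inside [...] the . is just a literal dot
escaped = grep(r"^c\.t$")    # a backslash also makes . literal

print(dot, star, at_start, at_end, bracket, escaped)
```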

Thirdly, we have perl regular expressions.  And if one thought the
basic regular expression capabilities of Unix/Linux beat the heck out of
typical basic wildcard matching, perl raises it yet another level.
While still handling pretty much all of the basic Unix/Linux
regular expression stuff, perl extends it in ways that make it much more
efficient and convenient for the programmer, and that make it much more
feasible - and sometimes even just plain possible - to do some things
that one can't do, or that are very painful to do, with the basic
Unix/Linux regular expression capabilities.  Along with that, however,
comes a fair bit of complexity - e.g. the basic reference on perl
regular expressions (e.g. the perlre man page) is about 20 pages,
compared to about 1 for the (original) regular expression description
for ed(1) (and still about 2 pages for the current GNU description).
Once one has well covered perl regular
expressions, that opens up the whole family that uses perl regular
expressions, ... they too with their variations (some limitations,
changes, extensions, etc.) - but in most regards matching up to how perl
does regular expressions.  These include Apache, many GNU utilities
(typically with an option to use perl regular expressions), and other
programming languages and utilities that support perl regular
expressions (again, often with some variations).
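
For a small taste of what the perl family adds, here's a sketch in
Python, whose re module is one of those "other programming languages"
with perl-style regular expressions and its own variations (e.g. it
spells named groups (?P<name>...)).  The sample strings are made up;
the constructs shown (non-greedy matching, \d, lookahead, named groups)
are ones the basic and extended families don't have.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last </b>; the non-greedy .*? stops at the first.
greedy = re.findall(r"<b>.*</b>", text)
lazy = re.findall(r"<b>(.*?)</b>", text)

# \d is shorthand for a digit.
numbers = re.findall(r"\d+", "Sun, 14 Dec 2008")

# (?=...) is a lookahead: it requires what follows without consuming it.
stems = re.findall(r"\b\w+(?=\.conf\b)", "httpd.conf sshd.config ssl.conf")

# Named groups make the pieces of a match available by name.
m = re.search(r"(?P<user>\w+)@(?P<host>[\w.]+)", "mail root@example.org now")
user, host = m.group("user"), m.group("host")

print(greedy, lazy, numbers, stems, user, host)
```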

> Date: Sun, 14 Dec 2008 18:12:12 -0800 (PST)
> From: Asheesh Laroia <asheesh at asheesh.org>
>
> On Sun, 14 Dec 2008, Jesse Zbikowski wrote:
>
>> Forget about the regular expression capabilities in the standard Unix
>> commands unless you are a masochist.
>
> Or (my preference) get used to demand "E"xtended regular expressions.
> $ grep -E
> $ sed -r
>
> Those two give you a perl-ish modern set of regular expressions where you
> don't find yourself cursing the tool for being stuck in 1988.  It's
> probably not "full" perl-compatible regular expressions, but I've always
> been happy with how close it is.
>
> I firmly believe we need to stop teaching "grep and sed" and start
> teaching people to use "grep -E and sed -r".

Actually, egrep and grep -E give you "just" extended regular expressions.
Historically that's just basic regular expressions plus | for
alternation, ? for zero or one, and + for one or more.  That doesn't
come very close to perl's regular expression capabilities, but does add
a bit beyond basic regular expressions.  egrep and grep -E are quite
standard.  The -r option to sed, however, is a GNU extension, and not
standard.  In any case, GNU basic and extended
regular expressions commonly add a slight bit more than POSIX/SUS.
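
To show how modest those additions are, here's a sketch (again in
Python, as a stand-in, with a made-up word list) in which ? and + are
each paired with an equivalent built only from an interval or from *.
POSIX basic regular expressions would spell the interval \{0,1\} rather
than {0,1}; alternation is the one of the three with no such rewrite.

```python
import re

# Hypothetical words, just for illustration.
words = ["color", "colour", "colouur", "gray", "grey", "ab", "aab", "b"]

def matches(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional = matches(r"colou?r")               # ? : zero or one "u"
optional_interval = matches(r"colou{0,1}r")  # the same, as an interval
one_or_more = matches(r"a+b")                # + : one or more "a"
one_or_more_star = matches(r"aa*b")          # the same, using only *
alternation = matches(r"gray|grey")          # | : either alternative

print(optional, one_or_more, alternation)
```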

Again, for regular expressions, I'd recommend the familial approach.
Learn them by their families.  There isn't all that much variation in
the regular expressions *within* a family.

references (e.g.):
Linux Standard Base: http://www.linuxfoundation.org/en/LSB
which, e.g. for sed(1):
http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/command.html
mostly ends up referring to:
(SUS references - free as in beer, not freedom):
The Single UNIX(R) Specification, Version 3:
http://www.unix.org/single_unix_specification/
sed - stream editor:
http://www.opengroup.org/onlinepubs/000095399/utilities/sed.html
Regular Expressions:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html