[sf-lug] regular expressions: match 1 to 4 char word; does . match etc.

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sun Dec 14 13:10:00 PST 2008


> Date: Wed, 10 Dec 2008 08:37:02 -0800
> From: jim <jim at well.com>
> Subject: regex: how to match any one to four character word in a file
>
> i have a text file with over 100000 words (and lines,
> one word per line). i wanna grep out all words that
> are from one to four characters, e.g. 'a' or 'and'
> or "fact" but not "apple" or "zounds".
>
> $  grep '[.]{4}' words.txt
> got me a newline.

Well, going by the above specification, let's presume that "grep out"
means output the matching lines, and that we use grep to do it.  If we
know we have exactly one word per line, and there are no non-word
characters other than the newline terminating each line, we could use:
$ grep '^.\{1,4\}$' words.txt
or
$ grep -v '.....' words.txt
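
A minimal sketch comparing the two commands above on a small sample
file (the file contents and variable names here are made up for
illustration).  One difference worth noting: the -v form also lets
empty lines through, since an empty line does not contain five
characters, while the \{1,4\} form requires at least one character.

```shell
# Hypothetical sample data: one word per line, plus one empty line.
tmp=$(mktemp)
printf '%s\n' a '' and fact apple zounds > "$tmp"

# Count lines of exactly 1 to 4 characters.
count_interval=$(grep -c '^.\{1,4\}$' "$tmp")

# Count lines that do NOT contain 5 consecutive characters.
count_negation=$(grep -v -c '.....' "$tmp")

echo "interval form: $count_interval, -v form: $count_negation"
rm -f "$tmp"
```

If empty lines can occur in the input and are not wanted, the
'^.\{1,4\}$' form is the safer of the two.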

If the word on each line may have non-word characters on either or both
sides of it on the line, then things get a bit more complex.  We need
to define exactly what we do and don't consider to be a word character.
It may even get more complex than that, as position may also be
significant.  E.g. let's say we're dealing just with ASCII, and consider
words to contain only ASCII letters and hyphen (-), so that a
hyphenated word counts as one word, but a word can't start or end with
a hyphen or contain two or more consecutive hyphens.  And let's say we
may have lines with other than exactly one word on them.  First we
build a test file:
> words.txt
for s in '' a at cat - -- -cat- c--a -a-t- scat scats four-four \
four--four xfive-xfive xfive--xfive
do
     echo "$s" >> words.txt
     echo \!"$s"\! >> words.txt
done
unset s

LC_ALL=C grep -E \
-e '(^|[^A-Za-z])[A-Za-z]{1,4}([^A-Za-z]|$)' \
-e '(^|[^A-Za-z])[A-Za-z](-|-[A-Za-z]|[A-Za-z]-)[A-Za-z]([^A-Za-z]|$)' \
words.txt
The above gets us the lines:
a
!a!
at
!at!
cat
!cat!
-cat-
!-cat-!
c--a
!c--a!
-a-t-
!-a-t-!
scat
!scat!
four-four
!four-four!
four--four
!four--four!
Note that it's not matching c--a and four--four as words themselves,
but rather c, a, and four as words bounded by the non-word character -
(hyphen).  In our implementation above, we allowed hyphen as an
interior word character, but also as a non-word boundary character ...
though we could have chosen a different interpretation and
implementation.  Note also that the -o option to (GNU) grep wouldn't
help much to show exactly what's going on there, as the pattern we use
to match (most notably to exclude words that are too long) includes
both word and non-word characters.  Although the "standard" Unix
regular expression stuff is quite powerful, for the really hairy stuff,
perl's regular expression - and other - capabilities may be much more
suitable (GNU grep does also have the -P, or --perl-regexp, option).
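
Every line in the output above already contains a run of one to four
letters bounded by non-letters, so the first -e pattern alone accounts
for all of them.  As a rough sketch, here is the second (hyphenated
word) pattern run by itself on a few made-up test words, to show what
it accepts and rejects (the sample words and the variable name are
assumptions for illustration):

```shell
# Hypothetical test words: three valid short hyphenated words, one that
# is too long (ab-cd), and one with consecutive hyphens (a--b).
tmp=$(mktemp)
printf '%s\n' a-b a-bc ab-c ab-cd a--b > "$tmp"

# Same second pattern as in the grep -E command above, used alone.
hyphenated=$(LC_ALL=C grep -E \
    '(^|[^A-Za-z])[A-Za-z](-|-[A-Za-z]|[A-Za-z]-)[A-Za-z]([^A-Za-z]|$)' \
    "$tmp")

printf '%s\n' "$hyphenated"
rm -f "$tmp"
```

This should print a-b, a-bc and ab-c only.  Note that with both -e
patterns together, ab-cd and a--b would still be output, because the
first pattern sees ab, cd, a and b as short words bounded by hyphens.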

Let's say we want to use perl's \w definition of a word character (and
\W for non-word character), and want to output the words, one per line,
regardless of the number of words on input lines.
Perl's RE modifier x can also make for much more readable regular
expressions, e.g.
/^(?:|.*?\W)(\w{1,4})(\W.*|)$/ox
isn't nearly as human readable as the RE that follows in the perl below:
perl -n -e '
     while(
         # match is true if we have qualified word in $_
         /
             ^
             # do not save any stuff before first qualified word on $_
             (?:|.*?\W)

             # first qualified word on $_
             (\w{1,4})

             # any stuff after first qualified word on $_
             (\W.*|)
             $
         /ox
     ){
         print "$1\n";   # qualified word on line
         $_=$2;          # anything after it on line
     }
' words.txt
and running that, gives us:
a
a
at
at
cat
cat
cat
cat
c
a
c
a
a
t
a
t
scat
scat
four
four
four
four
four
four
four
four
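
For comparison, a sketch of a different technique (not the perl
approach above): split the input into one word per line with tr, then
reuse the simple one-word-per-line grep from earlier.  This assumes an
ASCII-only approximation of \w (letters, digits and underscore); the
sample lines and variable name are made up for illustration.

```shell
# Hypothetical sample lines, similar to some of those in words.txt.
tmp=$(mktemp)
printf '%s\n' 'a' '!cat!' 'c--a' 'scats' 'four-four' 'xfive-xfive' > "$tmp"

# tr -cs turns each run of non-word characters into a single newline,
# leaving one word per line; grep then keeps words of 1 to 4 characters
# (which also drops any empty line tr leaves at the start).
short_words=$(LC_ALL=C tr -cs 'A-Za-z0-9_' '\n' < "$tmp" |
    grep '^.\{1,4\}$')

printf '%s\n' "$short_words"
rm -f "$tmp"
```

This should print a, cat, c, a, four, four (one per line), skipping
scats and xfive as too long.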

More RE fun:  In /usr/share/dict/words (or the equivalent word list for
your distribution, or other similar source), how many lines are five
character palindromes?  The precise count will vary depending on the
file one has; for a command to find it, see [1] (at the very end).


> Date: Wed, 10 Dec 2008 13:25:33 -0800
> From: Charles-Henri Gros <chgros at coverity.com>
>
> newlines / carriage returns are not matched by '.'

Carriage return (^M <CR> <RETURN> \r ASCII \015 13 0xD) does match . in
RE.
newline / linefeed (^J \n ASCII \012 10 0xA) does not match . in RE,
with some limited exceptions (e.g. perl with s modifier, embedded
newline in pattern space in sed).
To demonstrate, we use cmp(1), which will give us a 0 return value if
the files compare identically, and two test files:
rn - carriage return and newline (I include the newline, because many
text/line oriented utilities may have unspecified behaviors if they're
given non-empty input that doesn't end with a newline).
n - just newline
$ echo -e '\015' > rn
$ echo > n
$ < rn od -t o1
0000000 015 012
0000002
$ < n od -t o1
0000000 012
0000001
RE . matching carriage return:
$ < rn grep . | 2>>/dev/null cmp - rn; echo $?
0
$ < rn sed -ne '/./p' | 2>>/dev/null cmp - rn; echo $?
0
$ < rn awk '/./' | 2>>/dev/null cmp - rn; echo $?
0
RE . not matching newline:
$ < n grep . | 2>>/dev/null cmp - n; echo $?
1
$ < n sed -ne '/./p' | 2>>/dev/null cmp - n; echo $?
1
$ < n awk '/./' | 2>>/dev/null cmp - n; echo $?
1
RE . matching embedded newline in pattern space in sed:
< n sed -ne '

     # put it in the hold space
     h

     # append it to the hold space, separated by embedded newline
     H

     # exchange hold and pattern space (so now our pattern space has just
     # an embedded newline in it)
     x

     # let us show what is in our pattern space ... but in a more visual
     # way, let us mark the start with ^ and the end with $
     s/^/^/
     s/$/$/
     # print (output) it
     p

     # okay, let us strip off the ^ and $ we stuck on there
     s/^\^//
     s/\$$//

     # everywhere we have left that matches . (just our embedded newline)
     # replace with X
     s/./X/g

     # print the pattern space
     p
'
^
$
X
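
A shorter sketch of the same effect, using sed's N command to pull the
next input line into the pattern space (so the pattern space holds an
embedded newline between the two lines).  The two-line input and the
variable name are made up for illustration.

```shell
# Input is the two lines "a" and "b".  After N, the pattern space is
# a<newline>b; s/./X/g then replaces all three characters, including
# the embedded newline.
joined=$(printf 'a\nb\n' | sed -n -e 'N' -e 's/./X/g' -e 'p')

echo "$joined"
```

If . did not match the embedded newline, the output would be two lines
of X rather than the single line XXX.
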
RE . matching newline in perl with s modifier:
perl -e '
     $_="\n";
     print "without s:\n";
     if(! /./o){print "not "}print "matched\n";
     print "with s:\n";
     if(! /./os){print "not "}print "matched\n";
'
without s:
not matched
with s:
matched
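
One practical consequence of . matching carriage return: on a word
list with DOS-style (CR LF) line endings, the trailing carriage return
counts as a character of each line.  A minimal sketch, with a made-up
three-word file and made-up variable names:

```shell
# Hypothetical word list with CR LF line endings.
tmp=$(mktemp)
printf 'fact\r\nand\r\napple\r\n' > "$tmp"

# With the CR left in place, "fact" plus CR is five characters and is
# rejected; only "and" plus CR (four characters) matches.
with_cr=$(LC_ALL=C grep -c '^.\{1,4\}$' "$tmp")

# Stripping the CRs first gives the intended result.
without_cr=$(tr -d '\r' < "$tmp" | LC_ALL=C grep -c '^.\{1,4\}$')

echo "with CR: $with_cr, CR stripped: $without_cr"
rm -f "$tmp"
```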

footnote:
1. $ grep -i '^\(.\)\(.\).\2\1$' /usr/share/dict/words | wc -l
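
A small sketch of the same back-reference pattern on a made-up word
list, so the result can be checked by eye (the sample words and the
variable name are assumptions; grep -c is used here in place of
piping to wc -l):

```shell
# Hypothetical word list: four five-letter palindromes, one five-letter
# non-palindrome, one four-letter palindrome, and one more non-palindrome.
tmp=$(mktemp)
printf '%s\n' level radar civic kayak apple noon abcde > "$tmp"

# \1 and \2 must repeat, in reverse order, after the middle character.
palindromes=$(grep -i -c '^\(.\)\(.\).\2\1$' "$tmp")

echo "$palindromes"
rm -f "$tmp"
```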