[sf-lug] regex: how to match any one to four character word in a file

Clyde Jones slash5toaster at gmail.com
Wed Dec 10 14:06:32 PST 2008


On Wed, Dec 10, 2008 at 13:25, Charles-Henri Gros <chgros at coverity.com> wrote:
> Jeff Bragg wrote:
>> That won't work (I just verified for myself that it doesn't by trying it).
>> Those are line anchors, not word anchors.  It will only match lines that
>> have no more than 4 characters on them (including newlines, carriage
>> returns, etc).
>>
>> Something like '\w\{1,4\}' should work, though in practice it doesn't seem
>> to honor the maximum match condition (4 in this case).
> newlines / carriage returns are not matched by '.'
>
> Also, your test is not much better, since it doesn't use any anchors at
> all, so it will do a partial match (hence the "not honoring the maximum
> match condition")
>
> Word boundary is \b (\< for left only, \> for right only)
>
> So you can use:
> '\b\w\{1,4\}\b'
> or
> '\<\w\{1,4\}\>'
>
> but in either case, it will print the whole line that contained the
> matching word. Use grep -o to only print the matching text.
>
> In any case, it said one word per line, so line anchors should work.
>
> --
> Charles-Henri

The fragment I sent uses tr to convert a text stream to a single word
per line, so the anchors work and force the output to be a single word
per line.
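
The fragment itself is not quoted in this message, so the following is
only a sketch of the idea, not the original command. It assumes a POSIX
tr and grep; the function name short_words is made up for illustration.

```shell
# short_words is a hypothetical helper name, not from the original fragment.
short_words() {
    # Turn every run of non-alphanumeric characters into a single newline,
    # so each word ends up on its own line.
    tr -cs '[:alnum:]' '\n' |
    # With one word per line, the line anchors ^ and $ bracket the whole
    # word, so this keeps only words of one to four characters.
    grep '^.\{1,4\}$'
}

printf 'The quick brown fox jumps over a lazy dog\n' | short_words
```

On that sample line this should print The, fox, over, a, lazy and dog,
one per line, and drop the five-letter words.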


I have a script that was inspired by a Trivial Pursuit question[1].
It counts the occurrences of a word in a text of arbitrary size.

The core of that script is:

cat -sv "$1" | tr -s '[:punct:][:blank:]' '\012' |
  tr '[:upper:]' '[:lower:]' | grep -i "$2" | sort -fdb | uniq -c

where $1 is the text file and $2 is the string you want to find.
Because grep matches substrings, this does not differentiate between
corn, cornflower and scorn.
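
To see the overcounting concretely, here is the tail end of that
pipeline run on a hand-made word list. The sample words and the helper
name count_matching are made up for illustration.

```shell
# count_matching is a hypothetical name wrapping the last three stages
# of the pipeline: filter, sort, count.
count_matching() {
    grep -i "$1" | sort -fdb | uniq -c
}

# Input is already one lowercase word per line, as the tr stages leave it.
printf 'corn\ncornflower\nscorn\ncorn\n' | count_matching corn
```

This should report a count of 2 for corn and 1 each for cornflower and
scorn, since grep accepts any line that merely contains the string.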

To count only the exact word, anchor the pattern to the whole line:

cat -sv "$1" | tr -s '[:punct:][:blank:][:space:]' '\012' |
  tr '[:upper:]' '[:lower:]' | grep -i "^$2\$" | sort -fdb | uniq -c
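
An alternative to hand-written anchors is grep -x, which only accepts a
match that spans the entire line; adding -F treats the search term as a
literal string rather than a regex. This is a swapped-in variant, not
the script as posted, and the name count_word and the sample text are
made up for illustration.

```shell
# count_word is a hypothetical wrapper: $1 = text file, $2 = word to count.
count_word() {
    # One word per line: punctuation and whitespace runs become a newline.
    tr -s '[:punct:][:space:]' '\n' < "$1" |
    # Fold everything to lowercase so Corn and corn count together.
    tr '[:upper:]' '[:lower:]' |
    # -x: whole-line match only, -F: literal string, -i: ignore case.
    grep -ixF -- "$2" |
    sort | uniq -c
}

sample=$(mktemp)
printf 'Corn, cornflower and scorn.\nMore corn!\n' > "$sample"
count_word "$sample" corn
rm -f "$sample"
```

On the sample text this should count corn twice and ignore cornflower
and scorn.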

Clyde

[1] The question was "What is the most mentioned grain in the King
James Bible?" - the answer is corn.

-- 
We are what we think. All that we are arises with our thoughts. With
our thoughts, we make the world.
-Buddha
Jay Leno  - "The reason there are two senators for each state is so
that one can be the designated driver."



