[conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Thu Feb 18 05:45:16 PST 2021


POSIX specifies no -E option for sed:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
However, both GNU and BSD sed support -E, which selects ERE rather than
BRE.
That also changes the RE syntax, notably
\{\} and \(\) in BRE vs. {} and () in ERE.
GNU's sed and grep both honor the POSIXLY_CORRECT environment variable,
but of the two, only GNU's sed supports the --posix option.
https://www.gnu.org/software/sed/manual/sed.html
https://www.gnu.org/software/grep/manual/grep.html
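
A minimal sketch of that syntax difference, assuming a sed that accepts
-E (GNU or BSD, per the above). The input 'aab' and the variable names
bre and ere are made up for illustration; no back-references are used,
so this only shows the change in escaping:

```shell
# Same pattern, once as BRE (sed's default) and once as ERE (-E).
# Only the escaping of the group and the interval differs.
bre=$(echo 'aab' | sed -n 's/\(a\)\{2\}b/[&]/p')
ere=$(echo 'aab' | sed -E -n 's/(a){2}b/[&]/p')
printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
```

Both spellings should wrap the whole line in brackets.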

With -E:
GNU/Linux & BSD both give the same expected result (no output, i.e. no
match) on these:
$ echo 'xxo o x  ' | sed -E -ne '/^(...){0,2}([ox])\2\2/{s/.*/matched/p;}'
$ echo 'xxo o x  ' | grep -E -e '^(...){0,2}([ox])\2\2'
$ echo 'xxo o x  ' | sed -E -ne 's/^(...){0,2}([ox])\2\2/1:\12:\2\&:&/p'
$ echo 'xox' | grep -E '^([xo])\1\1$'
$

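Working to find a minimal case of the unexpected BRE sed behavior on BSD
("vs." the expected behavior on GNU), there seems to be some unexpected
interaction between a subexpression \(\) and an interval
(repeat/duplication count) expression \{m,n\} \{m\} \{m,\} \{,m\}
(the \{,m\} form is a GNU extension, not POSIX).
In the runs below, the only cases that fail to match on BSD are those
where \{0,\} is followed by a subexpression that is then back-referenced
with \1; the same patterns with * or \{2,\} in place of \{0,\}, or
without the back-reference, match.
BSD: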
$ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
$ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
YYxx
$ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
YYxxz
$ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
YYxxz
$ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
$ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
$ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
YYxyxy
$

GNU/Linux (all as expected):
$ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
YYxx
$ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
YYxxz
$ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
YYxxz
$ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
$ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
YYxyxy
$
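
To make such comparisons easier to repeat, here is a minimal sketch that
runs a few of the BREs above against one input and prints a line per
pattern, so the output from a GNU and a BSD system can be diffed.
bre_check is a hypothetical helper name, not part of sed or grep, and it
assumes the RE contains no "/":

```shell
# bre_check INPUT RE  (hypothetical helper for illustration)
# Prints "match" or "no match" for one BRE against one input line.
# The RE must not contain "/", since "/" is used as the sed delimiter.
bre_check() {
    if printf '%s\n' "$1" | sed -n "/$2/p" | grep -q .; then
        echo 'match'
    else
        echo 'no match'
    fi
}

input='YYxyxy'
for re in 'Y*\(xy\)\1' 'Y\{0,\}\(xy\)\1' 'Y\{2,\}\(xy\)\1' 'Y\{0,\}\(xy\)xy'
do
    printf '%-9s %s\n' "$(bre_check "$input" "$re")" "$re"
done
```

Going by the transcripts above, GNU should report "match" on all four
lines and BSD should report "no match" only for the \{0,\} pattern with
the back-reference. Since Y* and Y\{0,\} should mean the same thing,
spelling the interval as * looks like a usable workaround on BSD.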

POSIX BRE & ERE:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
POSIX doesn't specify back-references in EREs, so implementations differ
there, but that's a separate (non-)issue.
The simplest minimized cases show it (on BSD, \2 in an ERE apparently
just matches a literal 2), e.g.:
GNU/Linux:
$ echo 'xx' | grep -E -e '(x)\1'
xx
$ echo 'x2' | grep -E -e '(x)\2'
grep: Invalid back reference
$
BSD:
$ echo 'xx' | grep -E -e '(x)\1'
$ echo 'x2' | grep -E -e '(x)\2'
x2
$
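
Given that difference, a script that needs a back-reference can either
stay with BRE, where POSIX does define \1, or probe grep -E first.
A minimal sketch, under the assumption that a simple BRE back-reference
behaves the same on both systems; the variable names are made up for
illustration:

```shell
# BRE back-reference: POSIX defines \1 here, unlike in ERE.
bre_backref=$(echo 'xx' | grep -e '\(x\)\1')
echo "BRE result: $bre_backref"

# Probe: does this grep treat \1 in an ERE as a back-reference?
# Any error output (e.g. "Invalid back reference") is discarded.
if echo 'xx' | grep -E -q -e '(x)\1' 2>/dev/null; then
    ere_backrefs=yes
else
    ere_backrefs=no
fi
echo "ERE back-references: $ere_backrefs"
```

Going by the transcripts above, the probe should say yes on GNU and no
on BSD.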

> From: "Ivan Sergio Borgonovo" <mail at webthatworks.it>
> Subject: Re: [conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
> Date: Thu, 18 Feb 2021 11:06:08 +0100

> Is there an -E option on BSD?
> or did you try to use --posix on Linux?
>
> On 2/18/21 10:17, Michael Paoli wrote:
>> Ah, drats ... I'd expect I ought get the same from both of these, ...
>> but no.  Not quite sure if I'm not interpreting things exactly right,
>> or there's sufficient wiggle room with BRE on this in POSIX, or there
>> might possibly actually be a bug in the BSD case on this.
>> But, with both grep and sed, seeing quite different results on these,
>> notably BRE not matching, as in fact expected, yet on BSD matching
>> where not expecting it to match:
>>
>> GNU/Linux:
>> $ echo 'xxo o x  ' | sed -ne '/^\(...\)\{0,2\}\([ox]\)\2\2/{s/.*/matched/p;}'
>> $ echo 'xxo o x  ' | grep -e '^\(...\)\{0,2\}\([ox]\)\2\2'
>> $ echo 'xxo o x  ' | sed -ne 's/^\(...\)\{0,2\}\([ox]\)\2\2/1:\12:\2\&:&/p'
>> $
>>
>> BSD:
>> $ echo 'xxo o x  ' | sed -ne '/^\(...\)\{0,2\}\([ox]\)\2\2/{s/.*/matched/p;}'
>> matched
>> $ echo 'xxo o x  ' | grep -e '^\(...\)\{0,2\}\([ox]\)\2\2'
>> xxo o x
>> $ echo 'xxo o x  ' | sed -ne 's/^\(...\)\{0,2\}\([ox]\)\2\2/1:\12:\2\&:&/p'
>> 1:xxo2:&:xxo o x
>> $
>>
>> How can \([xo]\) matching zero characters be correct?
>> That seems broken to me.
>>
>> But these work the same on both, as expected:
>> $ echo 'xox' | grep '^\([xo]\)\1\1$'
>> $ echo 'xxx' | grep '^\([xo]\)\1\1$'
>> xxx
>> $ echo 'xox' | sed -ne '/^\([xo]\)\1\1$/p'
>> $ echo 'xxx' | sed -ne '/^\([xo]\)\1\1$/p'
>> xxx
>> $



