[conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
Michael Paoli
Michael.Paoli at cal.berkeley.edu
Thu Feb 18 11:24:26 PST 2021
Yeah, as far as I can tell, this looks to be an issue/bug with
BSD's BRE handling. It behaves exactly as expected on Solaris 11:
$ uname -s -r -m
SunOS 5.11 i86pc
$ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
YYxx
$ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
YYxx
$ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
YYxxz
$ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
YYxxz
$ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
$ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
YYxyxy
$ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
YYxyxy
$
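To rerun the same minimal cases on some other system without retyping
them, something like the following could do. It's only a rough sketch:
try() is a made-up helper name, and it just runs whatever sed is first
in PATH and reports whether the substitution printed anything.

```shell
#!/bin/sh
# Sketch: rerun the minimal BRE cases against whatever sed is first in PATH.
# try() is a hypothetical helper name, not part of sed or any other tool.
try() {
    # $1 = input line, $2 = BRE to test (must not contain '/')
    out=$(printf '%s\n' "$1" | sed -n "s/$2/&/p")
    if [ -n "$out" ]; then
        printf 'match     %s\n' "$2"
    else
        printf 'NO MATCH  %s\n' "$2"
    fi
}

try 'YYxx'   'Y*\(x\)\1'
try 'YYxx'   'Y\{0,\}\(x\)\1'
try 'YYxx'   'Y\{2,\}\(x\)\1'
try 'YYxyxy' 'Y\{0,\}\(xy\)\1'
try 'YYxyxy' 'Y\{0,\}\(xy\)xy'
try 'YYxy'   'Y\{2,\}\(xy\)\1'   # should never match: 'xy' is not repeated
```

Going by the transcripts in this thread, Solaris 11 and GNU/Linux would
report a match for everything except the last line, whereas OpenBSD 6.7
would also report NO MATCH for the two cases that combine \{0,\} with \1.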
I'd check Seventh Edition Unix (1979), but that's too old in this
case, as it doesn't support the \{m,n\}, etc. interval notation; that
came sometime later.
So, maybe I found my first BSD bug.
The BSD in question:
$ uname -s -r -v -p
OpenBSD 6.7 GENERIC#7 amd64
$
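Since \{0,\} means "zero or more", same as *, the two spellings ought
to give identical results on any input. A quick differential check along
those lines might look like this (again just a sketch; differ() is a
made-up helper name, and the [&] in the replacement is only there to
make the matched span visible):

```shell
#!/bin/sh
# Sketch: flag inputs where two BREs that should be equivalent
# produce different sed output.
# differ() is a hypothetical helper name used only for this illustration.
differ() {
    # $1 = input line, $2 and $3 = BREs expected to behave identically
    a=$(printf '%s\n' "$1" | sed -n "s/$2/[&]/p")
    b=$(printf '%s\n' "$1" | sed -n "s/$3/[&]/p")
    if [ "$a" = "$b" ]; then
        printf 'same:   %s  vs  %s\n' "$2" "$3"
    else
        printf 'DIFFER: %s  vs  %s\n' "$2" "$3"
    fi
}

differ 'YYxx'   'Y*\(x\)\1'   'Y\{0,\}\(x\)\1'
differ 'YYxyxy' 'Y*\(xy\)\1'  'Y\{0,\}\(xy\)\1'
```

Any DIFFER line points at the sed (or its regex library) rather than at
the pattern. Where that happens, spelling the quantifier as * is a
possible workaround for the \{0,\} case, though it doesn't help with
bounded intervals such as \{0,2\}.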
> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
> Subject: Re: [conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
> Date: Thu, 18 Feb 2021 05:45:16 -0800
> POSIX gives no -E option to sed:
> https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
> however both GNU and BSD sed support -E option to use ERE rather than
> BRE.
> That does also change the REs syntax, notably
> \{\} and \(\) vs. {} and ().
> GNU's sed and grep support POSIXLY_CORRECT, but of them, only
> GNU's sed supports the --posix option.
> https://www.gnu.org/software/sed/manual/sed.html
> https://www.gnu.org/software/grep/manual/grep.html
>
> With -E:
> GNU/Linux & BSD both give same expected results on these:
> $ echo 'xxo o x ' | sed -E -ne '/^(...){0,2}([ox])\2\2/{s/.*/matched/p;}'
> $ echo 'xxo o x ' | grep -E -e '^(...){0,2}([ox])\2\2'
> $ echo 'xxo o x ' | sed -E -ne 's/^(...){0,2}([ox])\2\2/1:\12:\2\&:&/p'
> $ echo 'xox' | grep -E '^([xo])\1\1$'
> $
>
> Working to find a minimal case of the unexpected BRE sed behavior on BSD
> ("vs." the expected on GNU),
> it seems we have some unexpected interaction between use of a
> back-referenced subexpression \(\) ... \1
> and
> repeat/duplication count specifier \{m,n\} \{m\} \{m,\} \{,m\}
> BSD:
> $ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
> YYxx
> $ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
> YYxxz
> $ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
> YYxxz
> $ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
> $ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
> YYxyxy
> $
>
> GNU/Linux (all as expected):
> $ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
> YYxx
> $ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
> YYxxz
> $ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
> YYxxz
> $ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
> YYxyxy
> $
>
> POSIX BRE & ERE
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
> doesn't specify back-references in EREs, so we have an implementation
> difference with that, but that's a separate (non-)issue.
> We can show that with simplest minimized cases, e.g.:
> GNU/Linux:
> $ echo 'xx' | grep -E -e '(x)\1'
> xx
> $ echo 'x2' | grep -E -e '(x)\2'
> grep: Invalid back reference
> $
> BSD:
> $ echo 'xx' | grep -E -e '(x)\1'
> $ echo 'x2' | grep -E -e '(x)\2'
> x2
> $
>
>> From: "Ivan Sergio Borgonovo" <mail at webthatworks.it>
>> Subject: Re: [conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
>> Date: Thu, 18 Feb 2021 11:06:08 +0100
>
>> Is there an -E option on BSD?
>> or did you try to use --posix on Linux?
>>
>> On 2/18/21 10:17, Michael Paoli wrote:
>>> Ah, drats ... I'd expect I ought to get the same from both of these, ...
>>> but no. Not quite sure if I'm not interpreting things exactly right,
>>> or there's sufficient wiggle room with BRE on this in POSIX, or there
>>> might possibly actually be a bug in the BSD case on this.
>>> But, with both grep and sed, I'm seeing quite different results on these:
>>> on GNU/Linux the BRE doesn't match, as in fact expected, yet on BSD it
>>> matches where I'm not expecting it to match:
>>>
>>> GNU/Linux:
>>> $ echo 'xxo o x ' | sed -ne '/^\(...\)\{0,2\}\([ox]\)\2\2/{s/.*/matched/p;}'
>>> $ echo 'xxo o x ' | grep -e '^\(...\)\{0,2\}\([ox]\)\2\2'
>>> $ echo 'xxo o x ' | sed -ne 's/^\(...\)\{0,2\}\([ox]\)\2\2/1:\12:\2\&:&/p'
>>> $
>>>
>>> BSD:
>>> $ echo 'xxo o x ' | sed -ne '/^\(...\)\{0,2\}\([ox]\)\2\2/{s/.*/matched/p;}'
>>> matched
>>> $ echo 'xxo o x ' | grep -e '^\(...\)\{0,2\}\([ox]\)\2\2'
>>> xxo o x
>>> $ echo 'xxo o x ' | sed -ne 's/^\(...\)\{0,2\}\([ox]\)\2\2/1:\12:\2\&:&/p'
>>> 1:xxo2:&:xxo o x
>>> $
>>>
>>> How can \([xo]\) matching zero characters be correct?
>>> That seems broken to me.
>>>
>>> But these work the same on both, as expected:
>>> $ echo 'xox' | grep '^\([xo]\)\1\1$'
>>> $ echo 'xxx' | grep '^\([xo]\)\1\1$'
>>> xxx
>>> $ echo 'xox' | sed -ne '/^\([xo]\)\1\1$/p'
>>> $ echo 'xxx' | sed -ne '/^\([xo]\)\1\1$/p'
>>> xxx
>>> $