[conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Tue Feb 23 06:42:16 PST 2021


And bug reported:
https://marc.info/?l=openbsd-bugs&m=161408631731043
... shall see where it goes.
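
For anyone wanting to check their own sed against this, here is a minimal sketch that reruns the smallest failing pair from the transcripts quoted below. The helper name check_bre is made up for illustration (it is not part of sed or POSIX), and it assumes the BRE contains no "/" since that is used as the delimiter.

```shell
#!/bin/sh
# check_bre: hypothetical helper, not part of sed or POSIX.
#   $1 = input line
#   $2 = BRE to try (assumed not to contain "/")
# Prints "match" if sed prints the line, "no match" otherwise.
check_bre() {
    if [ -n "$(printf '%s\n' "$1" | sed -n "/$2/p")" ]; then
        echo 'match'
    else
        echo 'no match'
    fi
}

# Y* and Y\{0,\} both mean "zero or more Y", so these two
# lines should report the same result on a correct sed.
printf '%s\n' "star:     $(check_bre 'YYxx' 'Y*\(x\)\1')"
printf '%s\n' "interval: $(check_bre 'YYxx' 'Y\{0,\}\(x\)\1')"
```

Going by the quoted results, GNU and Solaris 11 sed should report a match on both lines, whereas the OpenBSD 6.7 sed would be expected to report no match on the second.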

> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
> Subject: Re: [conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
> Date: Thu, 18 Feb 2021 11:24:26 -0800

> Yeah, as far as I can tell, this looks to be an issue/bug with
> BSD's BRE handling.  It behaves exactly as expected with Solaris 11:
> $ uname -s -r -m
> SunOS 5.11 i86pc
> $ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
> YYxx
> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
> YYxx
> $ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
> YYxxz
> $ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
> YYxxz
> $ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
> YYxyxy
> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
> YYxyxy
> $
>
> I'd check Seventh Edition Unix (1979) - but that's too old in this
> case, as it doesn't support the \{m,n\}, etc. quantifier notation - that
> came sometime later.
>
> So, maybe I found my first BSD bug.
>
> The BSD in question:
> $ uname -s -r -v -p
> OpenBSD 6.7 GENERIC#7 amd64
> $
>
>> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
>> Subject: Re: [conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
>> Date: Thu, 18 Feb 2021 05:45:16 -0800
>
>> POSIX gives no -E option to sed:
>> https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
>> however both GNU and BSD sed support -E option to use ERE rather than
>> BRE.
>> That does also change the REs syntax, notably
>> \{\} and \(\) vs. {} and ().
>> GNU's sed and grep support POSIXLY_CORRECT, but of them, only
>> GNU's sed supports the --posix option.
>> https://www.gnu.org/software/sed/manual/sed.html
>> https://www.gnu.org/software/grep/manual/grep.html
>>
>> With -E:
>> GNU/Linux & BSD both give same expected results on these:
>> $ echo 'xxo o x  ' | sed -E -ne '/^(...){0,2}([ox])\2\2/{s/.*/matched/p;}'
>> $ echo 'xxo o x  ' | grep -E -e '^(...){0,2}([ox])\2\2'
>> $ echo 'xxo o x  ' | sed -E -ne 's/^(...){0,2}([ox])\2\2/1:\12:\2\&:&/p'
>> $ echo 'xox' | grep -E '^([xo])\1\1$'
>> $
>>
>> Working to find minimal case of unexpected BRE sed behavior on BSD
>> ("vs." the expected on GNU),
>> seems we have some unexpected interaction between use of
>> subexpression \(\)
>> and
>> repeat/duplication count specifier \{m,n\} \{m\} \{m,\} \{,m\}
>> BSD:
>> $ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
>> YYxx
>> $ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
>> YYxxz
>> $ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
>> YYxxz
>> $ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
>> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
>> $ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
>> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
>> YYxyxy
>> $
>>
>> GNU/Linux (all as expected):
>> $ echo 'YYxx' | sed -ne 's/Y*\(x\)\1/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)\1/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}\(x\)\1/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{0,\}\(x\)/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{1,\}/&/p'
>> YYxx
>> $ echo 'YYxx' | sed -ne 's/Y\{2,\}x\{0,\}/&/p'
>> YYxx
>> $ echo 'YYxxz' | sed -ne 's/Y\{2,\}x\{0,\}z/&/p'
>> YYxxz
>> $ echo 'YYxxz' | sed -ne 's/Y\{0,\}x\{0,\}z/&/p'
>> YYxxz
>> $ echo 'YYxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
>> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y*\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{2,\}\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)\1/&/p'
>> YYxyxy
>> $ echo 'YYxyxy' | sed -ne 's/Y\{0,\}\(xy\)xy/&/p'
>> YYxyxy
>> $
>>
>> POSIX BRE & ERE
>> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>> doesn't specify back-references in EREs, so we have an implementation
>> difference with that, but that's a separate (non-)issue.
>> We can show that with simplest minimized cases, e.g.:
>> GNU/Linux:
>> $ echo 'xx' | grep -E -e '(x)\1'
>> xx
>> $ echo 'x2' | grep -E -e '(x)\2'
>> grep: Invalid back reference
>> $
>> BSD:
>> $ echo 'xx' | grep -E -e '(x)\1'
>> $ echo 'x2' | grep -E -e '(x)\2'
>> x2
>> $
>>
>>> From: "Ivan Sergio Borgonovo" <mail at webthatworks.it>
>>> Subject: Re: [conspire] Basic Regular Expressions (GNU/Linux "vs." BSD)
>>> Date: Thu, 18 Feb 2021 11:06:08 +0100
>>
>>> Is there an -E option on BSD?
>>> or did you try to use --posix on Linux?
>>>
>>> On 2/18/21 10:17, Michael Paoli wrote:
>>>> Ah, drats ... I'd expect to get the same from both of these, ...
>>>> but no.  Not quite sure whether I'm not interpreting things exactly
>>>> right, or there's sufficient wiggle room with BRE on this in POSIX,
>>>> or there might actually be a bug in the BSD case on this.
>>>> But, with both grep and sed, I'm seeing quite different results on
>>>> these: on GNU/Linux the BRE doesn't match, as expected, yet on BSD
>>>> it matches where I'm not expecting it to match:
>>>>
>>>> GNU/Linux:
>>>> $ echo 'xxo o x  ' | sed -ne '/^\(...\)\{0,2\}\([ox]\)\2\2/{s/.*/matched/p;}'
>>>> $ echo 'xxo o x  ' | grep -e '^\(...\)\{0,2\}\([ox]\)\2\2'
>>>> $ echo 'xxo o x  ' | sed -ne 's/^\(...\)\{0,2\}\([ox]\)\2\2/1:\12:\2\&:&/p'
>>>> $
>>>>
>>>> BSD:
>>>> $ echo 'xxo o x  ' | sed -ne '/^\(...\)\{0,2\}\([ox]\)\2\2/{s/.*/matched/p;}'
>>>> matched
>>>> $ echo 'xxo o x  ' | grep -e '^\(...\)\{0,2\}\([ox]\)\2\2'
>>>> xxo o x
>>>> $ echo 'xxo o x  ' | sed -ne 's/^\(...\)\{0,2\}\([ox]\)\2\2/1:\12:\2\&:&/p'
>>>> 1:xxo2:&:xxo o x
>>>> $
>>>>
>>>> How can \([ox]\) matching zero characters be correct?
>>>> That seems broken to me.
>>>>
>>>> But these work the same on both, as expected:
>>>> $ echo 'xox' | grep '^\([xo]\)\1\1$'
>>>> $ echo 'xxx' | grep '^\([xo]\)\1\1$'
>>>> xxx
>>>> $ echo 'xox' | sed -ne '/^\([xo]\)\1\1$/p'
>>>> $ echo 'xxx' | sed -ne '/^\([xo]\)\1\1$/p'
>>>> xxx
>>>> $
>