[conspire] Piping, redirection and shellscipts: 3/5/2025 7pm Eastern Standard time
Michael Paoli
michael.paoli at berkeley.edu
Tue Mar 11 18:57:10 PDT 2025
And a bit more from the meeting, with some examples:
On Mon, Mar 3, 2025 at 9:07 AM Michael Paoli <michael.paoli at berkeley.edu> wrote:
> $ comm -23 <(sort -u < file1) <(sort -u < file2)
> That, among multiple possible ways, will give exactly once, each
> unique line present in file1 that isn't present in file2.
So, say we've got two files, color, and fruit,
and they're not deduplicated nor in any particular order:
$ more * | cat
::::::::::::::
color
::::::::::::::
orange
green
green
orange
::::::::::::::
fruit
::::::::::::::
orange
orange
apple
apple
$
If more has multiple file arguments, and its stdout isn't a tty device,
it simply outputs the contents of the files, each preceded by a header
giving the file name.
It has to be good for something that less won't do, eh? :-)
(And why do I find that behavior undocumented on the current man page,
whereas I well recall reading of it decades ago?)
First of all, we may use - for one of comm's non-option arguments,
in which case it will use stdin for that, rather than treating - as a
literal file name (a fairly common convention for *nix commands, where -
may be used to signify stdin or stdout, as relevant, rather than a
literal file name).
$ sort -u < color | comm -23 - <(sort -u < fruit)
green
$
That then allows us to get rid of one of the process substitutions
by bash, and we again use sort -u to sort and deduplicate (equivalent
to sort | uniq, but without the overhead of a whole additional process
and pipe and all the lines being written and read yet again).
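A small sketch of that equivalence, using made-up sample lines, and
assuming plain whole-line comparison (no sort key options, where sort -u
and sort | uniq can differ):

```shell
# Sample input: unsorted, with duplicates (same lines as the color file).
lines=$(printf '%s\n' orange green green orange)

# One process: sort and deduplicate in a single step.
with_sort_u=$(printf '%s\n' "$lines" | sort -u)

# Two processes and an extra pipe: sort, then drop adjacent duplicates.
with_sort_uniq=$(printf '%s\n' "$lines" | sort | uniq)

printf 'sort -u:\n%s\n' "$with_sort_u"
printf 'sort | uniq:\n%s\n' "$with_sort_uniq"
```

Both forms should print green, then orange.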
We can even do it without comm at all. But first, a bit about comm: by
default it produces 3 columns of output. It expects its inputs to already
be sorted, and typically one would also have those inputs deduplicated
(but that's optional, depending on exactly what one wants). The options
-1, -2, and -3 (which can be bundled) respectively suppress comm's
default 1st, 2nd, and 3rd columns of output, which are respectively
lines only in the 1st input, lines only in the 2nd input, and lines
common to both.
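To illustrate those three columns, here's a rough sketch that recreates
the color and fruit data in a throwaway directory (the mktemp directory
and the *.sorted file names are just for this illustration) and selects
each column in turn by suppressing the other two. It uses temporary
files only so that it runs in a plain POSIX sh without process
substitution:

```shell
# Throwaway directory, so nothing in the current directory is touched.
tmp=$(mktemp -d) || exit 1
printf '%s\n' orange green green orange > "$tmp/color"
printf '%s\n' orange orange apple apple > "$tmp/fruit"

# comm expects sorted input; deduplicate as well while we're at it.
sort -u < "$tmp/color" > "$tmp/color.sorted"
sort -u < "$tmp/fruit" > "$tmp/fruit.sorted"

# Suppress two columns at a time to keep just the remaining one.
only_color=$(comm -23 "$tmp/color.sorted" "$tmp/fruit.sorted")  # column 1
only_fruit=$(comm -13 "$tmp/color.sorted" "$tmp/fruit.sorted")  # column 2
in_both=$(comm -12 "$tmp/color.sorted" "$tmp/fruit.sorted")     # column 3

printf 'only in color: %s\n' "$only_color"
printf 'only in fruit: %s\n' "$only_fruit"
printf 'in both:       %s\n' "$in_both"

rm -rf "$tmp"
```

With that data, the three results should be green, apple, and orange
respectively.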
Well, we can do likewise, and without comm, and even without bash or its
process substitution, and likewise without any temporary files, to still, e.g.
get the lines unique to the first file:
$ { sort -u < color; sort -u < fruit | sed p; } | sort | uniq -u
green
$
That can be quite handy if one is in a quite limited environment
that doesn't even have comm available, but does have sort,
sed (or awk) and uniq available, or one lacks bash or
process substitution capability. Instead of sed p, we
could have used awk '{print; print; }'
So, our little example above sorts and deduplicates each of the two
files, but then, for the second, duplicates every line. We then take
those combined outputs, sort them, and run them through uniq -u, which
outputs only lines that aren't repeated. Every line from the second file
occurs at least twice, so it can never be output; a line from the first
file occurs exactly once in the combined stream only if it isn't also
present in the second file. So we end up with only the lines
(deduplicated) that are unique to the first file.
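To make that a bit more visible, here's a sketch that captures the
combined, sorted stream just before uniq -u sees it, and also tries the
awk variant in place of sed p. The throwaway directory and the variable
names are only for this illustration:

```shell
# Recreate the sample files in a throwaway directory.
tmp=$(mktemp -d) || exit 1
printf '%s\n' orange green green orange > "$tmp/color"
printf '%s\n' orange orange apple apple > "$tmp/fruit"

# The stream as it looks going into uniq -u: first file's lines once,
# second file's lines twice each, all sorted together.
combined=$( { sort -u < "$tmp/color"; sort -u < "$tmp/fruit" | sed p; } | sort )

# uniq -u keeps only the lines that occur exactly once.
only_color=$(printf '%s\n' "$combined" | uniq -u)

# Same pipeline, with awk doing the line doubling instead of sed p.
only_color_awk=$( { sort -u < "$tmp/color"
                    sort -u < "$tmp/fruit" | awk '{print; print; }'
                  } | sort | uniq -u )

printf 'before uniq -u:\n%s\n' "$combined"
printf 'sed variant: %s\n' "$only_color"
printf 'awk variant: %s\n' "$only_color_awk"

rm -rf "$tmp"
```

The intermediate stream should be apple, apple, green, orange, orange,
orange: green is the only line that occurs exactly once, so it's the
only one uniq -u lets through.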
Note also the use of { command ...[; command ...] ...; }
Curly braces used that way in the shell logically group one or more
commands, not to be confused with (), which instead runs the commands in
a subshell. So, {} ends up being kind of like a lighter-weight version
of that: no subshell, no separate environment created, etc. E.g. things
such as shell variables/parameters, environment, and current working
directory, when changed within {}, also take effect outside of {},
because it's not a subshell, just a logical grouping.
Also, the syntax is a bit different too. () are directly recognized by
the shell as special characters, so they don't need whitespace or other
delimiters around them, whereas { and } need to be delimited for the
shell to recognize them as separate words; only then do they have that
special grouping meaning. For that same reason, the last command needs
to end with ; or newline before the closing } for the syntax to be
correct.
So, one may see {} used for grouping like that in contexts such as the
multiple commands inside a function definition, but such grouping isn't
limited to functions, and may be used more generally.
https://www.mpaoli.net/~michael/unix/sh/syntax_with_examples/curly_braces_syntax
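A quick sketch of the difference between the two (the variable names are
made up for this illustration): changes made inside () stay in the
subshell, while changes made inside {} carry over to the current shell.

```shell
x=original

# Subshell: the assignment is lost when the subshell exits.
(x=changed_in_parens)
after_parens=$x

# Group: same shell, so the assignment sticks.
# Note the spaces around the braces and the ; before the closing }.
{ x=changed_in_braces; }
after_braces=$x

# Same idea with the current working directory.
start_dir=$PWD
(cd /)
dir_after_parens=$PWD

printf 'after ( ): x=%s\n' "$after_parens"
printf 'after { }: x=%s\n' "$after_braces"
printf 'cwd unchanged after (cd /): %s\n' "$dir_after_parens"
```

The first printf should show x=original and the second
x=changed_in_braces; the working directory after (cd /) should still be
wherever the script started.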