[sf-lug] gpg, dd, find, tar, cpio, bugs, pax, ... oh my! :-)

Fri Nov 13 05:02:58 PST 2015

Sorry you had a less than stellar experience with pax(1).  Personally I
quite like pax(1). ...

> From: "Daniel Gimpelevich" <daniel at gimpelevich.san-francisco.ca.us>
> Subject: Re: [sf-lug] gpg, dd, find, tar, cpio, bugs, pax, ... oh my!  :-)
> Date: Wed, 21 Oct 2015 23:46:56 -0700

> On Mon, 2015-02-02 at 20:15 -0800, Michael Paoli wrote:
>> For another example of GNU bloat (and bugs!), have a read through this
>> earlier posting of mine:
>> [buug] Why I tend to prefer pax(1) over tar(1) and cpio(1) ...
>> http://buug.org/pipermail/buug/2012-October/003944.html
>
> Welp, I just went back to that message so that I could issue the
> following commands in favor of a tar command:
>
> cd /mnt/a
> sudo find . -depth -print0 | sudo pax -w -0d -x sv4cpio | xz > /mnt/b/backup.cpio.xz
>
> I am now kicking myself because I come back 5 days later to see a couple
> of instances of:
>
> pax: Unable to access ./foo: Value too large for defined data type
>
> Both files were Thunderbird IMAP mbox files, easily redownloadable, each
> in excess of 2GiB. Maybe I should've stuck with tar for snapshotting the
> increasingly failing hard disk in the s.o.'s laptop. I'm now looking at
> B&H Photo's page on the Samsung 850 Evo to take its place.

Well, many backup formats have *some* limitation(s).  The manpage
(e.g. pax(1)) is your friend ... and so is stderr (Unix tends to be
quiet when it works, and to complain when it doesn't - Linux is somewhat
similar, though sometimes more chatty - but not nearly as atrocious as
Microsoft Windows).  Exit/return values are also useful - 0 is the
convention on Unix/Linux for success/"true" for commands and within the
shell, and in many (but definitely not all - e.g. Perl) other areas
within Linux/Unix/BSD.  So a non-zero return when archiving will also
generally indicate some type of error or failure ... however, some such
errors are to be expected if one is backing up a live (mounted rw and
being actively written) filesystem (e.g. a file is in a directory when
the directory is read, but no longer exists by the time the archive
program goes to open it to back it up).
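
One catch with a pipeline like the find | pax | xz one quoted above: the
shell reports only the exit status of the last command, so a complaint
from the archiver in the middle is easy to miss.  A minimal sketch of one
way to keep both the archiver's stderr and its exit status, using plain
POSIX sh.  Note that fake_archiver is a made-up stand-in (not pax), and
cat stands in for the compressor:

```shell
# Hypothetical stand-in for an archiver that emits data but also complains.
fake_archiver() {
    echo "archive data"
    echo "fake_archiver: Unable to access ./foo" >&2
    return 1
}

log=$(mktemp)      # collects the archiver's stderr
rcfile=$(mktemp)   # collects the archiver's own exit status

# The left side of the pipe records the archiver's status in a file,
# because the pipeline's status is only that of the last command.
{ fake_archiver 2>"$log"; echo $? >"$rcfile"; } | cat >/dev/null
pipeline_rc=$?
archiver_rc=$(cat "$rcfile")
warnings=$(wc -l <"$log" | tr -d ' ')

echo "pipeline status: $pipeline_rc"
echo "archiver status: $archiver_rc"
echo "lines on stderr: $warnings"
rm -f "$log" "$rcfile"
```

Here the pipeline as a whole looks successful while the archiver's own
status and the stderr log show otherwise.  With a real backup one would
keep the log file around and read it before trusting the archive.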

Note also that those format limitations (if any - or any for practical
purposes) may vary at least somewhat by distribution /
operating system flavor / implementation.  As a practical matter, I've
found most any such limitations are potentially a consideration when
creating an archive with pax - or really with any archiving software -
though I've never hit an issue with extraction using pax
(pax automatically determines the format upon extraction - which is
quite convenient ... though cpio and tar will generally do that too).
The only issues I've ever encountered extracting with pax were issues
not with pax, but with the operating system's filesystem type (e.g.
filesystem limitations on FAT or vFAT or NTFS flavors of filesystems, or
older filesystem types that don't support files >=2GiB).  I know that,
e.g. under Linux, SunOS, and Cygwin under Microsoft Windows, what works
best and quite reliably for pax backup/archive format does vary - but
once I determine one that works well, it works exceedingly reliably.
Let's see ... on Linux, with pax, I've been using ...
pax -w -0d -x sv4cpio
without issue.  I've also found pax is quite good
about complaining to stderr if it encounters any such limitations or
issues - either backing up or restoring (e.g. as far as I'm aware, it
won't do files of type socket - but that may or may not be a limitation
of pax itself).

So, ... let's test a wee bit, see if I can bump into some limitations
... peeking at pax(1), formats - or more specifically format options I
see for the formats it supports:
ar bcpio cpio sv4cpio sv4crc tar ustar
Let's try some simple exercises:

$ >foo echo foo
$ (for x in '' ar bcpio cpio sv4cpio sv4crc tar ustar; do
> echo foo | pax ${x:+-x} $x -w > pax.$x; done)
$ file pax*
pax.:        POSIX tar archive
pax.ar:      current ar archive
pax.bcpio:   byte-swapped cpio archive
pax.cpio:    ASCII cpio archive (pre-SVR4 or odc)
pax.sv4cpio: ASCII cpio archive (SVR4 with no CRC)
pax.sv4crc:  ASCII cpio archive (SVR4 with CRC)
pax.tar:     tar archive
pax.ustar:   POSIX tar archive
$

So far no complaints, so we're probably fine.  Notice also that file(1)
is able to distinguish each of the available formats, and it appears our
default format is ustar.
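
In case the ${x:+-x} $x bit in the loop above looks cryptic: ${x:+-x}
expands to -x only when x is set and non-empty, and the unquoted $x
disappears entirely when empty, so the first pass runs pax with no -x
option at all.  A small sketch that just prints the resulting command
line rather than running pax (show_args is a made-up helper name):

```shell
# show_args FORMAT: print the command line the loop would build.
show_args() {
    x=$1
    # Unquoted on purpose, as in the loop above: an empty $x vanishes.
    set -- pax ${x:+-x} $x -w
    echo "$*"
}

default_cmd=$(show_args '')
cpio_cmd=$(show_args sv4cpio)
echo "$default_cmd"
echo "$cpio_cmd"
```

The first line printed is the default-format invocation, the second the
one with an explicit format.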

$ (mkdir x && cd x && for a in ../pax*; do echo "$a:"; {
> < "$a" pax -r -p e && cmp foo ../foo && rm foo; } || {
> 1>&2 echo failed on "$a"; break; }; done; cd .. && rmdir x)
../pax.:
../pax.ar:

ATTENTION! pax archive volume change required.
Ready for archive volume: 2
Input archive name or "." to quit pax.
Archive name > .
Quitting pax!
../pax.bcpio:
../pax.cpio:
../pax.sv4cpio:
../pax.sv4crc:
../pax.tar:
../pax.ustar:
$

Well, looks like pax(1) doesn't know when the ar(1) format archive ends
- that may be a limitation of the ar(1) format (it's quite basic and
limited), but in all cases, including the ar format, it extracted the
file just fine and matched as we expected.

Now let's try something more challenging.  This *will* fail for at least
some formats due to limitations of the formats.  I'm not going to bother
trying to extract/restore these, due to space and time considerations
and limits, but will just check whether they can be successfully
archived - and will also use pax to list the contents of whatever got
archived (a file may not have been successfully archived if pax warned
about it while archiving).

Here we create some files of various sizes, from 1 KiB to 16 GiB,
and see what happens when we attempt to archive them using the various
formats pax offers us:
$ (for s in 1024 1048576 1073741824 2147483647 2147483648 4294967295
> 4294967296 8589934591 8589934592 17179869183 17179869184; do
> truncate -s "$s" s"$s"; done; stat -c '%s %n' s* | sort -n; s=$(
> ls s* | sort -k 1.2n); for x in ar bcpio cpio sv4cpio sv4crc tar ustar
> do echo "$x":; echo "$s" | pax ${x:+-x} $x -w | pax; done; rm s*)
1024 s1024
1048576 s1048576
1073741824 s1073741824
2147483647 s2147483647
2147483648 s2147483648
4294967295 s4294967295
4294967296 s4294967296
8589934591 s8589934591
8589934592 s8589934592
17179869183 s17179869183
17179869184 s17179869184
ar:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
s4294967296
s8589934591
s8589934592
pax: size overflow for s17179869183
pax: size overflow for s17179869184

ATTENTION! pax archive volume change required.
Ready for archive volume: 2
Input archive name or "." to quit pax.
Archive name > .
Quitting pax!
bcpio:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
pax: File is too large for bcpio format s4294967296
pax: File is too large for bcpio format s8589934591
pax: File is too large for bcpio format s8589934592
pax: File is too large for bcpio format s17179869183
pax: File is too large for bcpio format s17179869184
cpio:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
s4294967296
s8589934591
pax: File is too large for cpio format s8589934592
pax: File is too large for cpio format s17179869183
pax: File is too large for cpio format s17179869184
sv4cpio:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
pax: File is too large for sv4cpio format s4294967296
pax: File is too large for sv4cpio format s8589934591
pax: File is too large for sv4cpio format s8589934592
pax: File is too large for sv4cpio format s17179869183
pax: File is too large for sv4cpio format s17179869184
sv4crc:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
pax: File is too large for sv4cpio format s4294967296
pax: File is too large for sv4cpio format s8589934591
pax: File is too large for sv4cpio format s8589934592
pax: File is too large for sv4cpio format s17179869183
pax: File is too large for sv4cpio format s17179869184
tar:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
s4294967296
s8589934591
pax: File is too large for tar s8589934592
pax: File is too large for tar s17179869183
pax: File is too large for tar s17179869184
ustar:
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
s4294967296
s8589934591
pax: File is too long for ustar s8589934592
pax: File is too long for ustar s17179869183
pax: File is too long for ustar s17179869184
$

Well, the results look a bit surprising (and are probably worth
verifying).  Of the sizes and formats tested, these were the largest
sizes successfully archived, by format:
8589934592 ar
8589934591 ustar
8589934591 tar
8589934591 cpio
4294967295 sv4crc
4294967295 sv4cpio
4294967295 bcpio
None of the formats were able to archive the 16GiB-1byte length file.
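
Those cutoffs line up with the width of the size field in each format's
header.  The widths below are my recollection of the format definitions,
not something the pax output above states, so treat them as assumptions:
11 octal digits for ustar, tar and (odc) cpio; 8 hexadecimal digits for
sv4cpio and sv4crc; 32 bits for bcpio; 10 decimal digits for ar.  A quick
bit of shell arithmetic (assuming the shell does 64-bit arithmetic):

```shell
# Largest value each assumed size field can hold.
octal11=$((077777777777))   # 11 octal digits: ustar, tar, cpio (odc)
hex8=$((0xFFFFFFFF))        # 8 hex digits / 32 bits: sv4cpio, sv4crc, bcpio
dec10=9999999999            # 10 decimal digits: ar

echo "ustar/tar/cpio:       $octal11"
echo "sv4cpio/sv4crc/bcpio: $hex8"
echo "ar:                   $dec10"
```

That would also explain why ar took the 8GiB file but not the
16GiB-1byte one: 9999999999 falls between the two.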

Curious how that compares to GNU tar - since so many use that.

$ (for s in 1024 1048576 1073741824 2147483647 2147483648 4294967295 \
> 4294967296 8589934591 8589934592 17179869183 17179869184; do
> truncate -s "$s" s"$s"; done; stat -c '%s %n' s* | sort -n; s=$(
> ls s* | sort -k 1.2n); tar -cf - $s | tar -tf -; rm s*)
1024 s1024
1048576 s1048576
1073741824 s1073741824
2147483647 s2147483647
2147483648 s2147483648
4294967295 s4294967295
4294967296 s4294967296
8589934591 s8589934591
8589934592 s8589934592
17179869183 s17179869183
17179869184 s17179869184
s1024
s1048576
s1073741824
s2147483647
s2147483648
s4294967295
s4294967296
s8589934591
s8589934592
s17179869183
s17179869184
$
And, somewhat surprisingly, GNU tar does handle through at least 16GiB.
Let's try pushing it and see what happens:
$ (m=-1; e=34; while :; do s=$(
> perl -e '$_=(q(2*)x'"$e"');s/\*$//;print "$_'"$m"'\n";' | bc -l)
> truncate -s "$s" s"$s" || break; echo "$s:"
> { tar -cf - s"$s" || break; } | tar -tf - || break
> rm s*
> if [ -n "$m" ]; then m=; else m=-1; e=$(expr "$e" + 1); fi
> done)
17179869183:
s17179869183
17179869184:
s17179869184
34359738367:
s34359738367
34359738368:
s34359738368
68719476735:
s68719476735
68719476736:
s68719476736
137438953471:
s137438953471
137438953472:
s137438953472
274877906943:
s274877906943
274877906944:
s274877906944
549755813887:
s549755813887
549755813888:
s549755813888
1099511627775:
s1099511627775
1099511627776:
s1099511627776
^C
$
Okay, longer than my time/patience.
Let's try it from the other way around, until it at least starts to
work, then we'll quit there.
$ (m=-1; e=66; while :; do
> if [ -n "$m" ]; then m=; e=$(expr "$e" - 1); else m=-1; fi
> s=$(perl -e '$_=(q(2*)x'"$e"');s/\*$//;print "$_'"$m"'\n";' | bc -l)
> truncate -s "$s" s"$s" || continue; echo "$s:"
> { tar -cf - s"$s" || continue; } | tar -tf - || continue
> >>/dev/null 2>&1 rm s*; break; done)
truncate: invalid number '36893488147419103232': Value too large for defined data type
truncate: invalid number '36893488147419103231': Value too large for defined data type
truncate: invalid number '18446744073709551616': Value too large for defined data type
truncate: invalid number '18446744073709551615': Value too large for defined data type
truncate: invalid number '9223372036854775808': Value too large for defined data type
9223372036854775807:
s9223372036854775807
^C
$
So, supposedly it will handle 2^63-1 bytes - which is the largest file
size a signed 64-bit file offset can represent on Linux/Unix/BSD.  I
didn't let the check run to conclusion, as that would take quite long
(tar -tf - prints the member name long before it finishes reading all
its data).
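
As an aside, as long as the sizes fit in a signed 64-bit integer, the
2^e-1 / 2^e sequence the loops above generate with perl and bc can also
be had from shell arithmetic alone.  A minimal sketch, assuming a shell
with 64-bit arithmetic (the loops above need bc because they start beyond
that range):

```shell
# Print 2^e - 1 and 2^e for e = 34, 35, 36 using shell arithmetic only.
sizes=$(e=34; while [ "$e" -le 36 ]; do
    echo $(( (1 << $e) - 1 ))
    echo $(( 1 << $e ))
    e=$((e + 1))
done)
echo "$sizes"

# 2^63 - 1, built from two halves so no intermediate value overflows.
max_off=$(( (1 << 62) - 1 + (1 << 62) ))
echo "$max_off"
```

The first values printed match the sizes at the start of the GNU tar run
above, and the last one is the 9223372036854775807 that truncate
accepted.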

Oh, also keep in mind that tar and cpio formats behave rather
differently when restoring files that were archived with multiple hard
links.  In the case of cpio, restoring any and/or all of the links
restores the backed up data, and all links restored at the same time
will be hard linked.  With tar, the first occurrence of the file backed
up, when restored, restores that file's data.  Restoring subsequent
links from tar format only attempts to recreate the original hard link
relationship.  That can lead to quite surprising results if the file at
the first pathname has changed, and one restores from tar only links
other than the first pathname that was archived.  Let's see if pax
handles that likewise:

$ (for x in ar bcpio cpio sv4cpio sv4crc tar ustar; do echo one > one
> ln one two; pax -w -x "$x" one two > pax; rm two; echo newer > one
> < pax pax -r -p e two; echo "$x:"; cat two; rm one two pax; done)

ATTENTION! pax archive volume change required.
Ready for archive volume: 2
Input archive name or "." to quit pax.
Archive name > .
Quitting pax!
ar:
one
bcpio:
newer
cpio:
newer
sv4cpio:
newer
sv4crc:
newer
tar:
newer
ustar:
newer
$

Well, ar(1) knows nothing of hard links, so simply archives both files
redundantly, and simply restores the requested file, without creating
any hard links.  All the others behave more like tar, recreating the
hard link relationship, but not restoring the file data.  That, however,
isn't how cpio behaves:
$ (echo one > one; ln one two; ls -d one two | cpio -o > cpio; rm two
> echo newer > one; < cpio cpio -imu two; ls -l two; cat two
> rm one two cpio)
1 block
1 block
-rw------- 1 michael users 4 Nov 11 22:12 two
one
$
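
The "newer" results above follow from what a hard link is: two names for
one inode, so a restore that merely re-links two to the existing one
picks up whatever one contains now.  A small sketch showing just that
part, with no archiver involved (the inode helper is made up for the
example, and everything happens in a scratch directory):

```shell
old=$PWD
dir=$(mktemp -d) && cd "$dir" || exit 1

echo one > one
ln one two                      # second name for the same inode

# inode PATH: print the inode number of PATH.
inode() { ls -di "$1" | awk '{print $1}'; }
[ "$(inode one)" = "$(inode two)" ] && linked=yes || linked=no

echo newer > one                # rewrite the data via the first name
via_two=$(cat two)              # the second name sees the new data

echo "linked: $linked, two contains: $via_two"
cd "$old" && rm -rf "$dir"
```

The same inode comparison is a handy check after a real restore, to see
whether one got a fresh copy of the data or just another link.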

So, ... in brief (over?)simplification ... if you want the many
advantages of pax, including being quite reliable, much more portable,
etc., pax is a good choice.  However, it has some limitations (or more
specifically, the formats it supports have some limitations).
At least when I tested it, all its formats supported files up to at
least 2^32-1 bytes (4GiB - 1 byte), and many of its formats up to
2^33-1 bytes (8GiB - 1 byte).
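
Given limits like those, it may be worth asking find(1) up front which
files would be too large for the chosen format, rather than discovering
it in stderr days later.  A minimal sketch: over_limit is a made-up
helper, and the demonstration uses a tiny 1000 byte limit in a scratch
directory so it runs instantly:

```shell
# over_limit LIMIT DIR: list regular files in DIR larger than LIMIT bytes.
over_limit() { find "$2" -type f -size +"$1"c -print; }

old=$PWD
dir=$(mktemp -d) && cd "$dir" || exit 1
truncate -s 1000 fits           # exactly at the limit: not reported
truncate -s 1001 too_big        # one byte over: reported
found=$(over_limit 1000 .)
echo "$found"
cd "$old" && rm -rf "$dir"

# For the sv4cpio limit seen above, something like:
#   over_limit 4294967295 /mnt/a
```

An empty result would mean nothing exceeds the limit.  The actual limit
to use depends on the format and the pax implementation, per the tests
above.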

I don't know that GNU tar conforms to any archive format other than its
own (or at least it considerably extends the other standard tar
formats), so there may be no guarantee of being able to extract its
archives with software other than GNU tar itself.  Of course
(non-ancient versions of) dd(1) can also handle large files just fine
... but that's file-by-file, and not an archiver.

But on Linux, if you've got files >8GiB, GNU tar might be your only
feasible option (besides dd, and filesystem snapshots and the like).

We could do more experiments on filename length, depth of directories,
etc., and we'd (at least generally) find various limits among the
archive formats.