[conspire] rsync for backups?

Michael Paoli Michael.Paoli at cal.berkeley.edu
Mon May 27 19:04:43 PDT 2019


> From: "Kim Davalos" <kdavalos at sonic.net>
> Subject: Re: [conspire] rsync for backups?
> Date: Mon, 27 May 2019 14:46:19 -0700

> If I were using rsync, under what circumstances would I encounter
> this use case in the wild?
> Also, is this use case covered in the documentation, i.e., is this  
> an expected behavior?

Question a bit vague/ambiguous, but in any case ...
So, some key bits I was pointing out.
If one is using/depending upon rsync for backup(s):
o One would be well advised to carefully read the documentation (man page)
   notably including rsync's options, and particularly any impacts
   they - or their absence - may have upon one's backups and what one
   hopes/intends to do - vs. what one might presume if one doesn't
   bother to read the relevant documentation.
o Most specifically, even with the -a or --archive option, leaving out
   the option --ignore-times may result in disappointingly unexpected
   results if one hasn't bothered to reasonably read - or at least
   peruse/skim - the relevant documentation.  Most notably, without
   the --ignore-times option, a backup target may well fail to be
   refreshed if the data has changed but the length and mtime of
   the file happen to be the same.
o The other (longer) bits are some typical examples I use (which also
   do include the --ignore-times option).

"Of course" too, if one (significantly?) prefers speed and reduction
in bandwidth / I/O of backups, and more so willing to risk some
low(ish) probability of some data loss, then one might prefer to
leave out the --ignore-times options.  I just think one ought be
reasonably aware of what the tradeoffs are either way.
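
If in doubt which way one wants to go, note too that rsync can show
what it would (and wouldn't) transfer without changing anything -
e.g. something roughly like the following, where the paths are just
placeholders:

$ # what would the default quick check transfer?
$ rsync -a --dry-run --itemize-changes /src/ /backup/
$ # and with the quick check disabled?
$ rsync -a --ignore-times --dry-run --itemize-changes /src/ /backup/

Files that show up only in the second listing are ones the default
quick check would've skipped without ever examining their data.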

And some bits I could've expanded upon, but Rick also points out or
at least touches upon.  Yes, testing is good.  And that can also be
particularly important when one is "sure" ... but ... hasn't tested.
Backups are important.

Sure, a lot of the options I included, such as --xattrs, might not
apply (or matter, or matter much) in some/many circumstances.  I
included them for *relative* completeness, where my objective was an
archival-quality backup that I could (or at least should be able to)
depend upon.  The options I include try to get all relevant data and
metadata (at least of files) I might actually possibly care about -
but those attributes and metadata don't necessarily apply to most
files and/or filesystems - in fact they may apply to only a very
small number of files, or none at all ... but if they do happen to
apply, well, then I also manage to get those bits backed up ...
probably not critical at all, but also certainly doesn't hurt.

The --ipv4 option applied only to my particular scenario - and yes,
it forces IPv4, effectively ignoring IPv6.  And why did I do that?
Again, my particular scenario.  The IPv4 is native, with both hosts
on the same physical LAN.  The IPv6 (publicly routable, and thus
corresponding to the hostnames - their DNS entries) is tunneled, with
both hosts on different subnets - traversing the Internet IPv6 that
way is vastly slower in my particular case (both route round-trip
over a relatively slow DSL connection), whereas with IPv4 the
connection is over a gigabit local LAN (or even faster if/where/when
virtual).  I could've alternatively used some IPv6 that would be
common to both [v]LANs, but to do that I'd need to do it by IP
address, or do different DNS resolution locally, or special different
names for the hosts or whatever - and remember to use that, etc. -
simpler just to force using IPv4 in that particular case.  Were the
situation reversed (hey, IPv4 is getting increasingly scarce relative
to demand), I might've likewise forced IPv6 (--ipv6) ... well,
wetware ain't perfect - peeking at the man page, it's *prefer*, not
force.
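
(And where mere *preference* isn't sufficient, with the ssh transport
one can force the address family at the ssh layer instead - a small
sketch, with the host and paths as placeholders:

$ rsync -a --rsh='ssh -4' host.example.com:/src/ /backup/

ssh's -4 option forces IPv4 only, and likewise -6 forces IPv6.)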

Anyway, as I said and/or more-or-less alluded to, rsync takes, uhm,
"shortcuts".  That can be highly useful - e.g. it doesn't have to
send all the data from source to target notably where that's
(significantly) redundant with what's already present at target.
But to do so ... shortcuts - e.g. it does hashes on chunks of
data on the files - and avoids transferring the file's data
where those hashes match.  But by default it goes further than
that.  E.g. pathname, length, and mtime match?  Don't even
bother to do the hash - presume they're the same - and that could bite one.
But there's an option to change that behavior: --ignore-times
So, the shortcuts can be highly useful to, e.g. reduce bandwidth and
I/O used (on network, written to drive), but too, they might bite
one if one doesn't pay reasonably close attention.
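
Note also the related --checksum (-c) option - rather than disabling
the quick check, it changes it: instead of size+mtime, rsync compares
full-file checksums, so unchanged files still get skipped, but only
after their data has actually been read on both ends.  Roughly, with
placeholder paths:

$ # compare full-file checksums instead of size & mtime:
$ rsync -a --checksum /src/ /backup/
$ # or skip the quick check entirely - always run the delta algorithm:
$ rsync -a --ignore-times /src/ /backup/

Either catches the same-size-same-mtime case that bites by default;
--checksum costs extra reading, --ignore-times extra processing of
every file.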

Oh, and *no* backup software is perfect ... and I'm not talkin'
about bugs, etc., either.  If you're backing up data on a *live*
system - notably filesystems mounted rw, there will always be *some*
compromises that need be made.  But with due caution and attention,
one typically won't get bitten by those issues.  But in more extreme
cases (what, me like edge cases?  :-)), one may need to pay even more
attention and test accordingly.  Don't recall the author - it was
quite a while back (and may have been quite updated/superseded since
then), but I well recall a USENIX paper(/book?) covering UNIX backup
and ... well, torture testing and issues and edge cases thereof
(quite possibly Elizabeth Zwicky's torture-testing of backup and
archive programs, from USENIX LISA), and all the things that can
(and do, or might) go wrong - even when the software does exactly
what it's supposed to do.
To simply illustrate a typical example case or two:
o Let's say one has a very large very active file, data in it (or even
   being appended to it) changes pretty much "constantly" (faster than
   one can read the file end-to-end).  So ... you (or the software) back
   up the file.  What's its length and data and inode data?  Well,
   multiuser multitasking operating system.  As one reads the file,
   data changes.  Is the data read later in the file consistent with that
   read earlier within the file?  What if they're linked transactional
   data that needs be consistent?  What about the metadata - what's the
   logical length and mtime?  There's no way to do an atomic read of the
   inode data (lstat(2) data) and all the data in the file at the same
   time and be assured none of that changes from start of reading
   that data and metadata to end of reading them.  Other PIDs may
   change things effectively "simultaneously" (while that stuff is
   going on).  There are ways to work around much of that, e.g.
   take a snapshot (e.g. of the filesystem) and back up the snapshot
   (see the sketch after this list) - that doesn't guarantee
   a clean backup of the filesystem, but should be a recoverable
   backup and sufficiently consistent - at least if the filesystem
   etc. is behaving as it should.  Okay, that solved the issue for
   *one* filesystem (kind'a) ... but what if one needs data (e.g.
   transactional) across multiple filesystems at the same time?
   There isn't some way to do atomic multiple simultaneous snapshots
   (at least that I'm aware of) across multiple filesystems at
   same time - at least on live system and within the OS.  Now, if
   it's virtual storage, might be possible to snapshot the storage
   from outside the VM.  Uh huh, and what about backing up the
   physical?
o What about unlinked open files?  In most cases one would ignore those
   for backups.  But what if one needs the best possible forensic quality
   backup of what was happening at the time?
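
(For the snapshot approach mentioned in the first item above, a
minimal sketch - assuming LVM2, with the volume group, LV names,
snapshot size, and paths all being placeholders:

# lvcreate --snapshot --size 1G --name homesnap /dev/vg0/home
# mount -o ro /dev/vg0/homesnap /mnt/snap
# rsync -a --ignore-times /mnt/snap/ /backup/home/
# umount /mnt/snap
# lvremove -f /dev/vg0/homesnap

The snapshot gives a frozen, crash-consistent view of that one
filesystem to back up, while the live filesystem keeps changing
underneath.)
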
Anyway, if one digs sufficiently (too? ;-)) deep, a highly to
exceedingly accurate, complete backup - on a live rw system - becomes
a fairly complex question.
But for most of us, and most backups, doing backups, and doing backups
that are "good enough" and tested at least once in a while, is quite
darn good enough.
As is oft said, any backup is better than none.  (Reminds me, I have
a seriously flaked-out older drive that's been in off-site backup
rotations and needs replacing - it can't be relied upon, so I'm
currently one short of what I prefer to have present, media-wise, in
my rotations.  In case one is curious: for backup rotations, my
preferred minimum is 3 pieces of backup media - or 3 sets, if a
single medium can't hold a full backup - with never more than one of
those on-site, and always at least two off-site.  Remember, media
does fail; one wants a sufficiently low probability of
non-recoverability relative to media failure probabilities, also
taking into account that media rotates locations.  So, e.g., an
asteroid may land on the site while a backup is in progress, taking
out the primary and the one media (set) being backed up to - what's
left to restore from is then only off-site.  What if some of that
media also fails upon the restore attempt?  Got enough backups/media?
Anyway, my preference for minimum (sets of) backup media: 3.  Of
course "more is better" ... well, to a point.)
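
To put some rough numbers on that: if each medium (set) independently
has, say, a 5% chance of being unreadable when one goes to restore,
then with only 2 copies the chance of losing both is 0.05 * 0.05 =
0.25%, whereas with 3 it's 0.05^3 = 0.0125% - crude figures and a big
independence assumption, but they show how much each additional set
buys.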

> On 5/27/19 9:57 AM, Michael Paoli wrote:
>> rsync for backups?  (query/conversation came up recently)
>> Sure, it can be used for that,
>> however, I generally tell folks to carefully review the options;
>> notably as there are some default behaviors of rsync I don't like,
>> and I don't consider the *default* behaviors acceptable/sufficient.
>> However, with careful consideration and use of relevant options,
>> rsync can be quite lovely (and often very efficient) for
>> backups ... or more/most specifically, updating a target
>> to match a source.
>>
>> What hazards by default?  Consider the following:
>> $ echo foo > foo && echo bar > bar && touch -m -r foo bar
>> $ rsync -a foo bar
>> $ cmp foo bar
>> foo bar differ: char 1, line 1
>> $ rsync -a --ignore-times foo bar
>> $ cmp foo bar
>> $
>> So, notice initially, even with the -a (--archive) option,
>> in our example, rsync still fails to replicate foo to bar.
>> Why?  By default, rsync takes some shortcuts - and in my opinion
>> excessively so.  What shortcut(s) specifically?
>> Well, if source and target are same size, and have same
>> mtime (modification time) - and they're the same relative
>> pathnames (I could've created different parent directories on source
>> and target, and same-named file within each to make it a bit more
>> clear, but in any case ...), it will presume the file contents are
>> the same, and neither read nor update the target.  That could be very
>> hazardous for backups, especially if one really wants to well and
>> accurately backup "everything".
>> The solution for that little bit is --ignore-times;
>> that option tells rsync to not consider the modification times,
>> but in all cases do the hashes of the file blocks and update where they
>> don't match.
>> So ... without using such option, by default, newer data could end
>> up not being backed up by rsync.  Even a malicious/crafty user could
>> avoid newer data being picked up by rsync, by keeping pathname, size,
>> and (user settable) mtime of a file the same.
>> And, for the curious, what do my typical rsync options look like when
>> I'm doing a backup?  Let's see ... from some such programs I have
>> laying around (and use semi-regularly) ...
>> rsync \
>>     --archive \
>>     --acls \
>>     --xattrs \
>>     --hard-links \
>>     --numeric-ids \
>>     --relative \
>>     --sparse \
>>     --checksum \
>>     --partial \
>>     --one-file-system \
>>     --delete-excluded \
>>     --ignore-times \
>>     --quiet \
>>     { non-option arguments ... }
>> The above is for a local-to-local (e.g. I physically attach a
>> backup drive).  I've also got the --one-file-system - that's where I
>> don't want to cross filesystem boundaries (mount points).  Be sure to
>> omit such option if you do want to traverse filesystem boundaries.
>> One also might want to omit --quiet ... or not.
>> Let's see ... another example ...
>> rsync \
>>     --ipv4 \
>>     --archive \
>>     --acls \
>>     --xattrs \
>>     --hard-links \
>>     --numeric-ids \
>>     --relative \
>>     --sparse \
>>     --rsh='ssh -aTx -o BatchMode=yes '"$SSH_OPTs" \
>>     --checksum \
>>     --partial \
>>     --one-file-system \
>>     --delete-excluded \
>>     --ignore-times \
>>     --compress-level=9 \
>>     [ optionally filter expression(s) to include/exclude as desired ]
>>     --quiet \
>>     { non-option arguments ... }
>> The above example does a remote to local over ssh.  Again, my
>> options may not be fully suitable for you.  :-)
>> I may also optionally have some stuff set in SSH_OPTs in the
>> environment that I may want to pass along to ssh as options.
>> Whether or not to use compression and/or how aggressively,
>> what's optimal depends on where the bottleneck is or may be.
>> E.g. slow link, reasonably fast CPU, generally high compression good.
>> Fast link & drives, slow CPU: little to no compression likely faster.
>> When in doubt, test.  :-)
>> And don't forget to reasonably test, at least on occasion,
>> when you're damn sure - lest you get bitten hard and unexpectedly.



