[sf-lug] New Open Source (Backup) Software Proposal (?)
Michael Paoli
Michael.Paoli at cal.berkeley.edu
Sun Apr 4 11:33:58 PDT 2010
I might be rather - to perhaps quite interested ... but doing such,
and doing it well, could be rather to quite non-trivial.
Perhaps a wiki page to coordinate design of and work on such? ... at
least for starters? E.g. a page somewhere under:
https://secure.balug.org/wiki/doku.php?id=sf-lug:projects
preface notes/comments:
o Open Source does already have a lot to offer in the realm of backup
software (e.g. GNU tar, cpio, Amanda, ...)
o Open Source does already have a lot to offer that's useful to, but not
necessarily directly or completely/specifically, "backup" software,
e.g.: dd, rsync, gpg, openssl, ssh/scp/sftp, NFS, RAID, AoE, network
block devices, LVM, device/filesystem/file encryption, deduplication,
databases, powerful programming languages, nice GUI
interfaces/frameworks, etc.
o What I see "typically"/mostly missing in the Open Source realm is
something that very nicely pulls many of the existing capabilities
together, and makes an excellent framework for doing backups most any
way one might need or want to do so. "Of course" one can often put
something together fairly easily one's self, mostly using existing
stuff, and adding, e.g., some bits of scripting or programming to tie
the various pieces together ... but such tends to be yet another
one-off (semi-)"solution" - and typically doesn't solve the problem
quite the same way the next person wants it solved (though it still
may be useful to many).
What I'd like to see (or would propose):
o An open modular framework - so as to be quite easy to add additional
features/capabilities/functionality/formats, etc. - in many cases by
effectively just tying into and leveraging other existing software.
Should also be able to tie into quite arbitrary software - including
software that's not necessarily Open Source (e.g. tie in via
commands/IPC/APIs).
o shouldn't need to implement the whole bloody thing/wishlist at once
(lest it never get done, or even started)
o should be able to add to / extend it, without breaking backwards
compatibility, and without dragging along excessive cruft for sake of
backwards compatibility - so, sufficiently reasonable up-front
design/planning may be quite important.
o should be written in some capable high-level language(s) (e.g. perl,
python) for the most part - only bits really needing higher levels of
performance optimization (or perhaps security) should be implemented
in (most likely) C - but that may (or may not) be at all necessary.
o should also have a nice capable GUI
o absolutely everything that can be functionally done in the GUI must
also be fully doable from the command line (CLI).
o "command line window" - something I'd really like to see (have seen it
in commercial Veritas Volume Manager before - not sure if it exists
elsewhere). Don't know about current versions of Veritas Volume
Manager, but some older versions had a feature, I think it was called
"open command line window" ... you did that, then you could see, for
anything and everything that could be done in the GUI, one would see
the command(s) that one could use from the command line (e.g. shell
prompt) to accomplish those same tasks. That was a highly useful
feature - e.g. if one didn't know how to do it from CLI, could do it
in the GUI ... and see how it's done, from that, with CLI. One could
then take that information/knowledge, and script it for the next
rather similar several hundred operations, and not spend hours going
clicky clicky clicky with the GUI. Bonus if the GUI also reflects
updates in real-time, even if they're done independently from a
command prompt. I've also seen other software take a somewhat similar
approach - instead of a "command line window", actions done via GUI
would be written to a log file - in the same form they could also be
executed from CLI.
o should be able to write in quite standard archive formats, e.g. tar,
cpio
o shouldn't be limited to only writing existing common archive formats
o should be capable of backups to various media, e.g. disk, flash, tape,
CD-R[W]/DVD+-R[W]/Blu-ray/... - essentially any reasonably suitable
rewritable or write-once random access or streaming media.
o should be capable of writing targets as filesystems, filesystem
images, or archive formats.
o archive formats should be "enhanced" / added to - or, likely better
yet, separate (meta)data should also be backed up - so as to quite
fully back up all relevant data and file metadata. E.g., got ACLs?
Was that ACL data backed up? Did you back up your sparse files
efficiently? If you restore the sparse files you backed up, will the
sparse blocks be restored as sparse, and the null-filled (but
allocated) blocks as not sparse? (See the ACL/sparse-file sketch
below.)
o To the extent feasible, restore - even without specific tool(s) or
software - and bare metal recovery should be rather to quite easy
(e.g. rather analogous to the approach Amanda uses: a short identifying
header block, then GNU tar format - dd and tar suffice to restore).
o It shouldn't be mandatory to restore all metadata - e.g. one may want
to restore to a different filesystem type that doesn't support (the
same) metadata. Not restoring backed up metadata should be able to
generate warning(s) - should also be able to configure suppression of
such warnings (by default or for just specific invocation, such as
with CLI option or likewise via GUI option).
o Metadata backup - there should be reasonable capabilities to back up
metadata - such should be done in forms that both make bare metal
recovery as easy as feasible, and also make the metadata readily
available in human-parseable format(s) - again without needing
specific software tools at recovery time - so that one can make
various design/layout/configuration decisions at restore time (see
the layout-metadata sketch below). E.g.:
o Exactly how were the disks/LUNs partitioned?
o What were the exact sizes of each disk/LUN?
o How exactly was RAID configured? Including spares?
o How big, and how full was each filesystem?
o How much space was used at and beneath any given directory on any
filesystem? (At recovery time, one may wish to consider changing
what is and isn't a separate filesystem, and where - knowing various
sizing information may be relatively critical to being able to
quickly assess and make such decisions.)
o if one splits a filesystem upon recovery, does one know about the
space increase due to what were hard links but will become separate
files?
o More/less metadata - what about "foreign" (e.g. NTFS) filesystems?
What about efficiently backing up filesystems that have less metadata
than POSIX (e.g. FAT)? Do we add archive format(s) to cover those,
and/or have some failsafe (e.g. image) method?
o space efficiency - compression (at least as an option); also:
o option to turn on/off
o option to adjust compression level/algorithm (space vs. CPU
tradeoffs)
o "smart"/auto adaptive option(s) - e.g. to allow the backup software
to determine if compression is/isn't the bottleneck, and accordingly
use higher/lower(or none) compression levels
o play nice/smart with encryption - compression (if at all) before
encryption, not after (see the compress-then-encrypt sketch below).
o scalable/distributable:
o should work reasonably from a small single system writing to
floppies, CD-R[W], DVD+-R[W], tape, or other disk(s), up to
exceedingly large distributed enterprise/institution environments
o must be able to work across network
o should be able to work intelligently with SAN
o network protocols and ports - should be as simple as feasible, but
no simpler (e.g. reduce complexity in allowing through firewalls).
Should also be rather configurable (e.g. if one wants to run it on,
e.g., TCP port 22, 25, 80 or 443 because the firewall configurations
already pass those through, that should be doable).
o should be able to work/deal reasonably with low-performance
components (e.g. slow archive devices that are somewhat unreliable
and should be read verified after write, slow/intermittent network
connections, etc.)
o should be able to integrate/manage large numbers of backup clients
and servers - should interoperate quite nicely
o should be able to separate out various client/server functionality
(e.g. to dedicated system(s)/server(s) and/or hardware), e.g.:
o backup clients (pull source data)
o encryption layer(s)
o backup/restore management/control (track/manage everything -
directly or indirectly)
o backup targets (write data to backup storage)
o access controls, etc. - should be able to manage who can do what
where with what server(s)/data, e.g.:
o summary reporting - no details below host and filesystem (no
reporting on specific filenames or users)
o detail reporting - anything but data in files themselves
o backup - make backups, but no other write access (or read
everything, write only to backup targets)
o restore - read/inspect/search backup data, and write data back to
original (or alternate) source locations
o compartmentalization by user/group/server(s)/etc.
o integration into authentication frameworks (e.g. LDAP)
o deduplication - should be smart about and capable of deduplication,
and should be highly correct about it (always confirm matched data,
not just a matching hash - see the deduplication sketch below).
Should be able to do deduplication on-the-fly (before writing to
non-volatile storage) and/or after-the-fact (e.g. delay it if
necessary for efficiency, or consolidate multiple backup archives
and apply deduplication across them).
o should intelligently and efficiently handle empty/small files (e.g.
those smaller than the hash of their data), but needn't handle them
more efficiently than the source (or target) filesystem itself if
writing to a filesystem rather than an archive format. Should also be
sufficiently efficient when writing such files to an archive format.
o replication - should be able to intelligently replicate backup
archive(s) or any portion thereof to additional copy(/ies) as and
where desired (local or across network). Should be able to specify
synchronous or asynchronous replication, and any applicable
tolerances/delays or scheduling of such. Retention policies should be
configurable to be distinct or identical across replicated target
locations.
o should be highly monitorable - should be able to well report on its
own health (e.g. anything and/or everything that fails proper
self-checks, or that it's been asked to do and wasn't able to do or
had problem(s) with). Should be able to tie in nicely with SNMP and
other means of monitoring. Should be able to disable SNMP if desired.
o database backups:
o should be able to intelligently integrate with databases for proper
"hot" backups, and be able to do "cold" DB backups and know that
they're good "cold" backups - should also be able to integrate with
databases to trigger "cold" backups (cleanly shut down the database,
back up "cold", and restart the database if it was brought down for
the "cold" backup).
o should be able to intelligently detect at least many cases where a
database isn't being backed up properly (e.g. one pointed it at a
filesystem containing a database, but didn't configure it to be able
to talk to the database to safely take a "hot" backup - and the
database is live and changing as the software is being asked to back
up those files). Perhaps this can be combined with the feature(s)
below, but it may be more user-friendly if it can also more
intelligently identify such a database (e.g. report not just the
discovered file(s), but the database(s) using that/those file(s)).
o live filesystems: backing up rw mounted filesystems is an imperfect
science. Should be capable of detecting and, if desired, logging and/or
warning about issues encountered, and, if/as feasible, automagically
recovering and backing the data up safely. E.g., let's say we're
backing up a large file and it changes as we're backing it up - say
we're 3/4 of the way through when data we've already read and written
to the backup changes, and perhaps the file grows - or shrinks - maybe
even shrinks to less than the length of data we've already written.
Can we at least detect and reasonably warn about this? If feasible,
can we automagically correct? (If we read it again end-to-end, and
all's fine and nothing changes that time, can we just discard what we
started with and use the more recent read pass and metadata?) If the
file keeps changing faster than we can read it end-to-end, can we at
least warn about that? Should we skip the re-read recovery attempt if
the file is larger than some threshold size, or if it already took
longer than some amount of time to read? (See the stable-read sketch
below.)
o snapshots - should be capable of using/interacting with
filesystem/device snapshots to take safe(r) full (or
incremental/differential) backups of live rw mounted filesystems
(see the LVM snapshot sketch below).
o incremental/differential backups - should be fully capable of such.
Should also be able to do so intelligently on POSIX filesystems via
use of ctime (see the ctime-based selection sketch below). Should
also be able to (as configured) specifically distrust ctime from any
and/or all specific host(s) and/or filesystem(s), and/or force
validation (e.g. compare hashes and/or actual data, as
desired/configured).
o link-farms - should be able to intelligently do link-farm (multiple
hard link) type backups (see the rsync --link-dest sketch below)
o should provide sufficiently intelligent/capable interfaces to the
backed-up/archived data, preferably in forms familiar to the
user(s)/administrator(s) - e.g. as a (ro) filesystem interface
o should be able to manage and enforce retention schedules (e.g. must
retain until, must destroy by), and should also be able to do that
quite intelligently based on logical criteria (e.g. for some
filesystem on some host, for all backups older than one month, only
keep the most recent backup within each week, and for all those
backups older than a year, only keep the most recent within each
month - see the retention-pruning sketch below). Should be able to
retroactively change retention policies after backups are created.
For backups in archive format, it may be desired to assign distinct
retention policies within an archive after it was created; that
should (by default) warn, but allow the archive to be split to
enforce the retention policy (the split might be deferred until the
first expiration within the archive is to be enforced). The specific
implications of "destroy"/drop in retention policies should be quite
configurable based upon security needs (or lack thereof) - e.g. they
can vary from free/deallocate (allow for reuse) to specified
overwrite operations on key(s) and/or the backed up data.
o encryption and key management - need to be able to intelligently
handle and manage that - should also be able to handle high(er)
security key management policies - e.g. restricting how keys are
stored and managed (e.g. the keys for this subset of servers can only
be held in RAM on that encryption server, they can only be backed up
from that RAM by encrypting to this authorized set of semi-master
keys and then writing only to such-and-such device directly connected
to that encryption server, and that device must also contain a
key/passphrase which authenticates the device's authorization to hold
those keys encrypted to one of those other authorized semi-master
keys).
o public key encryption should be leveraged as and where appropriate,
and likewise for symmetric key encryption (e.g. bulk session
encryption speed efficiency). E.g. TLS/SSL/ssh/gpg/etc. may be
leveraged for various desired purposes (e.g. authenticating client to
server and vice versa).
o most or all encryption bits should be optional (e.g. they may be
undesired in some environments, or the speed/efficiency gains may be
preferred over encryption security) and should also be
tunable/configurable, both in general and quite specifically (e.g.
one may wish to not encrypt backups of data that's public, but may
still wish to ensure there are enough cryptography-based checks to
ensure the integrity of such backups, so that, e.g., at least any
alterations to the backed up data can be detected).
o encryption key management may be leveraged to aid enforcement of
retention policies. E.g., optionally, destroying a backup can be as
efficient as destroying its key (see the per-backup key sketch
below). For higher security requirements, overwrite patterns/levels
can be set/enforced for keys and/or wipes of backup data.
o may be highly useful to also incorporate management of existing backup
frameworks (e.g. Amanda). I.e. don't obsolete what's already in place
and working - at minimum, play nice with it, and perhaps even
leverage it well.
Well, anyway, that's the short start of my wish-list :-) ... I probably
forgot at least some stuff, but that's likely at least most of it.
Oh, can I have world peace too? :-)
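To make a few of those wish-list points more concrete, here are some very
rough sketches (Python, since that's one of the languages I'd suggest
anyway). They're sketches of approach, not finished implementations:
made-up names like "mybackup" and the various paths are placeholders, and
where they shell out to real tools (tar, rsync, gpg, LVM, etc.) one should
of course check the options against one's own versions.

First, the "command line window" idea: every GUI action could be funneled
through one function that both performs the operation and logs the exact
equivalent CLI invocation; the "mybackup" command and the log path are
made up:

import shlex
import subprocess

COMMAND_LOG = "/var/log/mybackup/gui-commands.log"  # hypothetical path

def run_action(argv):
    """Run a backup CLI command and append the exact invocation to the
    log that the GUI's "command line window" displays."""
    line = " ".join(shlex.quote(a) for a in argv)
    with open(COMMAND_LOG, "a") as log:
        log.write(line + "\n")
    return subprocess.run(argv, check=True)

# A GUI "back up /home now" button might end up calling something like:
# run_action(["mybackup", "run", "--client", "localhost", "--path", "/home"])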
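On ACLs and sparse files: GNU tar can be asked to preserve both, provided
it was built with ACL support (check your tar); a getfacl dump is a more
portable fallback for the ACL metadata. Paths here are placeholders:

import subprocess

# Capture sparseness, and ACLs if this tar build supports --acls:
subprocess.run(["tar", "--create", "--sparse", "--acls",
                "--file", "/backups/home.tar", "/home"], check=True)

# Portable fallback for the ACL metadata (restorable with "setfacl --restore"):
with open("/backups/home.acls", "w") as out:
    subprocess.run(["getfacl", "-R", "/home"], stdout=out, check=True)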
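On layout/sizing metadata for bare metal recovery: the idea is simply to
capture it as plain text that's readable at restore time without any
special tools. This assumes the usual Linux utilities (lsblk, sfdisk, df,
mdadm, du) are present; device and directory names are placeholders:

import os
import subprocess

os.makedirs("meta", exist_ok=True)

def save(cmd, outfile):
    """Run a command and capture its plain-text output."""
    with open(outfile, "w") as f:
        subprocess.run(cmd, stdout=f, stderr=subprocess.DEVNULL, check=False)

save(["lsblk", "-b", "-o", "NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT"], "meta/lsblk.txt")
save(["sfdisk", "-d", "/dev/sda"], "meta/sda.partitions")  # exact partitioning
save(["df", "-k"], "meta/df.txt")                          # filesystem fullness
save(["mdadm", "--detail", "--scan"], "meta/mdadm.txt")    # RAID configuration
save(["du", "-xk", "/home"], "meta/du-home.txt")           # space per directory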
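A compress-then-encrypt sketch: encrypted data doesn't compress, so
compression has to come first in the pipeline. This one just chains tar,
gzip and gpg; the recipient key and paths are placeholders:

import subprocess

tar = subprocess.Popen(["tar", "-cf", "-", "/etc"], stdout=subprocess.PIPE)
gz = subprocess.Popen(["gzip", "-6"], stdin=tar.stdout, stdout=subprocess.PIPE)
tar.stdout.close()   # let tar see SIGPIPE if gzip exits early
with open("etc.tar.gz.gpg", "wb") as out:
    subprocess.run(["gpg", "--encrypt", "--recipient", "backup@example.org"],
                   stdin=gz.stdout, stdout=out, check=True)
gz.stdout.close()
tar.wait()
gz.wait()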
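A deduplication sketch that never trusts the hash alone: a hash match only
nominates a candidate, and the actual bytes are compared before a chunk is
treated as a duplicate (in-memory here, just to show the check):

import hashlib

store = {}   # sha256 hex digest -> list of distinct chunks with that digest

def dedup_store(chunk):
    """Return a stored copy of chunk, reusing an existing one only after a
    full byte-for-byte comparison."""
    digest = hashlib.sha256(chunk).hexdigest()
    for existing in store.get(digest, []):
        if existing == chunk:        # confirm matched data, not just the hash
            return existing
    store.setdefault(digest, []).append(chunk)
    return chunk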
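A stable-read sketch for the "file changed while we read it" case: stat
before, read, stat again, and retry a bounded number of times before
giving up so the caller can warn (or fall back to a snapshot). The retry
count and the error handling are arbitrary:

import os

def read_stable(path, max_tries=3):
    """Read a file, re-reading if it appears to have changed mid-read."""
    for _ in range(max_tries):
        before = os.stat(path)
        with open(path, "rb") as f:
            data = f.read()
        after = os.stat(path)
        if (before.st_size, before.st_mtime_ns, before.st_ctime_ns) == \
           (after.st_size, after.st_mtime_ns, after.st_ctime_ns):
            return data
    raise RuntimeError(path + " kept changing while being read")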
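An LVM snapshot sketch: create a snapshot, mount it read-only, archive
from the snapshot, then tear it down. Volume names, the snapshot size and
the mount point are placeholders (and the mount point is assumed to
already exist):

import subprocess

def run(*argv):
    subprocess.run(argv, check=True)

run("lvcreate", "--snapshot", "--size", "1G", "--name", "homesnap",
    "/dev/vg0/home")
run("mount", "-o", "ro", "/dev/vg0/homesnap", "/mnt/homesnap")
try:
    run("tar", "-cf", "/backups/home.tar", "-C", "/mnt/homesnap", ".")
finally:
    run("umount", "/mnt/homesnap")
    run("lvremove", "-f", "/dev/vg0/homesnap")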
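A ctime-based selection sketch for incrementals: anything whose inode
change time is newer than the previous backup's start time is a candidate
for this run (with the option, per above, of distrusting ctime and
comparing hashes or actual data instead):

import os

def changed_since(root, last_backup_epoch):
    """Yield paths whose ctime is newer than the previous backup's start."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ctime > last_backup_epoch:
                    yield path
            except OSError:
                pass   # vanished mid-walk; a real tool would log this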
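An rsync --link-dest sketch for link-farm backups: each run looks like a
full tree, but unchanged files are hard-linked against the previous run.
Destination paths are illustrative:

import subprocess
import time

stamp = time.strftime("%Y-%m-%d")
subprocess.run(["rsync", "-a", "--delete",
                "--link-dest=/backups/host1/latest",
                "/home/", "/backups/host1/" + stamp + "/"], check=True)
# Point "latest" at the run just completed:
subprocess.run(["ln", "-sfn", stamp, "/backups/host1/latest"], check=True)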
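A retention-pruning sketch for the example policy above (keep everything
for a month, the newest per week up to a year, the newest per month beyond
that); a real policy would of course be configurable rather than
hard-coded:

import datetime as dt

def to_keep(backups, now):
    """backups: iterable of (datetime, backup_id); returns ids to keep."""
    per_week, per_month, kept = {}, {}, set()
    for when, bid in sorted(backups):
        age = now - when
        if age <= dt.timedelta(days=31):
            kept.add(bid)                             # keep everything recent
        elif age <= dt.timedelta(days=365):
            per_week[when.isocalendar()[:2]] = bid    # newest in ISO week wins
        else:
            per_month[(when.year, when.month)] = bid  # newest in month wins
    return kept | set(per_week.values()) | set(per_month.values())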
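A per-backup key sketch for tying key management to retention: give each
backup its own data key, and destroying that one key is then as good as
destroying the backup. The third-party "cryptography" package's Fernet
stands in here for whatever cipher would really be used, and in a real
tool the per-backup key would itself be wrapped to one or more
(semi-)master keys rather than written out in the clear:

from cryptography.fernet import Fernet   # third-party "cryptography" package

def encrypt_backup(data, keyfile):
    """Encrypt one backup under its own fresh symmetric key."""
    key = Fernet.generate_key()
    with open(keyfile, "wb") as f:
        f.write(key)
    return Fernet(key).encrypt(data)

# Destroying (or securely overwriting) keyfile later renders the ciphertext
# unrecoverable - "destroy backup" becomes as cheap as "destroy key".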
> Date: Mon, 15 Mar 2010 13:32:02 -0500
> From: David Hinkle <hinkle at cipafilter.com>
> Subject: [sf-lug] New Open Source Software Proposal
> To: "sf-lug at linuxmafia.com" <sf-lug at linuxmafia.com>
>
> David Rosenstrauch, Alex and I have been talking back and forth
> about how to get what is probably a pretty standard scenario
> accomplished with open source tools. We can't seem to find any pile
> of software that can make it happen, so I was thinking of writing it.
> First I want to solicit some feedback on my idea.
>
> The plan would be an rsync replacement. Instead of syncing local
> files to a remote fileserver over ssh, it would instead break the
> local files into chunks, independently encrypt each chunk, and sync
> those chunks over. The chunks could be stored in a SQLite
> database along with the checksum of the original unencrypted contents
> of each chunk and the checksum of the unencrypted file. We would
> key these chunks based on the encrypted filename.
>
> If we save the data in this manner, on subsequent backups, the client
> can ask for a list of checksums, compare those checksums to local
> files, and then transmit any chunks of those local files that may
> have been changed.
>
> This would mean we should be able to get rsync-like performance
> backing up to an encrypted datastore on a remote server that has no
> knowledge of the encryption key. We would also get the awesome ease
> of use of rsync over ssh. Any server you have shell access to and
> that you can upload files to you could use as a safe remote
> repository for your data.
>
> What do you guys think? Useful? Not useful? Would you use it?
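(For concreteness, here's a very rough sketch of the chunk-and-checksum
store described above - the chunk size, schema and names are arbitrary,
and any bytes-in/bytes-out encryption function could be plugged in. The
store only ever holds checksums and ciphertext, so on a later run the
client can ask which checksums are already present and send only the
changed chunks:)

import hashlib
import sqlite3

CHUNK_SIZE = 1 << 20   # 1 MiB, arbitrary

def store_file(db_path, path, encrypt):
    """Split a file into chunks; store each chunk's plaintext checksum plus
    its encrypted bytes in SQLite."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS chunks
                  (file TEXT, seq INTEGER, sha256 TEXT, data BLOB,
                   PRIMARY KEY (file, seq))""")
    with open(path, "rb") as f:
        for seq, chunk in enumerate(iter(lambda: f.read(CHUNK_SIZE), b"")):
            digest = hashlib.sha256(chunk).hexdigest()
            # On a later run, chunks whose digest is already stored
            # needn't be re-encrypted or re-sent.
            db.execute("INSERT OR REPLACE INTO chunks VALUES (?,?,?,?)",
                       (path, seq, digest, encrypt(chunk)))
    db.commit()
    db.close()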