[sf-lug] New Open Source (Backup) Software Proposal (?)
Michael Paoli
Michael.Paoli at cal.berkeley.edu
Sun Apr 4 11:33:58 PDT 2010
I might be rather - to perhaps quite interested ... but doing such,
and doing it well, could be rather to quite non-trivial.
Perhaps a wiki page to coordinate design of and work on such? ... at
least for starters? E.g. a page somewhere under:
https://secure.balug.org/wiki/doku.php?id=sf-lug:projects
preface notes/comments:
o Open Source does already have a lot to offer in the realm of backup
software (e.g. GNU tar, cpio, Amanda, ...)
o Open Source does already have a lot to offer that's useful to, but not
necessarily directly or completely/specifically, "backup" software,
e.g.: dd, rsync, gpg, openssl, ssh/scp/sftp, NFS, RAID, AoE, network
block devices, LVM, device/filesystem/file encryption, deduplication,
databases, powerful programming languages, nice GUI
interfaces/frameworks, etc.
o What I see "typically"/mostly missing in the Open Source realm is
something that very nicely pulls many of the existing capabilities
together, and makes an excellent framework for doing backups most any
way one might need or want to do so. "Of course" one can often put
something together fairly easily one's self, mostly using existing
stuff, and adding, e.g., some bits of scripting or programming to tie
the various pieces together ... but such tends to be yet another
one-off (semi-)"solution" - and typically doesn't solve the problem
quite the same way the next person wants it solved (though it still
may be useful to many).
What I'd like to see (or would propose):
o An open modular framework - so as to be quite easy to add additional
features/capabilities/functionality/formats, etc. - in many cases by
effectively just tying into and leveraging other existing software.
Should also be able to tie into quite arbitrary software - including
software that's not necessarily Open Source (e.g. tie in via
commands/IPC/APIs).
o shouldn't need to implement the whole bloody thing/wishlist at once
(lest it never get done, or even started)
o should be able to add to / extend it, without breaking backwards
compatibility, and without dragging along excessive cruft for sake of
backwards compatibility - so, sufficiently reasonable up-front
design/planning may be quite important.
o should be written in some capable high-level language(s) (e.g. perl,
python) for the most part - only bits really needing higher levels of
performance optimization (or perhaps security) should be implemented
in (most likely) C - but that may (or may not) be at all necessary.
o should also have a nice capable GUI
o absolutely everything that can be functionally done in the GUI must
also be fully doable from the command line (CLI).
o "command line window" - something I'd really like to see (have seen it
in commercial Veritas Volume Manager before - not sure if it exists
elsewhere). Don't know about current versions of Veritas Volume
Manager, but some older versions had a feature, I think it was called
"open command line window" ... you did that, then you could see, for
anything and everything that could be done in the GUI, one would see
the command(s) that one could use from the command line (e.g. shell
prompt) to accomplish those same tasks. That was a highly useful
feature - e.g. if one didn't know how to do it from CLI, could do it
in the GUI ... and see how it's done, from that, with CLI. One could
then take that information/knowledge, and script it for the next
rather similar several hundred operations, and not spend hours going
clicky clicky clicky with the GUI. Bonus if the GUI also reflects
updates in real-time, even if they're done independently from a
command prompt. I've also seen other software take a somewhat similar
approach - instead of a "command line window", actions done via GUI
would be written to a log file - in the same form they could also be
executed from CLI.
o should be able to write in quite standard archive formats, e.g. tar,
cpio
o shouldn't be limited to only writing existing common archive formats
o should be capable of backups to various media, e.g. disk, flash, tape,
CD-R[W]/DVD+-R[W]/Blu-ray/... - essentially any reasonably suitable
rewritable or write-once random access or streaming media.
o should be capable of writing targets as filesystems, filesystem
images, or archive formats.
o archive formats should be "enhanced" / added to - or, likely better
yet, separate (meta)data should also be backed up - so as to quite
fully back up all relevant data and file metadata. E.g., got ACLs?
Was that ACL data backed up? Did you back up your sparse files
efficiently? If you restore the sparse files you backed up, will the
sparse blocks be restored as sparse, and the null-filled (but
allocated) blocks as not sparse? (See the ACL/sparse-file sketch
below.)
o To the extent feasible, restore - even without specific tool(s) or
software - and bare metal recovery should be rather to quite easy
(e.g. rather analogous to the approach Amanda uses: a short identifying
header block, then GNU tar format - dd and tar suffice to restore).
o It shouldn't be mandatory to restore all metadata - e.g. one may want
to restore to a different filesystem type that doesn't support (the
same) metadata. Not restoring backed up metadata should be able to
generate warning(s) - should also be able to configure suppression of
such warnings (by default or for just specific invocation, such as
with CLI option or likewise via GUI option).
o Metadata backup - there should be reasonable capabilities to back up
metadata - such should be done in forms that both make bare metal
recovery as easy as feasible, and also make the metadata readily
available in human-parseable format(s) - again without needing
specific software tools at recovery time - so that one can make
various design/layout/configuration decisions at restore time (see
the layout-metadata sketch below). E.g.:
o Exactly how were the disks/LUNs partitioned?
o What were the exact sizes of each disk/LUN?
o How exactly was RAID configured? Including spares?
o How big, and how full was each filesystem?
o How much space was used at and beneath any given directory on any
filesystem? (At recovery time, one may wish to consider changing
what is and isn't a separate filesystem, and where - knowing various
sizing information may be relatively critical to being able to
quickly assess and make such decisions.)
o if one splits a filesystem upon recovery, does one know about the
space increase due to what were hard links but will become separate
files?
o More/less metadata - what about "foreign" (e.g. NTFS) filesystems?
What about efficiently backing up filesystems that have less metadata
than POSIX (e.g. FAT)? Do we add archive format(s) to cover those,
and/or have some failsafe (e.g. image) method?
o space efficiency - compression (at least as an option); also:
o option to turn on/off
o option to adjust compression level/algorithm (space vs. CPU
tradeoffs)
o "smart"/auto adaptive option(s) - e.g. to allow the backup software
to determine if compression is/isn't the bottleneck, and accordingly
use higher/lower(or none) compression levels
o play nice/smart with encryption - compression (if at all) before
encryption, not after (see the compress-then-encrypt sketch below).
o scalable/distributable:
o should work reasonably from a small single system writing to
floppies, CD-R[W], DVD+-R[W], tape, or other disk(s), up to
exceedingly large distributed enterprise/institution environments
o must be able to work across network
o should be able to work intelligently with SAN
o network protocols and ports - should be as simple as feasible, but
no simpler (e.g. reduce complexity in allowing through firewalls).
Should also be rather configurable (e.g. if one wants to run it on,
e.g., TCP port 22, 25, 80 or 443 because the firewall configurations
already pass those through, that should be doable).
o should be able to work/deal reasonably with low-performance
components (e.g. slow archive devices that are somewhat unreliable
and should be read verified after write, slow/intermittent network
connections, etc.)
o should be able to integrate/manage large numbers of backup clients
and servers - should interoperate quite nicely
o should be able to separate out various client/server functionality
(e.g. to dedicated system(s)/server(s) and/or hardware), e.g.:
o backup clients (pull source data)
o encryption layer(s)
o backup/restore management/control (track/manage everything -
directly or indirectly)
o backup targets (write data to backup storage)
o access controls, etc. - should be able to manage who can do what
where with what server(s)/data, e.g.:
o summary reporting - no details below host and filesystem (no
reporting on specific filenames or users)
o detail reporting - anything but data in files themselves
o backup - make backups, but no other write access (or read
everything, write only to backup targets)
o restore - read/inspect/search backup data, and write data back to
original (or alternate) source locations
o compartmentalization by user/group/server(s)/etc.
o integration into authentication frameworks (e.g. LDAP)
o deduplication - should be smart about and capable of deduplication,
and should be highly correct about it (always confirm matched data,
not just a matching hash - see the deduplication sketch below).
Should be able to do deduplication on-the-fly (before writing to
non-volatile storage) and/or after-the-fact (e.g. delay it if
necessary for efficiency, or consolidate multiple backup archives
and apply deduplication across them).
o should intelligently and efficiently handle empty/small files (e.g.
those smaller than the hash of their data), but needn't handle them
more efficiently than the source (or target) filesystem itself if
writing to a filesystem rather than an archive format. Should also be
sufficiently efficient when writing such files to an archive format.
o replication - should be able to intelligently replicate backup
archive(s) or any portion thereof to additional copy(/ies) as and
where desired (local or across network). Should be able to specify
synchronous or asynchronous replication, and any applicable
tolerances/delays or scheduling of such. Retention policies should be
configurable to be distinct or identical across replicated target
locations.
o should be highly monitorable - should be able to well report on its
own health (e.g. anything and/or everything that fails proper
self-checks, or that it's been asked to do and wasn't able to do or
had problem(s) with). Should be able to tie in nicely with SNMP and
other means of monitoring. Should be able to disable SNMP if desired.
o database backups:
o should be able to intelligently integrate with databases for proper
"hot" backups, and be able to do "cold" DB backups and know that
they're good "cold" backups - should also be able to integrate with
databases to trigger "cold" backups (cleanly shut down the database,
back up "cold", and restart the database if it was brought down for
the "cold" backup).
o should be able to intelligently detect at least many cases where a
database isn't being backed up properly (e.g. one pointed it at a
filesystem containing a database, but didn't configure it to be able
to talk to the database to safely take a "hot" backup - and the
database is live and changing as the software is being asked to back
up those files). Perhaps this can be combined with the feature(s)
below, but it may be more user-friendly if it can also more
intelligently identify such a database (e.g. report not just the
discovered file(s), but the database(s) using that/those file(s)).
o live filesystems: backing up rw mounted filesystems is an imperfect
science. Should be capable of detecting and, if desired, logging and/or
warning about issues encountered, and, if/as feasible, automagically
recovering and backing the data up safely. E.g., let's say we're
backing up a large file and it changes as we're backing it up - say
we're 3/4 of the way through when data we've already read and written
to the backup changes, and perhaps the file grows - or shrinks - maybe
even shrinks to less than the length of data we've already written.
Can we at least detect and reasonably warn about this? If feasible,
can we automagically correct? (If we read it again end-to-end, and
all's fine and nothing changes that time, can we just discard what we
started with and use the more recent read pass and metadata?) If the
file keeps changing faster than we can read it end-to-end, can we at
least warn about that? Should we skip the re-read recovery attempt if
the file is larger than some threshold size, or if it already took
longer than some amount of time to read? (See the stable-read sketch
below.)
o snapshots - should be capable of using/interacting with
filesystem/device snapshots to take safe(r) full (or
incremental/differential) backups of live rw mounted filesystems
(see the LVM snapshot sketch below).
o incremental/differential backups - should be fully capable of such.
Should also be able to do so intelligently on POSIX filesystems via
use of ctime (see the ctime-based selection sketch below). Should
also be able to (as configured) specifically distrust ctime from any
and/or all specific host(s) and/or filesystem(s), and/or force
validation (e.g. compare hashes and/or actual data, as
desired/configured).
o link-farms - should be able to intelligently do link-farm (multiple
hard link) type backups (see the rsync --link-dest sketch below)
o should provide sufficiently intelligent/capable interfaces to the
backed-up/archived data, preferably in forms familiar to the
user(s)/administrator(s) - e.g. as a (ro) filesystem interface
o should be able to manage and enforce retention schedules (e.g. must
retain until, must destroy by), and should also be able to do that
quite intelligently based on logical criteria (e.g. for some
filesystem on some host, for all backups older than one month, only
keep the most recent backup within each week, and for all those
backups older than a year, only keep the most recent within each
month - see the retention-pruning sketch below). Should be able to
retroactively change retention policies after backups are created.
For backups in archive format, it may be desired to assign distinct
retention policies within an archive after it was created; that
should (by default) warn, but allow the archive to be split to
enforce the retention policy (the split might be deferred until the
first expiration within the archive is to be enforced). The specific
implications of "destroy"/drop in retention policies should be quite
configurable based upon security needs (or lack thereof) - e.g. they
can vary from free/deallocate (allow for reuse) to specified
overwrite operations on key(s) and/or the backed up data.
o encryption and key management - need to be able to intelligently
handle and manage that - should also be able to handle high(er)
security key management policies - e.g. restricting how keys are
stored and managed (e.g. the keys for this subset of servers can only
be held in RAM on that encryption server, they can only be backed up
from that RAM by encrypting to this authorized set of semi-master
keys and then writing only to such-and-such device directly connected
to that encryption server, and that device must also contain a
key/passphrase which authenticates the device's authorization to hold
those keys encrypted to one of those other authorized semi-master
keys).
o public key encryption should be leveraged as and where appropriate,
and likewise for symmetric key encryption (e.g. bulk session
encryption speed efficiency). E.g. TLS/SSL/ssh/gpg/etc. may be
leveraged for various desired purposes (e.g. authenticating client to
server and vice versa).
o most or all encryption bits should be optional (e.g. they may be
undesired in some environments, or the speed/efficiency gains may be
preferred over encryption security) and should also be
tunable/configurable, both in general and quite specifically (e.g.
one may wish to not encrypt backups of data that's public, but may
still wish to ensure there are enough cryptography-based checks to
ensure the integrity of such backups, so that, e.g., at least any
alterations to the backed up data can be detected).
o encryption key management may be leveraged to aid enforcement of
retention policies. E.g., optionally, destroying a backup can be as
efficient as destroying its key (see the per-backup key sketch
below). For higher security requirements, overwrite patterns/levels
can be set/enforced for keys and/or wipes of backup data.
o may be highly useful to also incorporate management of existing backup
frameworks (e.g. Amanda). I.e. don't obsolete what's already in place
and working - at minimum, play nice with it, and perhaps even
leverage it well.
Well, anyway, that's the short start of my wish-list :-) ... I probably
forgot at least some stuff, but that's likely at least most of it.
Oh, can I have world peace too? :-)
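To make a few of those wish-list points more concrete, here are some very
rough sketches (Python, since that's one of the languages I'd suggest
anyway). They're sketches of approach, not finished implementations:
made-up names like "mybackup" and the various paths are placeholders, and
where they shell out to real tools (tar, rsync, gpg, LVM, etc.) one should
of course check the options against one's own versions.

First, the "command line window" idea: every GUI action could be funneled
through one function that both performs the operation and logs the exact
equivalent CLI invocation; the "mybackup" command and the log path are
made up:

import shlex
import subprocess

COMMAND_LOG = "/var/log/mybackup/gui-commands.log"  # hypothetical path

def run_action(argv):
    """Run a backup CLI command and append the exact invocation to the
    log that the GUI's "command line window" displays."""
    line = " ".join(shlex.quote(a) for a in argv)
    with open(COMMAND_LOG, "a") as log:
        log.write(line + "\n")
    return subprocess.run(argv, check=True)

# A GUI "back up /home now" button might end up calling something like:
# run_action(["mybackup", "run", "--client", "localhost", "--path", "/home"])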
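On ACLs and sparse files: GNU tar can be asked to preserve both, provided
it was built with ACL support (check your tar); a getfacl dump is a more
portable fallback for the ACL metadata. Paths here are placeholders:

import subprocess

# Capture sparseness, and ACLs if this tar build supports --acls:
subprocess.run(["tar", "--create", "--sparse", "--acls",
                "--file", "/backups/home.tar", "/home"], check=True)

# Portable fallback for the ACL metadata (restorable with "setfacl --restore"):
with open("/backups/home.acls", "w") as out:
    subprocess.run(["getfacl", "-R", "/home"], stdout=out, check=True)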
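On layout/sizing metadata for bare metal recovery: the idea is simply to
capture it as plain text that's readable at restore time without any
special tools. This assumes the usual Linux utilities (lsblk, sfdisk, df,
mdadm, du) are present; device and directory names are placeholders:

import os
import subprocess

os.makedirs("meta", exist_ok=True)

def save(cmd, outfile):
    """Run a command and capture its plain-text output."""
    with open(outfile, "w") as f:
        subprocess.run(cmd, stdout=f, stderr=subprocess.DEVNULL, check=False)

save(["lsblk", "-b", "-o", "NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT"], "meta/lsblk.txt")
save(["sfdisk", "-d", "/dev/sda"], "meta/sda.partitions")  # exact partitioning
save(["df", "-k"], "meta/df.txt")                          # filesystem fullness
save(["mdadm", "--detail", "--scan"], "meta/mdadm.txt")    # RAID configuration
save(["du", "-xk", "/home"], "meta/du-home.txt")           # space per directory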
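A compress-then-encrypt sketch: encrypted data doesn't compress, so
compression has to come first in the pipeline. This one just chains tar,
gzip and gpg; the recipient key and paths are placeholders:

import subprocess

tar = subprocess.Popen(["tar", "-cf", "-", "/etc"], stdout=subprocess.PIPE)
gz = subprocess.Popen(["gzip", "-6"], stdin=tar.stdout, stdout=subprocess.PIPE)
tar.stdout.close()   # let tar see SIGPIPE if gzip exits early
with open("etc.tar.gz.gpg", "wb") as out:
    subprocess.run(["gpg", "--encrypt", "--recipient", "backup@example.org"],
                   stdin=gz.stdout, stdout=out, check=True)
gz.stdout.close()
tar.wait()
gz.wait()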
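A deduplication sketch that never trusts the hash alone: a hash match only
nominates a candidate, and the actual bytes are compared before a chunk is
treated as a duplicate (in-memory here, just to show the check):

import hashlib

store = {}   # sha256 hex digest -> list of distinct chunks with that digest

def dedup_store(chunk):
    """Return a stored copy of chunk, reusing an existing one only after a
    full byte-for-byte comparison."""
    digest = hashlib.sha256(chunk).hexdigest()
    for existing in store.get(digest, []):
        if existing == chunk:        # confirm matched data, not just the hash
            return existing
    store.setdefault(digest, []).append(chunk)
    return chunk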
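A stable-read sketch for the "file changed while we read it" case: stat
before, read, stat again, and retry a bounded number of times before
giving up so the caller can warn (or fall back to a snapshot). The retry
count and the error handling are arbitrary:

import os

def read_stable(path, max_tries=3):
    """Read a file, re-reading if it appears to have changed mid-read."""
    for _ in range(max_tries):
        before = os.stat(path)
        with open(path, "rb") as f:
            data = f.read()
        after = os.stat(path)
        if (before.st_size, before.st_mtime_ns, before.st_ctime_ns) == \
           (after.st_size, after.st_mtime_ns, after.st_ctime_ns):
            return data
    raise RuntimeError(path + " kept changing while being read")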
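An LVM snapshot sketch: create a snapshot, mount it read-only, archive
from the snapshot, then tear it down. Volume names, the snapshot size and
the mount point are placeholders (and the mount point is assumed to
already exist):

import subprocess

def run(*argv):
    subprocess.run(argv, check=True)

run("lvcreate", "--snapshot", "--size", "1G", "--name", "homesnap",
    "/dev/vg0/home")
run("mount", "-o", "ro", "/dev/vg0/homesnap", "/mnt/homesnap")
try:
    run("tar", "-cf", "/backups/home.tar", "-C", "/mnt/homesnap", ".")
finally:
    run("umount", "/mnt/homesnap")
    run("lvremove", "-f", "/dev/vg0/homesnap")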
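A ctime-based selection sketch for incrementals: anything whose inode
change time is newer than the previous backup's start time is a candidate
for this run (with the option, per above, of distrusting ctime and
comparing hashes or actual data instead):

import os

def changed_since(root, last_backup_epoch):
    """Yield paths whose ctime is newer than the previous backup's start."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ctime > last_backup_epoch:
                    yield path
            except OSError:
                pass   # vanished mid-walk; a real tool would log this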
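An rsync --link-dest sketch for link-farm backups: each run looks like a
full tree, but unchanged files are hard-linked against the previous run.
Destination paths are illustrative:

import subprocess
import time

stamp = time.strftime("%Y-%m-%d")
subprocess.run(["rsync", "-a", "--delete",
                "--link-dest=/backups/host1/latest",
                "/home/", "/backups/host1/" + stamp + "/"], check=True)
# Point "latest" at the run just completed:
subprocess.run(["ln", "-sfn", stamp, "/backups/host1/latest"], check=True)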
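A retention-pruning sketch for the example policy above (keep everything
for a month, the newest per week up to a year, the newest per month beyond
that); a real policy would of course be configurable rather than
hard-coded:

import datetime as dt

def to_keep(backups, now):
    """backups: iterable of (datetime, backup_id); returns ids to keep."""
    per_week, per_month, kept = {}, {}, set()
    for when, bid in sorted(backups):
        age = now - when
        if age <= dt.timedelta(days=31):
            kept.add(bid)                             # keep everything recent
        elif age <= dt.timedelta(days=365):
            per_week[when.isocalendar()[:2]] = bid    # newest in ISO week wins
        else:
            per_month[(when.year, when.month)] = bid  # newest in month wins
    return kept | set(per_week.values()) | set(per_month.values())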
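A per-backup key sketch for tying key management to retention: give each
backup its own data key, and destroying that one key is then as good as
destroying the backup. The third-party "cryptography" package's Fernet
stands in here for whatever cipher would really be used, and in a real
tool the per-backup key would itself be wrapped to one or more
(semi-)master keys rather than written out in the clear:

from cryptography.fernet import Fernet   # third-party "cryptography" package

def encrypt_backup(data, keyfile):
    """Encrypt one backup under its own fresh symmetric key."""
    key = Fernet.generate_key()
    with open(keyfile, "wb") as f:
        f.write(key)
    return Fernet(key).encrypt(data)

# Destroying (or securely overwriting) keyfile later renders the ciphertext
# unrecoverable - "destroy backup" becomes as cheap as "destroy key".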
> Date: Mon, 15 Mar 2010 13:32:02 -0500
> From: David Hinkle <hinkle at cipafilter.com>
> Subject: [sf-lug] New Open Source Software Proposal
> To: "sf-lug at linuxmafia.com" <sf-lug at linuxmafia.com>
>
> David Rosenstrauch, Alex and I have been talking back and forth
> about how to get what is probably a pretty standard scenario
> accomplished with open source tools. We can't seem to find any pile
> of software that can make it happen, so I was thinking of writing it.
> First I want to solicit some feedback on my idea.
>
> The plan would be an rsync replacement. Instead of syncing local
> files to a remote fileserver over ssh, it would instead break the
> local files into chunks, independently encrypt each chunk, and sync
> those chunks over. The chunks could be stored in a SQLite
> database along with the checksum of the original unencrypted contents
> of each chunk and the checksum of the unencrypted file. We would
> key these chunks based on the encrypted filename.
>
> If we save the data in this manner, on subsequent backups, the client
> can ask for a list of checksums, compare those checksums to local
> files, and then transmit any chunks of those local files that may
> have been changed.
>
> This would mean we should be able to get rsync-like performance
> backing up to an encrypted datastore on a remote server that has no
> knowledge of the encryption key. We would also get the awesome ease
> of use of rsync over ssh. Any server you have shell access to and
> that you can upload files to you could use as a safe remote
> repository for your data.
>
> What do you guys think? Useful? Not useful? Would you use it?
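(For concreteness, here's a very rough sketch of the chunk-and-checksum
store described above - the chunk size, schema and names are arbitrary,
and any bytes-in/bytes-out encryption function could be plugged in. The
store only ever holds checksums and ciphertext, so on a later run the
client can ask which checksums are already present and send only the
changed chunks:)

import hashlib
import sqlite3

CHUNK_SIZE = 1 << 20   # 1 MiB, arbitrary

def store_file(db_path, path, encrypt):
    """Split a file into chunks; store each chunk's plaintext checksum plus
    its encrypted bytes in SQLite."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS chunks
                  (file TEXT, seq INTEGER, sha256 TEXT, data BLOB,
                   PRIMARY KEY (file, seq))""")
    with open(path, "rb") as f:
        for seq, chunk in enumerate(iter(lambda: f.read(CHUNK_SIZE), b"")):
            digest = hashlib.sha256(chunk).hexdigest()
            # On a later run, chunks whose digest is already stored
            # needn't be re-encrypted or re-sent.
            db.execute("INSERT OR REPLACE INTO chunks VALUES (?,?,?,?)",
                       (path, seq, digest, encrypt(chunk)))
    db.commit()
    db.close()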