Copying Directory Trees

Q: What is the best way to move a directory tree (or filesystem AKA partition) between physical drives?

A: For reasons of data-protection, you would want to copy the data, and later delete the original after verifying that the copy appears OK.
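As a rough sketch of the "verify, then delete" step, the copy can be compared against the original before anything is removed. The helper name verify_copy and the paths in the usage comment are made up for illustration. Note that "diff -r" compares file contents only, not ownership, permissions, or timestamps.

```shell
#!/bin/sh
# verify_copy SRC DST: succeed only if both trees hold the same files
# with the same contents. (verify_copy is a hypothetical helper name.)
verify_copy() {
    src=$1
    dst=$2
    # diff -r walks both trees and reports files that differ
    # or that exist on one side only; any difference gives a non-zero exit.
    diff -r "$src" "$dst" >/dev/null
}

# Usage (hypothetical paths):
#   verify_copy /mnt/old/data /mnt/new/data && echo "copy looks OK"
```

Only once this succeeds (and you have checked attributes by whatever means you trust) would you go on to delete the original.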

Directory trees can be moved using rsync, "cp -ax", cpio, or tar. Filesystems can be copied using those methods, or dd, or dump/restore. The situation is complicated by the need in some cases to replicate or not replicate special files, file attributes, device files, and separate filesystems mounted within the directory. Some methods can do "sparse" copying, removing any empty space in the middle of the files that has been allocated but not written to (which you may or may not wish to do, depending on how those files will be used).

Note: On-the-fly compression (option -z for most tools cited) suits copying across networks, to save transmission time; it's counter-productive when copying within a single host.

1. rsync[1]: A good, general-purpose recipe is "rsync -avz olddirectory/ newdirectory/".

Or, with more detail at the cost of some complexity, and generalising the method to include cross-network copying between hosts: "rsync -aPSvx --numeric-ids --delete olddirectory/ destination-host:newdirectory/"

"-a" (archive) symbolic links, devices, attributes, permissions, ownerships, and modification times are preserved,

"-v" create verbose output regarding progress,

"-z" with gzip compression (which can be omitted as superfluous for normal copying within a single machine, but should be included for copying between Linux hosts as detailed at the bottom of this FAQ).

Option "-x" (don't cross filesystem boundaries) is available, where that is appropriate. If you don't use it, beware of including /proc accidentally.

An "-S" option is available to cause rsync to attempt sparse copying (almost always desirable).

Option "-P" means keep partially transferred files (if any) rather than throwing them away, and also show a progress indicator.

Option "-v" increases verbosity.

Option "-x" prevents accidentally crossing filesystem boundaries (avoiding which is almost always desirable).

Option "--numeric-ids" for network transfers between hosts means don't rely on preserving the names of owning users and groups, but instead just keep the same numeric values for those (usually the right thing to do).

Option "--delete" means delete files in the destination directory tree that don't exist in the source directory tree. Of course, you should always be really sure you aren't clobbering anything in the destination you need to keep.

Of all the methods detailed here, rsync will probably be the quickest for file-copying that does not involve entire filesystems. It also benefits from having an easy-to-master syntax. It's probably the method you should prefer.
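Because "--delete" removes files from the destination, it can be worth previewing the run first. The sketch below adds rsync's "-n" (dry-run) option to the detailed recipe, so that rsync only reports what it would transfer or delete. The helper name preview_sync is an assumption for illustration, and the example is local-to-local rather than host-to-host.

```shell
#!/bin/sh
# preview_sync SRC DST: list what the detailed recipe would copy or delete,
# without changing anything. (preview_sync is a hypothetical helper name.)
preview_sync() {
    # -n (--dry-run) makes rsync report its actions only.
    # The trailing slashes mean "the contents of" each directory.
    rsync -aPSvx --numeric-ids --delete -n "$1"/ "$2"/
}

# Usage (hypothetical paths):
#   preview_sync olddirectory newdirectory
```

If the listing looks right, rerun the same command without "-n".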

Here's another careful rsync incantation for copying a mounted root filesystem without pitfalls (borrowed with grateful thanks from Arch Linux's wiki):

rsync -aAXHSv /* /path/to/shared/folder --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/lost+found,/home/*/.gvfs}

Explaining the options not already detailed above:

Option "-A" preserves ACLs (implies -p, which means preserve permissions).

Option "-X" preserves extended attributes.

Option "-H" preserves hard links.

The "--exclude" list comprises common virtual or RAMfs filesystems that you really wouldn't want to even try to copy.
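The "--exclude={...}" form relies on brace expansion, which bash (and some other shells) perform but plain POSIX sh does not. As a small sketch of what the shell actually hands to rsync, the helper below prints the same options one per line. The name root_excludes is made up for illustration; it runs nothing and copies nothing.

```shell
#!/bin/sh
# root_excludes: print one --exclude option per line, matching what
# brace expansion would produce for the list in the recipe above.
# (root_excludes is a hypothetical helper name.)
root_excludes() {
    for pattern in '/dev/*' '/proc/*' '/sys/*' '/tmp/*' '/run/*' \
                   '/mnt/*' '/media/*' '/lost+found' '/home/*/.gvfs'; do
        # Quoting keeps the * for rsync to interpret, not the shell.
        printf '%s\n' "--exclude=$pattern"
    done
}

# Usage:
#   root_excludes
```

In a shell without brace expansion, each of those options would have to be typed out separately on the rsync command line.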

2. "cp -ax olddirectory newdirectory" is the simplest method, but with some disadvantages.

The "-a" means preserve symbolic links, preserve file attributes if possible, and copy directories recursively.

The "-x" means stay on this filesystem, i.e., do not copy any files within the directory that are from a different filesystem mounted onto this one. Obviously, that recipe is useful only if all files of interest are within a single filesystem. If not, you can omit the "-x", but then must watch out for unintended side effects, e.g., from accidentally copying the /proc filesystem.

The option "--sparse=[value]" controls whether (and when) cp will attempt sparse copying. Possible values are "always", "never", and "auto". The cp command's heuristic for "auto" mode (which is cp's default) isn't very sophisticated.
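To see what sparse copying does, here is a minimal sketch assuming GNU cp. It creates a 1 MiB file consisting of one long hole plus a single data byte, then copies it with "--sparse=always". The helper names make_sparse_file and sparse_copy are made up for illustration.

```shell
#!/bin/sh
# make_sparse_file PATH: create a 1 MiB file that is almost entirely a hole.
# Seeking past the start and writing one byte leaves the skipped range unallocated
# on filesystems that support holes.
make_sparse_file() {
    dd if=/dev/zero of="$1" bs=1 count=1 seek=1048575 2>/dev/null
}

# sparse_copy SRC DST: copy a file, asking GNU cp to recreate holes
# wherever the source contains long runs of zero bytes.
sparse_copy() {
    cp --sparse=always "$1" "$2"
}

# Usage (hypothetical file names):
#   make_sparse_file big.img && sparse_copy big.img copy.img
```

Comparing "du -k copy.img" (space actually allocated) with "ls -l copy.img" (apparent size) shows whether the holes survived; the outcome depends on the destination filesystem.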

3. "cpio": The syntax for this command is more than a bit opaque. One recipe that will generally serve follows: "cd olddirectory && find . -depth -print | cpio -padmuv newdirectory" The first part of that incantation (before "cpio") finds and furnishes the names (pathnames relative to olddirectory) of all files within olddirectory, and find's -depth option causes it to process each directory's contents before the directory itself, so that permissions can be preserved. Because of the "cd", newdirectory must be given as an absolute path or relative to olddirectory. The "-p" selects cpio's pass-through (copy) mode. The remaining options are:

"-a" re-set access times of the files being copied (so that it does not look like they have just been read),

"-d" create directories where needed,

"-m" retain the originals' last-modified times,

"-u" unconditionally overwrite any conflicting, preexisting files in newdirectory, and

"-v" create verbose output regarding progress.

Option "--sparse" (when used in the cpio operating mode we employ, here) causes files with large blocks of zeros to be written sparsely on newdirectory.

Caution: cpio is suitable for copying data (specifically, data that include integers in binary formats) between hosts only if the two machines' CPUs use the same byte order and wordsize. If in doubt, use a different tool, such as tar or rsync.
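The cpio recipe can be wrapped up as a small function, sketched here under the assumption that the destination directory already exists. The name copy_tree_cpio is made up for illustration. The destination is resolved to an absolute path first, since cpio runs only after the "cd" into the source.

```shell
#!/bin/sh
# copy_tree_cpio SRC DST: copy the tree under SRC into the existing directory DST
# using cpio's pass-through mode. (copy_tree_cpio is a hypothetical helper name.)
copy_tree_cpio() {
    # Make DST absolute, because the working directory changes below.
    dst=$(cd "$2" && pwd) || return 1
    # -depth comes before -print; GNU find is believed to warn when an
    # option such as -depth follows a test or action.
    (cd "$1" && find . -depth -print | cpio -padmu "$dst")
}

# Usage (hypothetical paths):
#   copy_tree_cpio olddirectory /mnt/newdisk/newdirectory
```

File names containing newlines will confuse this pipeline, as they do the original recipe.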

4. "tar": One general-purpose recipe is "(cd olddirectory && tar Sczpf - . ) | (cd newdirectory && tar Sxvzpf -)"

"-c" create an archive (in this case in only a virtual sense, as it is piped into the unpack half of the command),

"-p" preserve permissions,

"-f" using a file instead of tape device,

"-" where the aforementioned file is standard output for tar creation or standard input for tar extraction,

"-S" handle sparse files efficiently (keeping the number of blocks consumed to a minimum),

"-v" create verbose output regarding progress,

"-x" extract from an archive.

Do not use the verbose (-v) flag on both halves of this command string.

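As with "cp", there is an option to stay within a single filesystem: "--one-file-system", given on the archive-creating half of the command. (Older GNU tar releases accepted "-l" as its short form, but in current releases "-l" means "--check-links", so spell out the long option.)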
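Putting the tar pieces together, here is a sketch of a same-host copy, assuming GNU tar. It omits "-z", since compression buys nothing within one machine, and it uses the long "--one-file-system" option on the creating side. The name copy_tree_tar is made up for illustration.

```shell
#!/bin/sh
# copy_tree_tar SRC DST: copy the tree under SRC into DST through a tar pipe,
# staying within SRC's filesystem. (copy_tree_tar is a hypothetical helper name.)
copy_tree_tar() {
    mkdir -p "$2" || return 1
    # Left side: create (-c) a sparse-aware (-S) archive on standard output.
    # Right side: extract (-x) from standard input, preserving permissions (-p).
    (cd "$1" && tar --one-file-system -cSpf - .) | (cd "$2" && tar -xpf -)
}

# Usage (hypothetical paths):
#   copy_tree_tar olddirectory newdirectory
```

The function's exit status is that of the extracting tar only, so a failure on the creating side may go unnoticed unless you also check the result.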

5. "dd": This utility is included here mainly to warn that it is usually the wrong tool. It does bit-by-bit copies of raw devices, which is useful where (e.g.) you want to copy an image of a floppy disk to a file, or vice-versa. Please note that there is no attempt to map around bad sectors on physical media. You could use it to clone one (unmounted) filesystem to the corresponding physical space on a different disk, provided that the disks have identical disk geometry. Otherwise, use one of the other methods.

Syntax: "dd if=/dev/olddevice of=/dev/newdevice"

Here, "if" denotes input file. It might be specified as /dev/sdb1, for example.

"of" denotes output file, e.g., /dev/sdc1.

6. "dump" and "restore": These are Berkeley Unix (BSD) commands to perform tape backup, but can also be used in a command pipe to copy an entire filesystem (but not directories that do not encompass an entire filesystem).

Syntax: "dump -0f - olddirectory | (cd newdirectory && restore -vrf -)"

"-0" level-zero aka full backup,

"-f" use a file rather than the default tape device,

"-" have that device be standard output for dump or standard input for restore,

"-v" create verbose output regarding progress,

"-r" rebuild a filesystem, setting its "dump" backup level to zero.

Although dump and restore work on entire filesystems, you will want to make sure the source and destination filesystems both exist and are mounted. If necessary, create the destination filesystem with one of the mkfs tools (e.g., mke2fs).

Important: Never use dump/restore on live filesystems, only on absolutely quiescent ones, preferably not even mounted read/write.

All of the above methods also lend themselves to working across networks (between Linux hosts), for machine replication or backup. They require a network transport layer to run over, which can be rsh, SSH, or nc (netcat).

Here are some examples:

"(sleep 10; cd olddirectory && tar Sczpf - . ) | ssh username@newhost 'cd newdirectory && tar Sxvzpf -'"

"(sleep 10; cd olddirectory && tar Sczpf - . ) | nc newhost 7777" on the originating host, and "cd newdirectory && nc -l -p 7777 | tar Sxvzpf -" on the destination host.

"rsync -e ssh -avz olddirectory username@newhost:newdirectory"

"rsync -avz olddirectory username@newhost:newdirectory" (to use rsync's default remote-shell transport: rsh in old releases, ssh since rsync 2.6.0).

"ssh root@otherbox 'cd / && find . -xdev -depth -print | cpio -ovc' | dd of=/dev/rmt/0mbn bs=64k"
"ssh root@otherbox 'cd /usr && find . -xdev -depth -print | cpio -ovc' | dd of=/dev/rmt/0mbn bs=64k"
(to do backup to a tape drive over SSH).

(You may get occasional rsync hangs when attempting to copy large groups of files, typically over 1 GB, over SSH transport, depending on the versions of rsync and SSH you're using. If rsync always seems to hang on the same file, you're probably encountering a known ssh bug involving deadlocked select(2) calls. See:
http://gcc.gnu.org/ml/gcc/2000-05/msg00248.html , http://zgp.org/pipermail/linux-elitists/2000-August/000789.html , http://www.roads.lut.ac.uk/lists/psst/2000/02/0086.html )

"cd olddirectory && find . -depth -print | cpio -o -H newc | ssh username@newhost 'cd newdirectory && cpio -idmuv'" (cpio's pass-through mode, "-p", copies only within a single host, so across a network the sending side creates an archive with "-o" and the receiving side unpacks it with "-i").

The above recipes will not always work reliably on (or with) other Unixes: In particular, the tar and cp versions assume enhancements that may not be supported in non-Linux implementations of those commands.

Last, a general warning: Beware that any backup-restoration or file-copying program is potentially dangerous to the health of your data. Even if you ensure that you are copying data rather than simultaneously copying and removing the originals (moving the files), you must still be concerned that the very act of writing the copies might be harmful — if you specify the wrong options, or the wrong destination, or a destination where you will be replacing or otherwise wiping out existing contents that you would prefer to preserve. Be careful, make sure you have current backups before you start, and think twice before hitting the Enter key.
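One cheap precaution against wiping out existing contents is to check the destination before copying. The sketch below succeeds only if the destination does not exist yet or is an empty directory. The name destination_is_clear is made up for illustration, and a check like this is no substitute for backups.

```shell
#!/bin/sh
# destination_is_clear DIR: succeed if DIR does not exist yet or is an
# empty directory. (destination_is_clear is a hypothetical helper name.)
destination_is_clear() {
    if [ ! -e "$1" ]; then
        return 0
    fi
    if [ ! -d "$1" ]; then
        return 1
    fi
    # ls -A lists every entry except . and .. (hidden files included);
    # empty output means the directory is empty.
    [ -z "$(ls -A "$1")" ]
}

# Usage (hypothetical path):
#   destination_is_clear newdirectory || echo "newdirectory is not empty; stopping"
```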

(Thanks to Aaron Lehmann of SVLUG for reminding us about netcat.)

[1] unison (http://www.cis.upenn.edu/~bcpierce/unison/) syncs both ways. (rsync is one-way.)