David Brown's Blog

David Brown's Blog

David Brown  //  Software engineer/Jazz musician.

Jun 2 / 4:17am

Incremental Backups Considered Harmful

Well, OK, maybe not harmful, but really hard.

It seems like it should be fairly easy. Most archive utilities contain some type of argument to only backup files that are newer than a certain date. With some clever scripting you can arrange to do a "full" backup initially, and then only dump files that have been modified since a certain date.

This has some real problems, mostly having to do with the ability to rename directories. You see, when you rename a directory, the files inside of the directory themselves aren't modified. At best, the incremental backup will completely miss that the directory has been moved (or will notice the new directory, but not the contents). Now modified files in the new directory name will restore in the new place, even though the restored image has the rest of the directory in the old place. Generally, the older the backup the less the resulting restore will actually look like the tree you backed up.

The other problem is that it is important to remove files that have been removed since the last backup as well. Otherwise, at best, the tree will be cluttered with old files that were intended to be deleted. At worst, there won't be enough room in the restore volume to hold all of the data.

Several programs try to do better at this. There are a couple of different approaches to this. Any of them can theoretically work, but minor bugs in the code tend to cause significant data loss.

The first approach is that used by the 'dump' program. This program only copies files that have changed since the last backup. Instead of worrying about figuring out where things have moved when dumping, dump just writes out the contents of directories (names and the inode number of each file) that have been modified. This places the burden of work on the restore utility to figure out how to put the filesystem back during the incremental restore. Restore maintains a database of every file restored, and tries to coordinate a set of file mores as part of the restore to do any directory rearrangement necessary to get the filesystem back into the state it was in. This usually works. The main advantage here is that it makes the backup process quite simple (and fairly fast). However, but bulk of the complexity is in the restore utility, which is run infrequently. Moving the program's complexity to a part that is rarely run tends to keep difficult bugs from being found.

Another approach is that used by GNU tar, as well as by my own Adump backup suite. This approach stores a small (or not so small) database of what the directories looked like at each backup. When performing an incremental, it compares against this list to determine what to back up. In general, when a directory is renamed, these programs will write the entire contents of the directory to the backup again, and write something to indicate that the old directory should be deleted. More of the complexity is moved to the backup portion, which is good. The main disadvantage here is that they tend to copy a lot more data than is really needed.

To finish, I'd like to discuss a little bit, the difference between incremental and differential backups. A clearer way to describe all of these types of backups is with the notion of backup levels. A level '0' backup is a full dump of everything in the filesystem. A level 'n' backup is a dump of all files that have been changed since the last level 'n-1' backup. Using this terminology, a differential backup would follow the pattern 0, 1, 1, 1, 1… and incremental backups would follow the pattern 0, 1, 2, 3… The differential backup has the advantage of only needing to restore two dumps. However, the dumps tend to grow larger and larger, eventually needing to do periodic level 0 dumps again. The incremental backups tend to be small, but the entire chain of backups has to be restored, again encouraging more frequent level 0 backups.

A good solution is a compromise between these. A simple algorithm, based on the Tower of Hanoi problem is to count in binary from zero. For each backup, use a backup level based on the number of set bits in the binary number. This results in the series 0, 1, 1, 2, 1, 2, 2, 3, 1… It makes a good compromise, periodically performing larger earlier level backups so that the incremental chain is shorter (log n, instead of n).

Coming up, something much better, content addressable storage.