Data Backup and Recovery in the Post-Terabyte Era
Disclaimer: I am now employed by a company for which data storage and backup are part of the business, but the ideas in this post come from an earlier time, when I worked for a (relatively big) ISP, and are not connected with my current job.
We have been backing up data from our disks to tapes ever since disks came into existence (we already had tapes by then). But with the development of technology over the past decade or two, this approach has become increasingly problematic. The main trouble is that the density of data on disk storage devices grows faster than the speed at which that data can be copied to or from tape backup devices. It may be tolerable that backing up some database takes several hours (if you can do that without taking it offline), but if recovering it from that backup, after the “online” copy is lost, also takes several hours, then the disaster recovery procedure becomes a disaster of its own.
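To put a rough number on it, here is a back-of-envelope sketch; the 10 TB dataset size and the 120 MB/s sustained drive throughput are illustrative figures of my own, not measurements from any particular setup:

```python
# Back-of-envelope restore-time estimate. The dataset size and throughput
# passed in below are illustrative assumptions, not measured figures.

def restore_hours(dataset_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to stream dataset_tb terabytes at throughput_mb_s MB/s."""
    total_mb = dataset_tb * 1024 * 1024        # TB -> MB (binary units)
    return total_mb / throughput_mb_s / 3600   # seconds -> hours

if __name__ == "__main__":
    # A hypothetical 10 TB database restored at a sustained 120 MB/s:
    print(f"{restore_hours(10, 120):.1f} hours")   # roughly a full day
```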
On the other hand, the price of disk storage is falling faster than that of tape storage, and the proposition of buying two or three times as many disk arrays instead of a tape library looks more and more viable. Disks are still more expensive than tape, but not by that much. There are even commercially available “solutions” that back up to disk instead of tape. The problem is that merely replacing the medium does not do much to speed up the process.
Now, let’s try to decompose the problem we are solving when we set up backups on our systems. There are two kinds of errors that a proper backup can mitigate: equipment-imposed and application-imposed. I am not calling the former “hardware errors” because I also include firmware and software errors (RAID implementation, block device drivers, filesystem code) in this class. And I am not calling the latter “human errors” because, well, errors in the filesystem code are human errors too, and I am lumping together a sloppy sysadmin who accidentally typed “rm * .bak” and a sloppy web application programmer who failed to validate input and allowed an SQL injection.
For the first class of errors, the “equipment errors”, there is a much better remedy than tape backup: RAID. In fact, most shops run RAID anyway, to reduce the impact of long restoration times, and in my experience there are very few cases where you have to fall back to tape after an equipment failure. All you need to do is choose the “right” RAID solution (not all RAID controllers are born equal) and manage it the “right” way (i.e., actually notice when an element has failed and replace it quickly). I’d go so far as to say that if you have a decent RAID storage system, you can stop worrying about equipment failures altogether. And it won’t cost you a fortune.
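As an illustration of the “manage it the right way” part, here is a minimal Python sketch that notices a degraded Linux software RAID by scanning /proc/mdstat. The parsing and the print-as-alert are my assumptions for the sake of the example; in practice you would rather wire up mdadm --monitor or whatever your controller vendor ships:

```python
# Minimal sketch: detect degraded Linux md RAID arrays by scanning
# /proc/mdstat for a status string with a missing leg, e.g. "[U_]".
# A production setup would rely on mdadm --monitor or vendor tooling instead.

import re

def degraded_arrays(mdstat_path: str = "/proc/mdstat") -> list:
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)            # start of a new array stanza
            elif current and "_" in "".join(re.findall(r"\[([U_]+)\]", line)):
                degraded.append(current)        # "[U_]" means one member is gone
                current = None
    return degraded

if __name__ == "__main__":
    for name in degraded_arrays():
        print(f"ALERT: {name} is running degraded, replace the failed disk now")
```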
The second class of errors, the “human”/“application” errors, is much worse. Although getting rid of them “almost completely” is possible (I, for one, certainly hope that the procedure for launching a nuclear attack is made relatively error-resistant!), it is not practical for most “normal” businesses to hire twice the staff and establish approval procedures for every keypress. So there must be a way to go back in time and restore the “most recent known good” state of the data. Normally, that means restoring from the last (or, if you are extremely unlucky, some earlier) generation of backups.
Now, there is a better way to achieve the same goal: data storage capable of “persistent snapshots”. While (usually non-persistent) snapshots have long been used on high-end storage systems in conjunction with regular tape backup, they were not considered an alternative to it. On the other hand, Apple’s Time Machine is specifically designed as an alternative to traditional backup, remedying the “second class” of errors, but in a typical configuration it is not backed by anything that would protect against errors of the “first class”.
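To make the “persistent” part concrete, here is a minimal sketch that takes a named, timestamped snapshot of an LVM logical volume. The volume group and volume names (vg0/data), the copy-on-write reserve size, and the choice of LVM itself are assumptions for illustration, not a recommendation of a particular product:

```python
# Minimal sketch: take a persistent, timestamped snapshot of an LVM logical
# volume. The vg0/data names and the 5 GiB copy-on-write reserve are
# assumptions made for illustration only.

import subprocess
from datetime import datetime, timezone

def take_snapshot(vg: str = "vg0", lv: str = "data", size: str = "5G") -> str:
    snap_name = f"{lv}-snap-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}"
    subprocess.run(
        ["lvcreate", "--snapshot", "--size", size,
         "--name", snap_name, f"/dev/{vg}/{lv}"],
        check=True,
    )
    return snap_name

if __name__ == "__main__":
    print("created", take_snapshot())
```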
So, the data storage system I see for today and tomorrow is a decent RAID (e.g., RAID1 over TCP/IP via DRBD has worked really well for me, and it even allows geographically distributed configurations) with a filesystem (or volume manager) capable of reliable persistent snapshots on top of it. Throw in a policy manager that creates new snapshot generations and removes old, no longer needed ones, plus a monitor that alerts you when a component fails, and you get a tape-less system that is as reliable as one with tape backup and can recover from disasters much faster.
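The “policy manager” part can be as simple as a cron job. Below is a hedged sketch that keeps only the newest few snapshot generations and removes the rest, reusing the hypothetical lv-snap-timestamp naming convention from the previous sketch; the generation count and the use of lvs/lvremove are assumptions, not prescriptions:

```python
# Sketch of the "policy manager": keep the newest N snapshot generations of an
# LVM volume and remove the rest. The naming convention (<lv>-snap-<timestamp>)
# matches the creation sketch above; the rest is assumed for illustration.

import subprocess

def prune_snapshots(vg: str = "vg0", lv: str = "data", keep: int = 7) -> None:
    # List logical volume names in the volume group, one per line.
    out = subprocess.run(
        ["lvs", "--noheadings", "-o", "lv_name", vg],
        check=True, capture_output=True, text=True,
    ).stdout
    snaps = sorted(
        name.strip() for name in out.splitlines()
        if name.strip().startswith(f"{lv}-snap-")
    )
    # Timestamped names sort chronologically; drop all but the newest `keep`.
    for name in (snaps[:-keep] if keep else snaps):
        subprocess.run(["lvremove", "-f", f"{vg}/{name}"], check=True)

if __name__ == "__main__":
    prune_snapshots()
```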