Wednesday, September 14, 2011

harddisk data safety

When I use a harddisk I know I'm trusting my data, which I care about, to a device that is guaranteed to give errors.

The rate at which it gives errors is listed in the harddisk's specifications. For a 2TB disk the non-recoverable read error rate can be 1 sector in 10^15 bits read. That's a good buy. But there are also disks out there with a higher (worse) error rate, at 1 in 10^14. What does that spec mean? The disk actually has more raw errors than one in 10^15 bits, but most of them can be corrected (by re-reading, error correction calculations, etc). Once in every 10^15 bits, on average, it can't recover. With 8 bits per byte that translates to one unrecoverable sector per roughly 125TB read. That sounds like veeeeery little. But is it? If I read the full disk a bit over 60 times, I can expect one read error. A sector is still usually 512 bytes (but this will very soon change to 4k). So with about 60 full disk reads you risk losing 512 bytes of data. That's nothing, right?
Think again. If those 512 bytes happen to be in the middle of the file you keep your administration in, you had better have a backup. You have backups, right? And not on the same disk, right?
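
To make that arithmetic concrete, here is a small back-of-the-envelope calculation in Python, using just the numbers from above (a 2TB disk and a 1-in-10^15-bits unrecoverable read error rate):

```python
# Back-of-the-envelope: how much data can I read, on average,
# before hitting one unrecoverable read error (URE)?

URE_RATE = 1e-15     # one unrecoverable error per 10^15 bits read
DISK_BYTES = 2e12    # 2TB disk (decimal terabytes)

bytes_per_error = (1 / URE_RATE) / 8          # 1.25e14 bytes = 125TB
full_reads_per_error = bytes_per_error / DISK_BYTES

print(f"Data read per URE: {bytes_per_error / 1e12:.0f}TB")
print(f"Full reads of a 2TB disk per URE: {full_reads_per_error:.1f}")
# -> 125TB, or about 62 full reads
```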

The situation gets much worse as disks get larger. When you buy a 3TB disk you only need about 40 full reads. With a 4TB disk you're down to about 30. See what I'm getting at? And if you have a 2TB disk with a 1-in-10^14 error rate, you can expect a read error every 6 full reads. That's data loss, people. Or is it? The read error was 'unrecoverable', but does that mean that all retries will fail too? I'd love some feedback on that.
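
The same sums for the larger disks and the cheaper error rate, plus a stab at that 'or is it?' question: if we assume every bit fails independently (a simplification; real errors cluster per sector and depend on the drive's condition), the chance of at least one unrecoverable error during a single full read is 1 - (1 - p)^bits. A quick sketch:

```python
import math

def full_reads_per_error(disk_bytes, ure_rate):
    # Expected amount of data read per URE, expressed in full-disk reads.
    return (1 / ure_rate) / 8 / disk_bytes

def p_ure_in_one_full_read(disk_bytes, ure_rate):
    # 1 - (1 - p)^bits, computed via log1p/expm1 to avoid rounding
    # problems with numbers as small as 1e-15.
    bits = disk_bytes * 8
    return -math.expm1(bits * math.log1p(-ure_rate))

for tb in (2, 3, 4):
    for rate in (1e-15, 1e-14):
        disk = tb * 1e12
        print(f"{tb}TB disk, URE 1 in {1 / rate:.0e} bits: "
              f"{full_reads_per_error(disk, rate):5.1f} full reads per error, "
              f"{100 * p_ure_in_one_full_read(disk, rate):4.1f}% chance per full read")
```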

Raid is one way to solve this. But beware: if you use raid-5, or even raid-6, you can be making the problem worse! If one disk in your raid-5 setup fails, you replace it and a rebuild takes place. That rebuild is a full read of all the remaining disks! If an unrecoverable read error occurs during that read, there is no redundancy left to reconstruct that sector, and *poof*, data is gone. Cheap raid controllers then simply stop the rebuild and you have just lost everything.
Using raid-1 or raid-10 is a much better solution if you value your data. But more expensive too.
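
To put a rough number on that rebuild risk: the rebuild has to read every surviving disk from start to finish, so under the same independent-bit-error assumption as above (real drives and controllers differ, and some will retry or skip past a bad sector) the chance of the rebuild tripping over a URE looks like this:

```python
import math

def p_rebuild_hits_ure(surviving_disks, disk_bytes, ure_rate):
    """Chance of at least one URE while reading all surviving disks in full."""
    bits = surviving_disks * disk_bytes * 8
    return -math.expm1(bits * math.log1p(-ure_rate))

# Example: a 4-disk raid-5 of 2TB drives loses one disk,
# so the rebuild reads the 3 survivors end to end.
for rate in (1e-15, 1e-14):
    p = p_rebuild_hits_ure(3, 2e12, rate)
    print(f"URE 1 in {1 / rate:.0e} bits: {100 * p:.0f}% chance the rebuild hits an error")
# -> roughly 5% at 1-in-10^15 and roughly 38% at 1-in-10^14
```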

However, most of the time harddisks are not fully used. New disks especially are mostly empty at first. So what if we keep this error rate but duplicate the data on the disk? If all your data is on the disk twice, the chance of losing it is much lower. And even lower when more copies are made. Doing this yourself (saving 6 copies of the same file on your filesystem) is a lot of extra work, though.
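
Roughly speaking: if a particular sector is unreadable with some small probability, then with k copies you only lose it when all k copies are bad, so the loss probability is that small number raised to the power k. A quick sketch, assuming the copies fail independently (which is optimistic, since they still share the same spindle, heads and electronics):

```python
SECTOR_BITS = 512 * 8
URE_RATE = 1e-14                        # the cheaper disk from above
p_sector_bad = SECTOR_BITS * URE_RATE   # ~4e-11; small enough that bits * p is a fine approximation

for copies in range(1, 5):
    print(f"{copies} cop{'y' if copies == 1 else 'ies'}: "
          f"loss probability per sector ~ {p_sector_bad ** copies:.1e}")
```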

There are filesystems that do this, somewhat. On ZFS, for instance, you can configure the number of copies that should be made. But I'm looking for something more intelligent. I'm calling out to harddisk makers: make as many copies of the existing data as the free space allows. Then, when an error occurs, the disk can just go look at one of the copies and repair the problem area. Problem solved. Simple idea really. Can it be done? I hope so :)
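
On ZFS that is the `copies` property (e.g. copies=2), and ZFS can repair a bad copy from a good one when it notices the error. The drive-level version I'm asking for could behave something like the toy sketch below; to be clear, the class, the mirror map and the read-then-fallback loop are all made up for illustration, not how any real firmware or filesystem works:

```python
# Toy model of "keep extra copies in the free space, heal on read error".
# Purely illustrative: the sector numbers, the mirror map and the
# IOError-based error reporting are invented for this sketch.

class SelfHealingDisk:
    def __init__(self, raw_read, raw_write):
        self.raw_read = raw_read    # callable: sector -> bytes, raises IOError on a URE
        self.raw_write = raw_write  # callable: (sector, data) -> None
        self.mirrors = {}           # primary sector -> list of spare sectors holding copies

    def add_copy(self, sector, spare_sector, data):
        """Use a free spare sector to hold an extra copy of `sector`."""
        self.raw_write(spare_sector, data)
        self.mirrors.setdefault(sector, []).append(spare_sector)

    def read(self, sector):
        try:
            return self.raw_read(sector)
        except IOError:
            # The primary copy is unreadable: try the spares, and if one is
            # still good, rewrite the primary so the error is healed in place.
            for spare in self.mirrors.get(sector, []):
                try:
                    data = self.raw_read(spare)
                except IOError:
                    continue
                self.raw_write(sector, data)
                return data
            raise  # no good copy left: the error really is unrecoverable
```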
