Sunday, 24 April 2016

Why I moved away from BackupPC for large data sets

In a previous post (http://shootingbits.blogspot.nl/2014/07/migrating-backuppc-from-ext4-to.html) I explained how our BackupPC storage was running out of space and giving us problems. We decided to switch to GlusterFS so we could scale more easily.
We did all that, and it worked. Or so it seemed. GlusterFS is a filesystem on top of a filesystem: an additional layer over the normal filesystem that decides where to place files (whole files, not parts of them), can rebalance that placement, and checks for consistency. But because you (and therefore also your scripts, the OS, etc.) still have access to the underlying filesystem, things can change behind GlusterFS's back. That means the consistency of the GlusterFS volume can't be guaranteed at any point in time. As long as you're very careful to always access the storage through Gluster, you should be fine. Even though we did that, we still ended up with some weird situations:
- Having BackupPC create hard links on Gluster did not work properly. I was able to work around it by changing the way BackupPC creates hard links; the Perl call BackupPC uses seemed to be incompatible with GlusterFS at the time (a sketch of that kind of fallback is shown after this list). I filed a bug report with Gluster (https://bugzilla.redhat.com/show_bug.cgi?id=1156022), but as you'll read below it's no longer relevant for us.
- After deleting data on GlusterFS, no space was freed. Even when the last file of a set of hard links was deleted, the space did not become available. That meant we would have needed to keep adding storage, forever.
- Moving to GlusterFS was hell. Not because of Gluster, but because of the way BackupPC stores data: every file is stored as a file, and if a file with the same content already exists, BackupPC creates a hard link to it. In theory this is very efficient, especially for large files. But since we also have tens of millions of small files, transferring the data from our old storage to our new storage was heavily IOPS-bound: each file or hard link essentially costs at least one I/O operation. With tens of millions of files in each backup run, and many backups, that adds up to hundreds of millions of files or links. With the data being read from a 12-disk RAID-5 or RAID-6 set of 7200 rpm disks (which does about 500 IOPS max for random I/O), it took weeks to transfer the data. Even though it was only about 15TB, the fact that each small file was a separate file or hard link made the transfer very slow (the back-of-envelope math is after this list).
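
About the hard-link issue: the real fix lived inside BackupPC's Perl code, so the sketch below is only a hypothetical illustration of the shape of the workaround, not the actual patch. The idea is to try the native hard-link call first and fall back to the external ln binary when it fails:

    import os
    import subprocess

    def make_hardlink(src: str, dst: str) -> None:
        """Create dst as a hard link to src, with a fallback for picky filesystems."""
        try:
            # The normal route: a plain link() syscall.
            os.link(src, dst)
        except OSError:
            # Fallback: let the external ln binary do it, which behaved
            # differently for us on GlusterFS at the time.
            subprocess.run(["ln", src, dst], check=True)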
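
And to show why the migration was IOPS-bound rather than bandwidth-bound, here is the back-of-envelope version of the numbers above. The file count is an assumed order of magnitude, not a measurement:

    # Every file or hard link costs at least one random I/O on the source array.
    files = 200 * 10**6      # assumed order of magnitude, not a measured count
    iops = 500               # rough random-IOPS ceiling of a 12-disk 7200 rpm array
    seconds = files / iops
    print(seconds / 86400)   # roughly 4.6 days, and that is the optimistic floor

In practice every file also costs metadata lookups on the source and writes on the destination, so the real transfer took several times that: weeks, as mentioned above.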

I wanted to replace BackupPC with a block-based backup solution that stores backups not as one file per file but in archives. We ended up with Bareos (www.bareos.com) and it is working well for us. Of course it has its own quirks and limitations, but they are orders of magnitude less bad than BackupPC was for us. Mind you, BackupPC 3 is still a great tool and may work fine for your situation if you don't throw millions of files at it.

And we decided to ditch GlusterFS and moved to Ceph instead. Ceph is a scalable, distributed storage solution that can be used for OpenStack but runs fine on its own. It is in active development, and its main development company, Inktank, was acquired by Red Hat, which also owns GlusterFS. My suspicion is that Ceph will replace GlusterFS. It works much better for us, although we had big performance problems running RADOS block devices with version 0.94 (Hammer) on Btrfs OSDs. Once we replaced Btrfs with XFS it became much more stable (OSDs were no longer dropping out as often), and once we moved to 9.2.0 (the next release, Infernalis) things became very stable.
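
For a bit of context on what 'RADOS block devices' are: they are plain block images (RBD) stored in a Ceph pool, which can be created and then mapped as a regular block device on a host. Here's a minimal sketch using the official python-rados/python-rbd bindings; the pool and image names are made up for the example:

    import rados
    import rbd

    # Connect to the cluster with the standard config and keyring.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        # Hypothetical pool name.
        ioctx = cluster.open_ioctx("backup-pool")
        try:
            # Create a 1 TiB image; it can later be mapped as a block device
            # on a host (e.g. "rbd map backup-pool/bareos-vol").
            rbd.RBD().create(ioctx, "bareos-vol", 1024**4)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()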
We're not getting 10 Gbit/s throughput even though the infrastructure is there, the network is tuned for it, and the disks and CPUs are not maxed out, but it's good enough for our needs. We'd like to achieve better throughput, but right now it's not worth the effort to dig in deep enough to figure out why we're not getting it. I suspect Ceph is designed for many parallel readers and writers and therefore not optimised for a single client wanting to pull 10 Gbit/s out of it. We'll figure it out in the future.

And because we're now running Bareos as our main backup tool, which stores all the files in archives, syncing to our offsite backup location has become much easier. We no longer have to transfer hundreds of millions of files there either, only a number of 'large' archive files. That lets us achieve much higher throughput with much lower IOPS requirements on that side as well.
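
As a hypothetical illustration of how simple that has become (paths and hostname are made up), the offsite sync is now essentially just copying a directory of big volume files, which a plain rsync handles well:

    import subprocess

    # Hypothetical locations: the local Bareos volume/archive directory and
    # the offsite target. The point is that this is a handful of large files,
    # not hundreds of millions of small ones.
    SRC = "/var/lib/bareos/storage/"
    DST = "offsite-backup:/srv/bareos-archive/"

    # -a preserves attributes, --partial resumes interrupted transfers.
    subprocess.run(["rsync", "-a", "--partial", SRC, DST], check=True)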

It's been a loooong journey to get here, with many lessons learned. But we now have a stable backup solution, hopefully for years to come.