Wednesday, April 5, 2017

The proper way to remove or replace a Ceph OSD

I noticed there are a lot of articles, posts, and how-tos on the internet claiming that removing or replacing a Ceph OSD is as easy as:

# ('#' stands for the OSD number throughout)
ceph osd down osd.#
ceph osd crush remove osd.#
ceph auth del osd.#
ceph osd rm osd.#
# And then on the node where the osd is placed, do this if needed:
systemctl stop ceph-osd@#
umount /var/lib/ceph/osd/ceph-#

But this is only true when the OSD you are removing has completely failed! If the OSD still holds data and you are removing it because you want to enlarge it or are replacing the disk proactively, you risk losing a lot of data. Depending on the replication setting of your pools, the only remaining copy of some data may be on the very disk you are pulling. With a replication count of 2, one copy of the data lives on the OSD you are replacing and the other copy on a different OSD; when Ceph reads that remaining copy to restore the replication count, a single read error is enough to lose data. So if you only have a replication count of 2 (not recommended for production systems) and you lose a disk, or pull it like those internet posts suggest, you risk losing (part of) your data. Ceph may detect that loss through scrubbing, but it cannot repair it because there are no redundant copies left.
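
If you are not sure what replication count your pools use, check before pulling anything. A quick look, using 'rbd' as an example pool name (substitute your own):

ceph osd pool ls
ceph osd pool get rbd size
ceph osd pool get rbd min_size
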
What if you use a replication count of 3? You are still at risk with the often-posted way to remove an OSD. That's because Ceph does not guarantee that data is spread evenly over the OSDs! It 'makes every effort' to do so, but that is not a guarantee. Depending on your CRUSH rules, an OSD may actually be holding two copies of the same data, especially in an imbalanced cluster where not every node has equal weight and the data is not evenly spread (so pretty much every Ceph cluster out there). I think you set a replication count of 3 specifically so you wouldn't have to deal with these kinds of problems.
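
You can see for yourself how (un)evenly your cluster is weighted and filled:

ceph osd tree
# shows the CRUSH tree with per-node and per-OSD weights
ceph osd df
# shows per-OSD utilization and PG counts; large differences mean an imbalanced cluster
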
What if you could 'vacate' the OSD before pulling it? That way Ceph gets the chance to use all replicas, including the one on the OSD itself, when writing the data to a new location. That means far less risk of data loss, because you still have R copies to read from, where R is your replication setting.
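
Once an OSD is marked 'out' (see the procedure below), you can watch Ceph move the data:

ceph -s
# or follow it live:
ceph -w
# the recovery/backfill lines show objects being written to their new locations
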

So here is the right way to remove/replace an OSD that still has usable data on it. Use this in production systems:

ceph osd df | grep '^# '
# note the number of PGs on the OSD (the PGS column)
ceph osd out osd.#
sleep 120
ceph osd df | grep '^# '
# you should see the number of PGs on the OSD decrease.
# At some point you'll hit 0:
[root@XXXXX ~]# ceph osd df | egrep '^15 '
15 2.73000        0      0      0      0  -nan -nan 0
# Now you can continue as before:
ceph osd down osd.#
ceph osd crush remove osd.#
ceph auth del osd.#
ceph osd rm osd.#
# And then on the node where the osd is placed, do this if needed:
systemctl stop ceph-osd@#
umount /var/lib/ceph/osd/ceph-#
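
Instead of sleeping a fixed time and rechecking by hand, a small loop can wait for the drain to finish. A sketch, assuming the PGS count is the last field of the OSD's 'ceph osd df' row (as in the example output above), with 15 as an example OSD id:

OSDID=15   # example id, use your own
ceph osd out osd.${OSDID}
# wait until the OSD holds 0 PGs
while [ "$(ceph osd df | awk -v id=${OSDID} '$1 == id {print $NF}')" != "0" ]; do
    sleep 60
done
echo "osd.${OSDID} is drained, safe to remove"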

And if you don't plan on moving the disk to some other Ceph node, be sure to use 'ceph-disk zap' to clear it before you store or reuse the disk.
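
For example, assuming the OSD's data disk is /dev/sdb (double-check the device name first, this wipes everything on it):

ceph-disk zap /dev/sdb
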

I hope this saves you some headaches.