donderdag 13 juli 2017

fixing linux I/O errors on disks

Currently my boot disk is having a bad block. It's in a bad place, because it's preventing the OS (Fedora Linux 24) from booting: it drops me into rescue mode. Using the fantastic 'www.system-rescue-cd.org' ISO image on a USB, I'm able to boot the machine and have all kinds of repair and analysis tools to help me find and fix the problem.
Since I have about a dozen harddisks in this system, the first step is to identify the disk that's having the problem. I know it's a 1TB disk and I know it's using LVM. That should be easy:

root@sysresccd /root % pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb2
  VG Name               fedora
  PV Size               930.51 GiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              238210
  Free PE               0
  Allocated PE          238210
  PV UUID               2qPOfp-dsag-UEVO-487C-b14d-BMFj-WmjQ2F

root@sysresccd /root %


Indeed it was easy :)
Let's see what smart has to say about this disk, because if it's beyond repair we need a different tactic (ddrescue) than when there is only 1 bad block.

root@sysresccd /root % smartctl -a /dev/sdb
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.9.30-std502-amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WCAV5R882810
LU WWN Device Id: 5 0014ee 205ba3de6
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Jul 10 19:39:47 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (21660) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 249) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       6
  3 Spin_Up_Time            0x0027   214   183   021    Pre-fail  Always       -       5291
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1724
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15818
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1702
192 Power-Off_Retract_Count 0x0032   199   199   000    Old_age   Always       -       1220
193 Load_Cycle_Count        0x0032   147   147   000    Old_age   Always       -       161209
194 Temperature_Celsius     0x0022   111   103   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     15793         1899286208
# 2  Extended offline    Completed without error       00%     15458         -
# 3  Short offline       Completed without error       00%     15455         -
# 4  Short offline       Completed: read failure       40%     15420         2114456
# 5  Short offline       Completed without error       00%     14634         -
# 6  Short offline       Completed without error       00%     14610         -
# 7  Short offline       Completed without error       00%     14422         -
# 8  Short offline       Completed without error       00%     14399         -
# 9  Short offline       Completed without error       00%     14376         -
#10  Short offline       Completed without error       00%     14352         -
#11  Short offline       Completed without error       00%     14328         -
#12  Short offline       Completed without error       00%     14305         -
#13  Short offline       Completed without error       00%     14296         -
#14  Short offline       Completed without error       00%     14276         -
#15  Short offline       Completed without error       00%     14253         -
#16  Short offline       Completed without error       00%     14230         -
#17  Extended offline    Completed without error       00%     14211         -
#18  Short offline       Completed without error       00%     14206         -
#19  Short offline       Completed without error       00%     14182         -
#20  Short offline       Completed without error       00%     14159         -
#21  Short offline       Completed without error       00%     14137         -
1 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@sysresccd /root %

So there is 1 pending sector. That means there is a part of the disk that can't be read properly and is therefore scheduled for re-location. However, it can't be relocated until it's know what data is in that sector. Sound like a chicken-and-egg problem? It is, unless you know a trick.
The disk itself will try to read the sector for a while, and then gives up. It logs a lot of errors in dmesg when this happens, like so:

[14151.587898] ata2.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0
[14151.587903] ata2.00: irq_stat 0x40000008
[14151.587909] ata2.00: failed command: READ FPDMA QUEUED
[14151.587920] ata2.00: cmd 60/08:38:c0:ce:34/00:00:71:00:00/40 tag 7 ncq dma 4096 in
                        res 41/40:00:c0:ce:34/00:00:71:00:00/40 Emask 0x409 (media error) <F>
[14151.587924] ata2.00: status: { DRDY ERR }
[14151.587927] ata2.00: error: { UNC }
[14151.591853] ata2.00: configured for UDMA/133
[14151.591884] sd 1:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[14151.591890] sd 1:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]
[14151.591896] sd 1:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[14151.591903] sd 1:0:0:0: [sdb] tag#7 CDB: Read(10) 28 00 71 34 ce c0 00 00 08 00
[14151.591907] blk_update_request: I/O error, dev sdb, sector 1899286208
[14151.591911] Buffer I/O error on dev sdb, logical block 237410776, async page read
[14151.591965] ata2: EH complete

Over and over again. Here the sector is printed: 1899286208. This is the same as the 'LBA of first error' that the smart health tests found. Lets verify manually that this sector is indeed having problems:
root@sysresccd /root % hdparm --read-sector 1899286208 /dev/sdb

/dev/sdb:
reading sector 1899286208: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 40 51 e0 01 11 04 00 00 a0 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
root@sysresccd /root %

So how do we fix this?
Either we can try to use ddrescue to read and re-read the same part of that disk over and over again. We can tell ddrescue to try to read the data forwards or backwards, or try it in a number of different ways. But in this case that did not help. The only solution to make sure we know what data is there, is by overwriting it. That can have desastrous consequences (although having a system not being able to boot is already pretty bad), unless we can figure out which file(s) are in the bad sector and copy those exact same files onto there. Since this is the boot disk, that should hopefully be pretty straightforward.
If we wouldn't care about getting correct data back in place, we could just overwrite the sector like so:
hdparm --write-sector 1899286208 /dev/sdb
If you have a raid that can self-heal, like zfs, that would be a great way to solve the problem. if you use mdadm, don't do thsis.

The easiest way to force a write is by using 'badblocks -nvs /dev/sdb2' but that will take a long time (it will read and write back the entire partition) and because it needs to read the data before it can write, the chances of success are not great in this case. So instead we could try to figure out which file or files has or have the bad sector. However, because I'm using LVM, I think that it's going to be quite a challenge to convert the physical sector into a logical block number relative to the partition and LVM logical volume.

So instead I opted to just overwrite the sector and run fsck. If it's in a critical area, it will tell me so and hopefully state which file was having the problem. I can then overwrite it with a proper file.
root@sysresccd /root % hdparm --write-sector 1899286208  --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb:
re-writing sector 1899286208: succeeded
root@sysresccd /root % hdparm --read-sector 1899286208 /dev/sdb

/dev/sdb:
reading sector 1899286208: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

Just to be clear, this is extremely dangerous. By 0-ing data, you'll have no idea what sector of the operating system or which file was changed. This could have desastrous consequences, such as wiping out your disks, becoming self-aware and many other possible problems.

Very interesting is that the disk is now acting asif nothing is the matter:
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Apparantly the disk no longer thinks the sector is bad and no longer sees a need to map it. Or it did map it, but it's not counting the event. Other sectors on the disk are also unreadable though, but the disk is not reporting a problem about that:
root@sysresccd /root % hdparm --read-sector 1899286209 /dev/sdb

/dev/sdb:
reading sector 1899286209: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 40 51 e0 01 11 04 00 00 a0 c1 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Anyway, after an fsck, everything should now be corrected. A reboot later and the machine is up again. It's almost like nothing ever happened. Except for that data hole... somewhere...



All of this would not have happened if I would have had a mirror for my boot disk. Since I have (almost) all of my important data on ZFS, the potential data loss would be limited to the OS and configuration. That can all be recovered (just re-install) but it's a lot of work that a simple mirror could have prevented. I have a new project to complete :)

woensdag 5 april 2017

The proper way to remove or replace a CEPH osd

I noticed there are a lot of articles, posts and howto's on the internet that say that removing or replacing a CEPH osd is as easy as:

ceph osd down osd.#
ceph osd crush remove osd.#
ceph auth del osd.#
ceph osd rm osd.#
# And then on the node where the osd is placed, do this if needed:
systemctl stop ceph-osd@#
umount /var/lib/ceph/osd/ceph-#

But this is only true when the osd you are doing this to has completely failed! if the OSD still has data and you are doing this because you wish to enlarge the OSD or are replacing the disk proactively, you risk losing a lot of data. That's because, depending on the replication setting of your pools, the data you need may be on that disk you are removing. And if you have a replication count of 2, where there is one copy of the data on the osd you are replacing and another copy on a different osd, you will risk running into read errors when ceph needs to read the other copy to restore the replication count. Again, if you only have a replication count of 2 (not recommended for production systems) and you lose 1 disk or pull it like the internet posts suggest to do, you risk losing (part of) your data. ceph may be able to detect that data loss through scrubbing, but it can't repair it because there are no more redundant copies.
What if you use a replication count of 3? You are still at risk with the often posted way to remove an osd. That's because ceph doesn't guarantee data is spread evenly over the osd's! It 'makes every effort' to do so, but it doesn't guarantee. That means that your osd may actually be holding 2 copies of the data. Especially when you have an imbalanced cluster where not every node has equal weight and your data is not evenly spread (so pretty much every ceph cluster out there). I think you've set a replication count of 3 to specifically not have to deal with those types of problems.
What if you could 'vacate' the osd before just pulling it? That way ceph has the chance to use all replica's to write to a new location. This way there is less risk of data loss (because you still have R copies to read from, where R is your replication setting).

So here is the right way to remove/replace an OSD that still has usable data on it. Use this in production systems:

ceph osd df | grep '^# '
# note the amount of PG's on the OSD
ceph osd out osd.#
sleep 120
ceph osd df | grep '^# '
# you should see the amount of PG's on the osd decrease.
# At some point you'll hit 0:
[root@XXXXX ~]# ceph osd df | egrep '^15 '
15 2.73000        0      0      0      0  -nan -nan 0
# Now you can continue as before:
ceph osd down osd.#
ceph osd crush remove osd.#
ceph auth del osd.#
ceph osd rm osd.#
# And then on the node where the osd is placed, do this if needed:
systemctl stop ceph-osd@#
umount /var/lib/ceph/osd/ceph-#
`

And if you don't plan on moving the disk to some other ceph node, be sure to use 'ceph-disk --zap' to clear it, before you store/reuse the disk.

I hope this prevents you from having some headaches.