Shooting bits: juli 2017

Currently my boot disk is having a bad block. It's in a bad place, because it's preventing the OS (Fedora Linux 24) from booting: it drops me into rescue mode. Using the fantastic 'www.system-rescue-cd.org' ISO image on a USB, I'm able to boot the machine and have all kinds of repair and analysis tools to help me find and fix the problem.
Since I have about a dozen harddisks in this system, the first step is to identify the disk that's having the problem. I know it's a 1TB disk and I know it's using LVM. That should be easy:

root@sysresccd /root % pvdisplay
--- Physical volume ---
PV Name /dev/sdb2
VG Name fedora
PV Size 930.51 GiB / not usable 4.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 238210
Free PE 0
Allocated PE 238210
PV UUID 2qPOfp-dsag-UEVO-487C-b14d-BMFj-WmjQ2F

root@sysresccd /root %

Indeed it was easy :)

Let's see what smart has to say about this disk, because if it's beyond repair we need a different tactic (ddrescue) than when there is only 1 bad block.

root@sysresccd /root % smartctl -a /dev/sdb

smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.9.30-std502-amd64] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Western Digital Caviar Green (AF)

Device Model: WDC WD10EARS-00Y5B1

Serial Number: WD-WCAV5R882810

LU WWN Device Id: 5 0014ee 205ba3de6

Firmware Version: 80.00A80

User Capacity: 1,000,204,886,016 bytes [1.00 TB]

Sector Size: 512 bytes logical/physical

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS (minor revision not indicated)

SATA Version is: SATA 2.6, 3.0 Gb/s

Local Time is: Mon Jul 10 19:39:47 2017 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 121) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: (21660) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 249) minutes.

Conveyance self-test routine

recommended polling time: ( 5) minutes.

SCT capabilities: (0x3035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 6

3 Spin_Up_Time 0x0027 214 183 021 Pre-fail Always - 5291

4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1724

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0

9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15818

10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0

11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0

12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1702

192 Power-Off_Retract_Count 0x0032 199 199 000 Old_age Always - 1220

193 Load_Cycle_Count 0x0032 147 147 000 Old_age Always - 161209

194 Temperature_Celsius 0x0022 111 103 000 Old_age Always - 36

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed: read failure 90% 15793 1899286208

# 2 Extended offline Completed without error 00% 15458 -

# 3 Short offline Completed without error 00% 15455 -

# 4 Short offline Completed: read failure 40% 15420 2114456

# 5 Short offline Completed without error 00% 14634 -

# 6 Short offline Completed without error 00% 14610 -

# 7 Short offline Completed without error 00% 14422 -

# 8 Short offline Completed without error 00% 14399 -

# 9 Short offline Completed without error 00% 14376 -

#10 Short offline Completed without error 00% 14352 -

#11 Short offline Completed without error 00% 14328 -

#12 Short offline Completed without error 00% 14305 -

#13 Short offline Completed without error 00% 14296 -

#14 Short offline Completed without error 00% 14276 -

#15 Short offline Completed without error 00% 14253 -

#16 Short offline Completed without error 00% 14230 -

#17 Extended offline Completed without error 00% 14211 -

#18 Short offline Completed without error 00% 14206 -

#19 Short offline Completed without error 00% 14182 -

#20 Short offline Completed without error 00% 14159 -

#21 Short offline Completed without error 00% 14137 -

1 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

root@sysresccd /root %

So there is 1 pending sector. That means there is a part of the disk that can't be read properly and is therefore scheduled for re-location. However, it can't be relocated until it's know what data is in that sector. Sound like a chicken-and-egg problem? It is, unless you know a trick.

The disk itself will try to read the sector for a while, and then gives up. It logs a lot of errors in dmesg when this happens, like so:

[14151.587898] ata2.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0

[14151.587903] ata2.00: irq_stat 0x40000008

[14151.587909] ata2.00: failed command: READ FPDMA QUEUED

[14151.587920] ata2.00: cmd 60/08:38:c0:ce:34/00:00:71:00:00/40 tag 7 ncq dma 4096 in

res 41/40:00:c0:ce:34/00:00:71:00:00/40 Emask 0x409 (media error) <F>

[14151.587924] ata2.00: status: { DRDY ERR }

[14151.587927] ata2.00: error: { UNC }

[14151.591853] ata2.00: configured for UDMA/133

[14151.591884] sd 1:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

[14151.591890] sd 1:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]

[14151.591896] sd 1:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed

[14151.591903] sd 1:0:0:0: [sdb] tag#7 CDB: Read(10) 28 00 71 34 ce c0 00 00 08 00

[14151.591907] blk_update_request: I/O error, dev sdb, sector 1899286208

[14151.591911] Buffer I/O error on dev sdb, logical block 237410776, async page read

[14151.591965] ata2: EH complete

Over and over again. Here the sector is printed: 1899286208. This is the same as the 'LBA of first error' that the smart health tests found. Lets verify manually that this sector is indeed having problems:
root@sysresccd /root % hdparm --read-sector 1899286208 /dev/sdb

/dev/sdb:
reading sector 1899286208: SG_IO: bad/missing sense data, sb[]: 70 00 03 00 00 00 00 0a 40 51 e0 01 11 04 00 00 a0 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
root@sysresccd /root %

So how do we fix this?

Either we can try to use ddrescue to read and re-read the same part of that disk over and over again. We can tell ddrescue to try to read the data forwards or backwards, or try it in a number of different ways. But in this case that did not help. The only solution to make sure we know what data is there, is by overwriting it. That can have desastrous consequences (although having a system not being able to boot is already pretty bad), unless we can figure out which file(s) are in the bad sector and copy those exact same files onto there. Since this is the boot disk, that should hopefully be pretty straightforward.
If we wouldn't care about getting correct data back in place, we could just overwrite the sector like so:
hdparm --write-sector 1899286208 /dev/sdb
If you have a raid that can self-heal, like zfs, that would be a great way to solve the problem. if you use mdadm, don't do thsis.

The easiest way to force a write is by using 'badblocks -nvs /dev/sdb2' but that will take a long time (it will read and write back the entire partition) and because it needs to read the data before it can write, the chances of success are not great in this case. So instead we could try to figure out which file or files has or have the bad sector. However, because I'm using LVM, I think that it's going to be quite a challenge to convert the physical sector into a logical block number relative to the partition and LVM logical volume.

So instead I opted to just overwrite the sector and run fsck. If it's in a critical area, it will tell me so and hopefully state which file was having the problem. I can then overwrite it with a proper file.
root@sysresccd /root % hdparm --write-sector 1899286208 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb:
re-writing sector 1899286208: succeeded
root@sysresccd /root % hdparm --read-sector 1899286208 /dev/sdb

/dev/sdb:
reading sector 1899286208: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

Just to be clear, this is extremely dangerous. By 0-ing data, you'll have no idea what sector of the operating system or which file was changed. This could have desastrous consequences, such as wiping out your disks, becoming self-aware and many other possible problems.

Very interesting is that the disk is now acting asif nothing is the matter:

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

Apparantly the disk no longer thinks the sector is bad and no longer sees a need to map it. Or it did map it, but it's not counting the event. Other sectors on the disk are also unreadable though, but the disk is not reporting a problem about that:

root@sysresccd /root % hdparm --read-sector 1899286209 /dev/sdb

/dev/sdb:

reading sector 1899286209: SG_IO: bad/missing sense data, sb[]: 70 00 03 00 00 00 00 0a 40 51 e0 01 11 04 00 00 a0 c1 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Anyway, after an fsck, everything should now be corrected. A reboot later and the machine is up again. It's almost like nothing ever happened. Except for that data hole... somewhere...

All of this would not have happened if I would have had a mirror for my boot disk. Since I have (almost) all of my important data on ZFS, the potential data loss would be limited to the OS and configuration. That can all be recovered (just re-install) but it's a lot of work that a simple mirror could have prevented. I have a new project to complete :)

Shooting bits

donderdag 13 juli 2017

fixing linux I/O errors on disks