Shooting bits: ZFS data corruption and CRC checksum errors

It's been a while since I took some time to analyze the following problem:

server ~ # zpool status -v
pool: mir-2tb
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
   entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
scrub: scrub completed after 9h26m with 1 errors on Mon Sep 26 09:01:29 2011
config:

   NAME                                                       STATE     READ WRITE CKSUM
   mir-2tb                                                    ONLINE       0     0     1
      mirror-0                                                 ONLINE       0     0     2
        disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD ONLINE       0     0     2
        disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894          ONLINE       0     0     2
        disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /mnt/origin/mir-2tb/diskfailure/sdb1.image.dd

So there is a problem here. ZFS is saying that the same block of data cannot be read correctly from any of the 3 disks, and that the file sdb1.image.dd has lost data as a result.

Every week on Monday the checksum error counters go up, because I run a scrub every Monday morning. Once the scrub is done, the counter is 2 checksum errors higher for each disk and 1 higher for the pool. The increase is the same every week, so it's not as if more and more sections of data are becoming unreadable.
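To check that the counters really grow by the same amount after each scrub, the CKSUM column can be pulled out of the `zpool status` text and compared week to week. This is only a rough sketch: the function names `parse_cksum_counters` and `cksum_delta` are my own, it assumes plain integer counters, and the "this week" numbers are a made-up follow-up that simply applies the +2 per disk / +1 per pool pattern described above.

```python
# Sketch: read the CKSUM column from `zpool status` output and diff two snapshots.
# Assumption: counters are plain integers (abbreviated counts such as "1.2K" are skipped).

STATUS_TEXT = """\
config:

   NAME                                                       STATE     READ WRITE CKSUM
   mir-2tb                                                    ONLINE       0     0     1
      mirror-0                                                 ONLINE       0     0     2
        disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD ONLINE       0     0     2
        disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894          ONLINE       0     0     2
        disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:
"""


def parse_cksum_counters(status_text):
    """Return {name: cksum_count} for every row of the config table."""
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True          # header row found, data rows follow
            continue
        if not in_table:
            continue
        if not fields:
            if counters:
                break                # blank line after the rows ends the table
            continue
        # A data row looks like: NAME STATE READ WRITE CKSUM
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counters[fields[0]] = int(fields[4])
    return counters


def cksum_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


last_week = parse_cksum_counters(STATUS_TEXT)
# Hypothetical next snapshot: +1 on the pool row, +2 on every other row.
this_week = {name: n + (1 if name == "mir-2tb" else 2) for name, n in last_week.items()}

for name, grown in cksum_delta(last_week, this_week).items():
    print(f"{name}: +{grown}")
```

If the deltas stay constant from scrub to scrub, the same damaged block is being hit again each time rather than new damage appearing.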

How did this happen? How small is the chance that 3 drives are unreadable for the same specific block of data? It's infinitesimally small, so that's not what's going on here. I see 2 possible explanations:
- The ZFS pool was initially created with one source disk that apparently had 2 bad sectors or bad blocks. Data was copied to that pool before there was a 2nd disk to mirror the data. Once the data was copied and the source disk was empty, the source disk was added to the pool as a mirror. But because of the bad blocks/sectors, 2 blocks could not be copied to the other disk, so those 2 sectors also contain invalid data. Later a 3rd disk was added, but it has the same problem as the 2nd disk. Because the bad sectors cannot be read, this will not resolve itself. The data is lost.
- A very specific pattern of data triggers a bug in ZFS, so that the data cannot be stored or the checksum cannot be calculated correctly. The data is there, but the checksum fails because of this bug. No data is actually lost.
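The first explanation can be illustrated with a toy model. This is not how ZFS is implemented, just a sketch of the reasoning: a block that was already damaged before any mirror existed has no good copy to resilver from, so every disk added later fails verification on that same block. The `Disk` class and the `resilver` and `scrub` functions are hypothetical stand-ins, and SHA-256 is used only as a convenient checksum for the demo.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in checksum for the demo; the real pool may use a different algorithm.
    return hashlib.sha256(data).hexdigest()


class Disk:
    """A toy disk: a list of fixed-size blocks, initially all zeroes."""

    def __init__(self, nblocks):
        self.blocks = [b"\0" * 16 for _ in range(nblocks)]


def resilver(sources, target, expected):
    """Copy each block from the first source that still verifies."""
    for i, want in enumerate(expected):
        for src in sources:
            if checksum(src.blocks[i]) == want:
                target.blocks[i] = src.blocks[i]
                break
        # If no source verifies, the target keeps whatever it had.


def scrub(disks, expected):
    """Return, per disk, the block indices that fail verification."""
    return [
        [i for i, want in enumerate(expected) if checksum(d.blocks[i]) != want]
        for d in disks
    ]


# Data is written while the pool has a single disk.
payload = [b"block-%d" % i for i in range(4)]
expected = [checksum(b) for b in payload]
first = Disk(4)
first.blocks = list(payload)

# One block goes bad before a mirror exists.
first.blocks[2] = b"garbage"

# A 2nd and later a 3rd disk are attached and resilvered.
second, third = Disk(4), Disk(4)
resilver([first], second, expected)
resilver([first, second], third, expected)

failures = scrub([first, second, third], expected)
print(failures)
```

In this model every disk reports the same failing block index, which matches the picture of identical counters on all three drives.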

I can't remember whether I started this pool with 1 disk and added more later. I no longer have the source file, so I cannot overwrite the data with it. The only option I have is to find the broken block and force a write to it. That forces the disk to overwrite the data, making the sector readable again, or, if the write fails, to reallocate the sector.
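A first step towards finding the broken block is to find out where in the file the damage sits. The sketch below reads the file chunk by chunk and records the offsets where the read fails with an I/O error, which is what I would expect a read of a block with a failed checksum to return. The function name `find_unreadable_ranges` is my own, and the 128 KiB chunk size is an assumption (the usual default record size), so adjust it to the dataset. It only locates the range inside the file; mapping that to sectors on the disks is a separate step.

```python
import os


def find_unreadable_ranges(path, chunk_size=128 * 1024):
    """Return a list of (offset, length, errno) for chunks that cannot be read."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # pread reads at an explicit offset, so a failed chunk
                # does not disturb the position of the next read.
                os.pread(fd, chunk_size, offset)
            except OSError as exc:
                bad.append((offset, min(chunk_size, size - offset), exc.errno))
            offset += chunk_size
    finally:
        os.close(fd)
    return bad


# Example call (path taken from the zpool status output above):
# for offset, length, err in find_unreadable_ranges(
#         "/mnt/origin/mir-2tb/diskfailure/sdb1.image.dd"):
#     print(f"offset {offset}: {length} bytes unreadable (errno {err})")
```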

However, the disks are reporting no problems at all regarding pending sector reallocations. The only thing I notice is:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   055   055   000    Old_age   Always       -       19396
  3 Spin_Up_Time            0x0023   068   066   025    Pre-fail  Always       -       9994
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5924
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       70
181 Program_Fail_Cnt_Total  0x0022   098   098   000    Old_age   Always       -       48738283
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   052   000    Old_age   Always       -       31 (Min/Max 14/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       86
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       70

And that's not very helpful either. One counter (Program_Fail_Cnt_Total) is huge, but some googling shows that many users of this disk report the same thing and it's not a problem for any of them, so that doesn't worry me.
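Instead of eyeballing the whole table, the handful of attributes that usually point at bad sectors or cabling can be filtered out. A minimal sketch, assuming the attribute table layout shown above; the `WATCHED` selection (reallocated, pending and uncorrectable sectors plus UDMA CRC errors) is my own choice, not an official list, and the function names are made up for the example.

```python
# Attribute IDs I would look at first for media or cabling problems (my own selection).
WATCHED = {5, 196, 197, 198, 199}

SMART_TEXT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0022   098   098   000    Old_age   Always       -       48738283
194 Temperature_Celsius     0x0002   064   052   000    Old_age   Always       -       31 (Min/Max 14/48)
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Return {id: (name, raw_value_string)} from a SMART attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is a non-zero integer."""
    flagged = {}
    for attr_id, (name, raw) in attrs.items():
        if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
            flagged[attr_id] = (name, int(raw))
    return flagged


attrs = parse_smart_attributes(SMART_TEXT)
print(nonzero_watched(attrs) or "no watched attribute is non-zero")
```

For the values in this post, none of the watched attributes is non-zero, which is exactly why the SMART data doesn't explain the checksum errors.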

I wish ZFS had more tools to make this analysis easier. Why does the checksum error occur? Which block exactly has the problem?

So I'm at a loss here... there is a checksum error, but the disks don't report a problem with the data on them. Is the problem therefore with ZFS, or with one of the disks, having propagated (via ZFS) to the other disks? I would be very interested to hear anyone's feedback on this.

Wednesday, 28 September 2011
