I looked at my ZFS setup and realised that I have two RAID-1 (mirror) pools, each built from two disks of the same model:
server ~ # zpool status
pool: mir-2tb
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mir-2tb ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD ONLINE 0 0 0
disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD ONLINE 0 0 0
errors: No known data errors
pool: mir-2tb2
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mir-2tb2 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894 ONLINE 0 0 0
disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05884 ONLINE 0 0 0
errors: No known data errors
server ~ #
So that's one ZFS mirror of two Hitachi disks and one ZFS mirror of two Samsung disks. That's bad, because if a particular disk model has a problem (such as the firmware problem with Samsung HD204UIs, see here), there is a high chance that both halves of the mirror fail together. ZFS can then detect the problem, but possibly not recover from it.
I quickly bought another Hitachi disk and added it to the Samsung mirror. That allowed me to move one Samsung disk to the Hitachi mirror pool. So now I have:
server ~ # zpool status
pool: mir-2tb
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mir-2tb ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD ONLINE 0 0 0
disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894 ONLINE 0 0 0
disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD ONLINE 0 0 0
errors: No known data errors
pool: mir-2tb2
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mir-2tb2 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D ONLINE 0 0 0
disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05884 ONLINE 0 0 0
errors: No known data errors
server ~ #
So now one pool is a three-way mirror and the other a two-way mirror, which even improves availability a bit.
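The exact commands used for the swap are not shown here. One sequence that would lead from the first status output to the second is sketched below in Python, as a dry run that only prints the `zpool` commands rather than running them. The device names are copied from the status output; the order of the steps and the choice of which existing disk to attach against are assumptions.

```python
# Dry-run sketch: builds and prints zpool commands, it never executes them.
# Device names come from the status output; the step order is an assumption.

NEW_HITACHI = "disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D"
SAMSUNG_STAY = "disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05884"
SAMSUNG_MOVE = "disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894"
HITACHI_OLD = "disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD"


def swap_plan():
    """Return the zpool commands, in order, as argument lists."""
    return [
        # 1. Attach the new Hitachi to the Samsung pool, making it a
        #    temporary three-way mirror.
        ["zpool", "attach", "mir-2tb2", SAMSUNG_STAY, NEW_HITACHI],
        # 2. Only after the resilver has finished (check `zpool status`),
        #    release one Samsung disk from that pool.
        ["zpool", "detach", "mir-2tb2", SAMSUNG_MOVE],
        # 3. Attach the freed Samsung disk to the Hitachi pool.
        ["zpool", "attach", "mir-2tb", HITACHI_OLD, SAMSUNG_MOVE],
    ]


if __name__ == "__main__":
    for argv in swap_plan():
        print(" ".join(argv))
```

Attaching before detaching means neither pool ever runs on a single disk during the swap.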
Great, problem fixed? Not quite! Just making sure the disks in a pool are of mixed species is not enough.
It's not visible in the zpool status output, but the disks of each pool were all on the same SATA controller. I have an onboard SATA controller and a Promise SATA controller. Here is a selective lspci:
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] (rev 40)
05:05.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
So if a controller were to die, corrupt data, or have intermittent problems, the pool could again be lost, because ZFS would have no source with correct data left to repair from. The fix is easy: make sure that the disks in a ZFS pool are spread out over different controllers. That way the chance of problems diminishes again.
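Both conditions from this post, mixed disk models and mixed controllers, can be checked mechanically. The following is a minimal Python sketch, not a tested tool. It assumes a Linux host where `/dev/disk/by-id` and `/sys/block` exist, that the serial number is the last underscore-separated field of the by-id name, and that each pool consists of a single mirror. All function names are made up for this illustration.

```python
import os
import re
import subprocess
from collections import defaultdict

# A PCI address such as 0000:00:11.0 inside a resolved sysfs path.
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def parse_pools(status_text):
    """Map pool name -> list of by-id disk names from `zpool status` text."""
    pools = defaultdict(list)
    current = None
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("pool:"):
            current = line.split(":", 1)[1].strip()
        elif current and line.startswith("disk/by-id/"):
            pools[current].append(line.split()[0].split("/")[-1])
    return dict(pools)


def model_of(disk_id):
    """Drop the trailing serial number (assumed to follow the last '_')."""
    return disk_id.rsplit("_", 1)[0]


def controller_of(sysfs_path):
    """Return the last PCI address in a sysfs device path, or None."""
    hits = PCI_ADDR.findall(sysfs_path)
    return hits[-1] if hits else None


def sysfs_path(disk_id):
    """Resolve a by-id name to its /sys/devices/... path (Linux only)."""
    dev = os.path.realpath(os.path.join("/dev/disk/by-id", disk_id))
    return os.path.realpath(os.path.join("/sys/block", os.path.basename(dev)))


def shared_risks(pools, model_for, controller_for):
    """Return {pool: [reasons]} for pools lacking model or controller diversity."""
    risks = {}
    for pool, disks in pools.items():
        reasons = []
        if len({model_for(d) for d in disks}) == 1:
            reasons.append("single disk model")
        if len({controller_for(d) for d in disks}) == 1:
            reasons.append("single controller")
        if reasons:
            risks[pool] = reasons
    return risks


def check_live_system():
    """Run `zpool status` and print every pool that shares a risk."""
    text = subprocess.run(["zpool", "status"], capture_output=True,
                          text=True, check=True).stdout
    risks = shared_risks(parse_pools(text), model_of,
                         lambda d: controller_of(sysfs_path(d)))
    for pool, reasons in risks.items():
        print(pool, "->", ", ".join(reasons))
```

Calling `check_live_system()` as root on the server would list any pool whose disks all share a model or a controller. The PCI addresses it compares are the same ones `lspci` prints in its first column.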
Too paranoid? Not really. Before I replaced my server a couple of months ago, I was using an onboard SATA controller that was corrupting the data transferred to the disks. I never knew until I switched to ZFS.
Don't blow your bits. Spread them around.