Shooting bits

vrijdag 29 juli 2022

dell H755N raid controller's NVMe performance

NVMe drives are very fast compared to SSDs using the SAS interface. Especially the newest PCIe 4 generation. They also have a drawback: They are new(ish), and that means that not all the features that we have come to depend on are always there. One such feature is hardware raid support. It makes a bit of sense, since the NVMe drives connect 'directly' to the PCIe bus. I write it with quotes because often there is some kind of switch between the NVMe drives and the motherboard. That way the limited amount of available PCIe busses won't be eaten up by all the drives.

For example, the Intel 6334 has 64 PCIe lanes. In the case of the intel P5600 NVMe 4 lanes are combined into a bus. Theoretically that cpu could therefore support 64/4=16 NVMe's. But many server chassis can contain 24 NVMe drives. And some PCIe lanes are needed for the network and other communications. So it's not possible to give each NVMe drive it's own dedicated lanes. Hence some switching matrix is needed to share the lanes between several NVMe's.

By now, some hardware vendors have implemented NVMe support on their RAID controllers. One of them is the Dell H755N (actually a Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx). All communication between the NVMe's and the motherboard/cpu flows through the raid-controller. This is similar to how an NVMe PCIe switch works, but with additional features.

Such features can slow down the performance of the storage devices. For example when using ZFS, which is a software storage implementation that works better than hardware RAID does, the best way to reach the harddisks or SSDs is by using a HBA and not a RAID controller. Many RAID controllers implement a 'passthrough' mode though, which should theoretically enable the OS to reach the disk without any overhead.

I wanted to figure out what that's like with the dell H755N controller.

I am comparing 2 different systems. They are similar, but not exactly the same:

- Dell R650 with H755N (8GB nvram, firmware 52.16.1-4074) and 2 intel P5600 1.6TB NVMe's.

- Dell R640 with H750 (8GB nvram, firmware 52.16.1-4074) and 2 intel P5600 1.6TB NVMe's.

The only proper way to benchmark a storage device and be able to compare the results to others is to use fio. There is a wrapper around fio that makes it easier to run the tool many times with different parameters: ezfio.

I've run ezfio in 3 different configurations:

- on the r640 with the H750, where the NVMe does not go through the raid controller but directly to the PCIe bus. This is the benchmark to compare to.

- on the r650 with the H755n while the RAID controller is configured in passthrough mode for the NVMe

- on the r650 with the H755n in raid-1 configuration.

Here are the results:

r640 + H750 passthrough:

r650 + H755n passthough:

r650 + H755n raid-1:

TLDR?

Using the H755N raid controller for NVMe's is slower than a non-raid controller in passthrough mode. The difference is especially noticeable (50-100% performance drop) with small blocksizes (anything under 32KB) and small amounts of threads. There is also a difference with larger blocksizes and larger amounts amounts of threads, but less so. That last difference could possibly be explained by the different generation of CPU being used and the small turbo frequency difference. Using the raid-1 mode over 2 NVMe's cuts the performance down by 70-80%.

donderdag 21 maart 2019

The benefits of using a 25gbit switch with 10gbit network cards

At the company I currently work for we are looking into improving our ceph performance. The latency of the storage translates directly into iops and can be captured using this formula: iops = 1/latency (in seconds)
For example with a latency of 4ms, a client would achieve 1/0.002=250 iops
Getting the latency down by 2ms to 2ms will inclease the performance to 500 iops. So only 2ms faster latency doubles the performance! Getting it down 1 more ms brings the performance to 1000. Ofcourse it gets harder and harder to lower latency the closer one gets to 0ms (which would be infinitely fast).

The total latency experienced by a ceph client can be approximated by: software stack latency (ceph code) + kernel latency + network latency + storage latency (including the HBA if you are using one). Each of those can be improved in their own way. In this article we are looking into network latency.

We are already using 10gb switches with 10gb cards. While testing with a vendor I noticed that their usage of a 25gb switch improved the latency a lot. I arranged to borrow a 25gb switch and compared the latency using each switch with different network cards. The following networkcards were selected for the tests:
- curvature Intel 82599 (2S010AC01-CURV)
- QLogic 57810 Dual Port 10Gb Direct Attach/SFP+
- Mellanox ConnectX-3 Pro Dual Port 10 GbE SFP+
- SolarFlare 8522 10Gb 2 Port SFP+
- Lenovo LOM Intel X722 base-T

2 of each of those cards combined with 2 switches gives 30 possible combinations to test with. That seems doable. But during the latency tests I found that jitter (variation in latency) can be quite an important factor, so tests need to run for a while. In addition, performance differs based on the ethernet frame size that is being transmitted. A test of 10 minutes per frame size and 10 frame sizes already takes 1hour and 40 minutes. And we need at least 3 runs of those per setup to be sure the variance is known. So I cut down the number of tests a bit. This is what will be tested
switch - 1st server nic - 2nd server nic
10g - curvature - curvature
25g - curvature - curvature
10g - QLogic - curvature
25g - QLogic - curvature
10g - Mellanox - curvature
25g - Mellanox - curvature
10g - SolarFlare - curvature
25g - SolarFlare - curvature
10g - Mellanox - Mellanox
25g - Mellanox - Mellanox
10g - X722 - X722

Doing these tests resulted in over 5000 datapoints. For the tests I first needed to decide which performance testing tool to use. ping isn't up to the job because ping uses UDP by default and that is a different type of traffic than what we'd actually be using (TCP) and because it's stateless that may impact the measurements. I couldn't get TCP ping to work.
iPerf is then the obvious choice but version 3 of iPerf no longer has the nice jitter output that iPerf2 has. iPerf2 turned out to show a very large variance in the results when run repeatedly. This would mean the run times needed to be much longer than 10 minutes to achieve meaningful results.
Instead I choose qperf, which is also used to test RDMA connections. It's easy to use and provides reasonably consistent results.

For example:

root@pve-04:~# stdbuf -oL qperf -t 60 -to 10 --use_bits_per_sec -vv -uu -un x.x.x.x tcp_lat

tcp_lat:

latency = 11711 ns

msg_rate = 85393 /sec

msg_size = 1 bytes

time = 60000000000 ns

timeout = 10000000000 ns

loc_cpus_used = 19.2 % cpus

loc_cpus_user = 1.98 % cpus

loc_cpus_intr = 0.03 % cpus

loc_cpus_kernel = 17.2 % cpus

loc_cpus_iowait = 0.02 % cpus

loc_real_time = 60000000000 ns

loc_cpu_time = 11510000229 ns

loc_send_bytes = 2561784 bytes

loc_recv_bytes = 2561783 bytes

loc_send_msgs = 2561784

loc_recv_msgs = 2561783

rem_cpus_used = 33.6 % cpus

rem_cpus_user = 1.18 % cpus

rem_cpus_intr = 0.05 % cpus

rem_cpus_kernel = 32.4 % cpus

rem_real_time = 60000000000 ns

rem_cpu_time = 20170000076 ns

rem_send_bytes = 2561784 bytes

rem_recv_bytes = 2561784 bytes

rem_send_msgs = 2561784

rem_recv_msgs = 2561784

root@pve-04:~#

This shows us the latency at 1 byte framesize is 11711 nanoseconds, which is 11 µs, equal to 0.011ms
When we look at our ceph latency, we are now between 0.8ms (for a cluster with 3 NVMe nodes) and 2ms (for slower sata SSDs). Most of our ceph traffic is 9k mtu. The difference between the fastest and slowest network configuration was 72712 ns vs 38444 ns, so about 34268ns faster. With a 3 cluster node that results in double the difference because the master osd needs to communicate the changes to the slave osds. So we gain a latency benefit of 68.5µs per 9k frame. On the total speed of 0.8ms (=800µs) that is almost 10%. 0.8ms gives us 1/0.0008=1250iops but with the faster nic that would be 1/0.00073=1370iops. For the slower SSDs that would instead be 1/0.002=500 iops going to 1/0.001930=518 iops. The benefit for slower storage is relatively smaller but still there. Even for slow storage the benefit is still about 4%.

For other framesizes the differences can be larger (up to 3x faster instead of 2x) or smaller.

Please keep in mind that these are synthetic benchmarks. Only actual ceph benchmarks will tell the true difference. I didn't have the machines available to test that though.

maandag 5 februari 2018

Have your users monitor their own kubernetes services in a multi-tenant setup

If you are running a kubernetes cluster (I'll call it k8s from now on), or if you are a user of a k8s environment, you're probably doing that for a couple of reasons. If one of those reasons is because you run some webservice that your company or livelihood depends on, then you want to be damn sure that this service is up and running. You can't trust k8s to fix all problems for you, because when your DC goes down, what can k8s do for you? Despite the magic that k8s is, you still need monitoring. And with that, alerting.
Black-box alerting won't change much when you switch from whatever you were using before you used k8s. You'll still want to do synthetic tests from multiple locations, checking for response codes, page content and response times. Perhaps you've even got some simulated user stories that get executed by your external monitoring solution. If you don't, it may be a good idea. Because if the login page loads, that doesn't mean the login is actually working ;)
But white-box monitoring, where you look at the components within k8s, is where you can save your ass. When you get a black-box alert, that is an alert that tells you something bad was visible externally, there were probably signals up front from the internals of the system that could have warned you. Wouldn't it be nice to have those signals available to you, so you could have fixed whatever problem there was before customers were affected?

To monitor k8s internals and the services running on top, there is one 'de-facto' opensource solution available, which is the combination of prometheus and grafana. The reason for this is that monitoring modern and dynamic systems such as k8s is not easy to do using old-world tools such as nagios, zabbix, etc. That is because they don't have the concept of moving targets ("where is that container running this time?") included in their design and retrofitting them to do so is either hard, complex or user unfriendly to work with. The old-style monitoring tools usually don't support high frequency monitoring either (up to 1x per minute at most).

If you run a k8s cluster, you may deploy prometheus and grafana. You'll get the insights you'll need (up to a point, this stuff is in rapid development and it's not a deploy and forget solution). But what about your users? If you run some cluster, chances are that you are not the only user. Perhaps you have external customers that you are running this cluster for and they are paying you to do so. Or perhaps you run this for one or more development teams within your company. Whatever the reason you are running k8s, your users are probably the ones deploying applications. And when something breaks in those namespaces where those applications are running, who should be the one waking up?You, because you are the guy who set up k8s? Or should it really be the person or the team that deployed the software/service into that namespace? Who knows most about that application, it's needs, it's design and it's problems? Exactly.
So then, if the users should be the ones waking up, they should also be the ones to monitor their own namespaces and everything in it. This may be hard in the beginning. You're the k8s expect after all, they are 'just coders' and don't know anything about it. Or they may tell you some other excuse. Or maybe they'll grab the chance with both hands to do these things themselves and take ownership. However it goes, they need their own monitoring implemented. Their own prometheus and grafana.

Yet one thing that is not written about much, and indeed is not yet that easy to set up, is how to handle the situation where you have a multi-tenant prometheus/grafana setup on top of k8s. Almost all documentation assumes having only 1 prometheus and 1 grafana to monitor the entire cluster. There are some important reasons to split this up:
- self-reliance: You should not be involved when applications are deployed, so you should also not be involved when monitoring and alerting for that application is implemented. You could be an advisor, but you should not be the bottleneck.
- differences in needs: Each user or group of users has different needs. Some of those may conflict with each other. If the monitoring is separated, these needs can co-exist.
- performance: When you have 1 prometheus, gathering and storing all that data can become quite resource intensive. By splitting all of this up into smaller pieces, the whole scales better.

When using helm to deploy prometheus, it's possible to specify a destination namespace to which prometheus should be deployed. Development teams could install helm (and tiller) themselves, and then deploy prometheus and grafana using that helm installation. Doing this, helm, prometheus and grafana only have the permissions that you granted them within that namespace, nothing more. So if you've set up RBAC correctly (and did not give the users cluster-admin access), you'll be fine.

donderdag 13 juli 2017

fixing linux I/O errors on disks

Currently my boot disk is having a bad block. It's in a bad place, because it's preventing the OS (Fedora Linux 24) from booting: it drops me into rescue mode. Using the fantastic 'www.system-rescue-cd.org' ISO image on a USB, I'm able to boot the machine and have all kinds of repair and analysis tools to help me find and fix the problem.
Since I have about a dozen harddisks in this system, the first step is to identify the disk that's having the problem. I know it's a 1TB disk and I know it's using LVM. That should be easy:

root@sysresccd /root % pvdisplay
--- Physical volume ---
PV Name /dev/sdb2
VG Name fedora
PV Size 930.51 GiB / not usable 4.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 238210
Free PE 0
Allocated PE 238210
PV UUID 2qPOfp-dsag-UEVO-487C-b14d-BMFj-WmjQ2F

root@sysresccd /root %

Indeed it was easy :)

Let's see what smart has to say about this disk, because if it's beyond repair we need a different tactic (ddrescue) than when there is only 1 bad block.

root@sysresccd /root % smartctl -a /dev/sdb

smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.9.30-std502-amd64] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Western Digital Caviar Green (AF)

Device Model: WDC WD10EARS-00Y5B1

Serial Number: WD-WCAV5R882810

LU WWN Device Id: 5 0014ee 205ba3de6

Firmware Version: 80.00A80

User Capacity: 1,000,204,886,016 bytes [1.00 TB]

Sector Size: 512 bytes logical/physical

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS (minor revision not indicated)

SATA Version is: SATA 2.6, 3.0 Gb/s

Local Time is: Mon Jul 10 19:39:47 2017 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 121) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: (21660) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 249) minutes.

Conveyance self-test routine

recommended polling time: ( 5) minutes.

SCT capabilities: (0x3035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 6

3 Spin_Up_Time 0x0027 214 183 021 Pre-fail Always - 5291

4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1724

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0

9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15818

10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0

11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0

12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1702

192 Power-Off_Retract_Count 0x0032 199 199 000 Old_age Always - 1220

193 Load_Cycle_Count 0x0032 147 147 000 Old_age Always - 161209

194 Temperature_Celsius 0x0022 111 103 000 Old_age Always - 36

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed: read failure 90% 15793 1899286208

# 2 Extended offline Completed without error 00% 15458 -

# 3 Short offline Completed without error 00% 15455 -

# 4 Short offline Completed: read failure 40% 15420 2114456

# 5 Short offline Completed without error 00% 14634 -

# 6 Short offline Completed without error 00% 14610 -

# 7 Short offline Completed without error 00% 14422 -

# 8 Short offline Completed without error 00% 14399 -

# 9 Short offline Completed without error 00% 14376 -

#10 Short offline Completed without error 00% 14352 -

#11 Short offline Completed without error 00% 14328 -

#12 Short offline Completed without error 00% 14305 -

#13 Short offline Completed without error 00% 14296 -

#14 Short offline Completed without error 00% 14276 -

#15 Short offline Completed without error 00% 14253 -

#16 Short offline Completed without error 00% 14230 -

#17 Extended offline Completed without error 00% 14211 -

#18 Short offline Completed without error 00% 14206 -

#19 Short offline Completed without error 00% 14182 -

#20 Short offline Completed without error 00% 14159 -

#21 Short offline Completed without error 00% 14137 -

1 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

root@sysresccd /root %

So there is 1 pending sector. That means there is a part of the disk that can't be read properly and is therefore scheduled for re-location. However, it can't be relocated until it's know what data is in that sector. Sound like a chicken-and-egg problem? It is, unless you know a trick.

The disk itself will try to read the sector for a while, and then gives up. It logs a lot of errors in dmesg when this happens, like so:

[14151.587898] ata2.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0

[14151.587903] ata2.00: irq_stat 0x40000008

[14151.587909] ata2.00: failed command: READ FPDMA QUEUED

[14151.587920] ata2.00: cmd 60/08:38:c0:ce:34/00:00:71:00:00/40 tag 7 ncq dma 4096 in

res 41/40:00:c0:ce:34/00:00:71:00:00/40 Emask 0x409 (media error) <F>

[14151.587924] ata2.00: status: { DRDY ERR }

[14151.587927] ata2.00: error: { UNC }

[14151.591853] ata2.00: configured for UDMA/133

[14151.591884] sd 1:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

[14151.591890] sd 1:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]

[14151.591896] sd 1:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed

[14151.591903] sd 1:0:0:0: [sdb] tag#7 CDB: Read(10) 28 00 71 34 ce c0 00 00 08 00

[14151.591907] blk_update_request: I/O error, dev sdb, sector 1899286208

[14151.591911] Buffer I/O error on dev sdb, logical block 237410776, async page read

[14151.591965] ata2: EH complete

Over and over again. Here the sector is printed: 1899286208. This is the same as the 'LBA of first error' that the smart health tests found. Lets verify manually that this sector is indeed having problems:
root@sysresccd /root % hdparm --read-sector 1899286208 /dev/sdb

/dev/sdb:
reading sector 1899286208: SG_IO: bad/missing sense data, sb[]: 70 00 03 00 00 00 00 0a 40 51 e0 01 11 04 00 00 a0 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
root@sysresccd /root %

So how do we fix this?

Either we can try to use ddrescue to read and re-read the same part of that disk over and over again. We can tell ddrescue to try to read the data forwards or backwards, or try it in a number of different ways. But in this case that did not help. The only solution to make sure we know what data is there, is by overwriting it. That can have desastrous consequences (although having a system not being able to boot is already pretty bad), unless we can figure out which file(s) are in the bad sector and copy those exact same files onto there. Since this is the boot disk, that should hopefully be pretty straightforward.
If we wouldn't care about getting correct data back in place, we could just overwrite the sector like so:
hdparm --write-sector 1899286208 /dev/sdb
If you have a raid that can self-heal, like zfs, that would be a great way to solve the problem. if you use mdadm, don't do thsis.

The easiest way to force a write is by using 'badblocks -nvs /dev/sdb2' but that will take a long time (it will read and write back the entire partition) and because it needs to read the data before it can write, the chances of success are not great in this case. So instead we could try to figure out which file or files has or have the bad sector. However, because I'm using LVM, I think that it's going to be quite a challenge to convert the physical sector into a logical block number relative to the partition and LVM logical volume.

So instead I opted to just overwrite the sector and run fsck. If it's in a critical area, it will tell me so and hopefully state which file was having the problem. I can then overwrite it with a proper file.
root@sysresccd /root % hdparm --write-sector 1899286208 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb:
re-writing sector 1899286208: succeeded
root@sysresccd /root % hdparm --read-sector 1899286208 /dev/sdb

/dev/sdb:
reading sector 1899286208: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

Just to be clear, this is extremely dangerous. By 0-ing data, you'll have no idea what sector of the operating system or which file was changed. This could have desastrous consequences, such as wiping out your disks, becoming self-aware and many other possible problems.

Very interesting is that the disk is now acting asif nothing is the matter:

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

Apparantly the disk no longer thinks the sector is bad and no longer sees a need to map it. Or it did map it, but it's not counting the event. Other sectors on the disk are also unreadable though, but the disk is not reporting a problem about that:

root@sysresccd /root % hdparm --read-sector 1899286209 /dev/sdb

/dev/sdb:

reading sector 1899286209: SG_IO: bad/missing sense data, sb[]: 70 00 03 00 00 00 00 0a 40 51 e0 01 11 04 00 00 a0 c1 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Anyway, after an fsck, everything should now be corrected. A reboot later and the machine is up again. It's almost like nothing ever happened. Except for that data hole... somewhere...

All of this would not have happened if I would have had a mirror for my boot disk. Since I have (almost) all of my important data on ZFS, the potential data loss would be limited to the OS and configuration. That can all be recovered (just re-install) but it's a lot of work that a simple mirror could have prevented. I have a new project to complete :)

woensdag 5 april 2017

The proper way to remove or replace a CEPH osd

I noticed there are a lot of articles, posts and howto's on the internet that say that removing or replacing a CEPH osd is as easy as:

ceph osd down osd.#
ceph osd crush remove osd.#
ceph auth del osd.#
ceph osd rm osd.#
# And then on the node where the osd is placed, do this if needed:
systemctl stop ceph-osd@#
umount /var/lib/ceph/osd/ceph-#

But this is only true when the osd you are doing this to has completely failed! if the OSD still has data and you are doing this because you wish to enlarge the OSD or are replacing the disk proactively, you risk losing a lot of data. That's because, depending on the replication setting of your pools, the data you need may be on that disk you are removing. And if you have a replication count of 2, where there is one copy of the data on the osd you are replacing and another copy on a different osd, you will risk running into read errors when ceph needs to read the other copy to restore the replication count. Again, if you only have a replication count of 2 (not recommended for production systems) and you lose 1 disk or pull it like the internet posts suggest to do, you risk losing (part of) your data. ceph may be able to detect that data loss through scrubbing, but it can't repair it because there are no more redundant copies.
What if you use a replication count of 3? You are still at risk with the often posted way to remove an osd. That's because ceph doesn't guarantee data is spread evenly over the osd's! It 'makes every effort' to do so, but it doesn't guarantee. That means that your osd may actually be holding 2 copies of the data. Especially when you have an imbalanced cluster where not every node has equal weight and your data is not evenly spread (so pretty much every ceph cluster out there). I think you've set a replication count of 3 to specifically not have to deal with those types of problems.
What if you could 'vacate' the osd before just pulling it? That way ceph has the chance to use all replica's to write to a new location. This way there is less risk of data loss (because you still have R copies to read from, where R is your replication setting).

So here is the right way to remove/replace an OSD that still has usable data on it. Use this in production systems:

ceph osd df | grep '^# '

# note the amount of PG's on the OSD

ceph osd out osd.#

sleep 120

ceph osd df | grep '^# '

# you should see the amount of PG's on the osd decrease.

# At some point you'll hit 0:

[root@XXXXX ~]# ceph osd df | egrep '^15 '

15 2.73000 0 0 0 0 -nan -nan 0

# Now you can continue as before:
ceph osd down osd.#
ceph osd crush remove osd.#
ceph auth del osd.#
ceph osd rm osd.#
# And then on the node where the osd is placed, do this if needed:
systemctl stop ceph-osd@#
umount /var/lib/ceph/osd/ceph-#

And if you don't plan on moving the disk to some other ceph node, be sure to use 'ceph-disk --zap' to clear it, before you store/reuse the disk.

I hope this prevents you from having some headaches.