Thursday, March 21, 2019

The benefits of using a 25gbit switch with 10gbit network cards

At the company I currently work for, we are looking into improving our ceph performance. The latency of the storage translates directly into iops and can be captured with this formula: iops = 1/latency (in seconds).
For example, with a latency of 4ms a client would achieve 1/0.004 = 250 iops.
Getting the latency down by 2ms to 2ms increases the performance to 500 iops. So just 2ms lower latency doubles the performance! Getting it down 1 more ms brings the performance to 1000 iops. Of course it gets harder and harder to lower latency the closer one gets to 0ms (which would be infinitely fast).
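
To make the relationship concrete, here is a small Python sketch of the iops = 1/latency arithmetic, using the example latencies above:

def iops(latency_seconds: float) -> float:
    # IOPS achievable by a client issuing one I/O at a time.
    return 1.0 / latency_seconds

# Example latencies from the text, in milliseconds.
for latency_ms in (4, 2, 1):
    print(f"{latency_ms} ms -> {iops(latency_ms / 1000):.0f} iops")
# 4 ms -> 250 iops
# 2 ms -> 500 iops
# 1 ms -> 1000 iops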

The total latency experienced by a ceph client can be approximated by: software stack latency (ceph code) + kernel latency + network latency + storage latency (including the HBA if you are using one). Each of those can be improved in its own way. In this article we are looking into network latency.

We are already using 10gb switches with 10gb cards. While testing with a vendor, I noticed that their use of a 25gb switch improved the latency a lot. I arranged to borrow a 25gb switch and compared the latency using each switch with different network cards. The following network cards were selected for the tests:
- curvature Intel 82599 (2S010AC01-CURV)
- QLogic 57810 Dual Port 10Gb Direct Attach/SFP+
- Mellanox ConnectX-3 Pro Dual Port 10 GbE SFP+
- SolarFlare 8522 10Gb 2 Port SFP+
- Lenovo LOM Intel X722 base-T

Two of each of those cards combined with the two switches gives 30 possible combinations to test. That seems doable. But during the latency tests I found that jitter (variation in latency) can be quite an important factor, so tests need to run for a while. In addition, performance differs based on the Ethernet frame size that is being transmitted. A test of 10 minutes per frame size and 10 frame sizes already takes 1 hour and 40 minutes, and we need at least 3 runs of those per setup to be sure the variance is known. So I cut down the number of tests a bit (the combination and time arithmetic is sketched right after the list). This is what will be tested:
switch - 1st server nic - 2nd server nic
10g - curvature  - curvature
25g - curvature  - curvature
10g - QLogic - curvature
25g - QLogic - curvature
10g - Mellanox - curvature
25g - Mellanox - curvature
10g - SolarFlare - curvature
25g - SolarFlare - curvature
10g - Mellanox - Mellanox
25g - Mellanox - Mellanox
10g - X722 - X722

Doing these tests resulted in over 5000 data points. For the tests I first needed to decide which performance testing tool to use. ping isn't up to the job because it uses ICMP, which is a different type of traffic than what we'd actually be using (TCP), and because it's stateless that may impact the measurements. I couldn't get TCP ping to work.
iPerf is then the obvious choice, but version 3 of iPerf no longer has the nice jitter output that iPerf2 has. iPerf2 turned out to show a very large variance in its results when run repeatedly, which would mean run times much longer than 10 minutes to get meaningful results.
Instead I chose qperf, which is also used to test RDMA connections. It's easy to use and provides reasonably consistent results.

For example:
root@pve-04:~# stdbuf -oL  qperf -t 60 -to 10 --use_bits_per_sec -vv -uu -un  x.x.x.x tcp_lat
tcp_lat:
    latency          =        11711 ns
    msg_rate         =        85393 /sec
    msg_size         =            1 bytes
    time             =  60000000000 ns
    timeout          =  10000000000 ns
    loc_cpus_used    =         19.2 % cpus
    loc_cpus_user    =         1.98 % cpus
    loc_cpus_intr    =         0.03 % cpus
    loc_cpus_kernel  =         17.2 % cpus
    loc_cpus_iowait  =         0.02 % cpus
    loc_real_time    =  60000000000 ns
    loc_cpu_time     =  11510000229 ns
    loc_send_bytes   =      2561784 bytes
    loc_recv_bytes   =      2561783 bytes
    loc_send_msgs    =      2561784
    loc_recv_msgs    =      2561783
    rem_cpus_used    =         33.6 % cpus
    rem_cpus_user    =         1.18 % cpus
    rem_cpus_intr    =         0.05 % cpus
    rem_cpus_kernel  =         32.4 % cpus
    rem_real_time    =  60000000000 ns
    rem_cpu_time     =  20170000076 ns
    rem_send_bytes   =      2561784 bytes
    rem_recv_bytes   =      2561784 bytes
    rem_send_msgs    =      2561784
    rem_recv_msgs    =      2561784
root@pve-04:~#

This shows us that the latency at a 1-byte message size is 11711 nanoseconds, which is about 11.7 µs, or 0.0117 ms.
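
A sweep over the different message sizes can be scripted along these lines. This is a minimal sketch, not the exact script used for the tests: the target address is a placeholder, the list of message sizes is an assumption, and it relies on qperf's -m (message size) option.

import subprocess

HOST = "x.x.x.x"                             # placeholder target, as in the example above
MSG_SIZES = [1, 64, 512, 1024, 4096, 9000]   # assumed sweep of message sizes in bytes
RUN_SECONDS = 600                            # 10 minutes per size, per the test plan

for size in MSG_SIZES:
    # Mirrors the qperf invocation above, adding -m to set the message size.
    cmd = ["qperf", "-t", str(RUN_SECONDS), "-to", "10",
           "--use_bits_per_sec", "-vv", "-uu", "-un",
           "-m", str(size), HOST, "tcp_lat"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"--- msg_size={size} ---")
    print(result.stdout)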
When we look at our ceph latency, we are now between 0.8ms (for a cluster with 3 NVMe nodes) and 2ms (for slower SATA SSDs). Most of our ceph traffic uses a 9k MTU. The difference between the fastest and slowest network configuration was 72712 ns vs 38444 ns, so about 34268 ns. With a 3-node cluster that difference is doubled, because the master osd needs to communicate the changes to the slave osds. So we gain a latency benefit of about 68.5µs per 9k frame. On a total latency of 0.8ms (= 800µs) that is almost 10%: 0.8ms gives us 1/0.0008 = 1250 iops, but with the faster nic that becomes 1/0.00073 ≈ 1370 iops. For the slower SSDs it goes from 1/0.002 = 500 iops to 1/0.00193 ≈ 518 iops. The benefit for slower storage is relatively smaller but still there; even for slow storage it is still about 4%.
For other frame sizes the differences can be larger (up to 3x faster instead of 2x) or smaller.
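
The arithmetic behind those numbers, as a small Python sketch (the 0.8ms and 2ms baselines and the doubling for the osd-to-osd hop are the assumptions stated above):

# Latency gain per 9k frame between the slowest and fastest configuration.
slowest_ns = 72712
fastest_ns = 38444
gain_s = 2 * (slowest_ns - fastest_ns) / 1e9   # doubled for the osd-to-osd hop, ~68.5 µs

for name, base_ms in (("NVMe cluster", 0.8), ("SATA SSD cluster", 2.0)):
    base_s = base_ms / 1000
    before = 1 / base_s
    after = 1 / (base_s - gain_s)
    print(f"{name}: {before:.0f} -> {after:.0f} iops (+{(after / before - 1) * 100:.1f}%)")
# NVMe cluster: 1250 -> 1367 iops (+9.4%)
# SATA SSD cluster: 500 -> 518 iops (+3.5%)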

Please keep in mind that these are synthetic benchmarks. Only actual ceph benchmarks will tell the true difference. I didn't have the machines available to test that, though.