Wednesday, May 25, 2016

Fixing a ceph imbalance

At 50-60% full, our ceph cluster was already reporting 1 (and sometimes 2) osd's as "near full". That's odd for a cluster that's only 50-60% used. Worse, there were 2 pg's in a degraded state because the cluster could not write the 3rd replica. Maybe those problems are related? This situation started after we created a new pool with two images in it, back when we were still on infernalis.
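To see exactly which osd's and pg's are involved, ceph can list the offenders directly; for example (output omitted here):

$ sudo ceph health detail            # names the near-full osd's and the degraded pg's
$ sudo ceph pg dump_stuck unclean    # lists the pg's that are stuck unclean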

The details (ceph status output):
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            2 pgs degraded
            2 pgs stuck unclean
            2 pgs undersized
            recovery 696/10319262 objects degraded (0.007%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18824, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e105351: 30 osds: 30 up, 30 in
      pgmap v13008499: 2880 pgs, 5 pools, 13406 GB data, 3359 kobjects
            40727 GB used, 20706 GB / 61433 GB avail
            696/10319262 objects degraded (0.007%)
                2878 active+clean
                   2 active+undersized+degraded


ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.92989 root default
-2 22.20000     host md002
18  1.84999         osd.18      up  1.00000          1.00000
19  1.84999         osd.19      up  1.00000          1.00000
20  1.84999         osd.20      up  1.00000          1.00000
21  1.84999         osd.21      up  1.00000          1.00000
22  1.84999         osd.22      up  1.00000          1.00000
23  1.84999         osd.23      up  1.00000          1.00000
24  1.84999         osd.24      up  1.00000          1.00000
25  1.84999         osd.25      up  1.00000          1.00000
26  1.84999         osd.26      up  1.00000          1.00000
27  1.84999         osd.27      up  1.00000          1.00000
28  1.84999         osd.28      up  1.00000          1.00000
29   1.84999        osd.29      up  1.00000          1.00000
-3 22.20000     host md008
 0  1.84999         osd.0       up  1.00000          1.00000
 3  1.84999         osd.3       up  1.00000          1.00000
 5  1.84999         osd.5       up  1.00000          1.00000
 7  1.84999         osd.7       up  1.00000          1.00000
11  1.84999         osd.11      up  1.00000          1.00000
10  1.84999         osd.10      up  1.00000          1.00000
 1  1.84999         osd.1       up  1.00000          1.00000
 6  1.84999         osd.6       up  1.00000          1.00000
 8  1.84999         osd.8       up  1.00000          1.00000
 9  1.84999         osd.9       up  1.00000          1.00000
 2  1.84999         osd.2       up  1.00000          1.00000
 4  1.84999         osd.4       up  1.00000          1.00000
-5 16.37999     host md005
14  2.73000         osd.14      up  0.79999          1.00000
15  2.73000         osd.15      up  0.79999          1.00000
16  2.73000         osd.16      up  0.79999          1.00000
12  2.73000         osd.12      up  0.79999          1.00000
17  2.73000         osd.17      up  0.79999          1.00000
13  2.73000         osd.13      up  0.79999          1.00000

The osd tree shows that we have 3 machines, two of them with about 22TB and one with about 16TB. Of course that means that with a replication count of three we can only store about 16TB if we want each pg to be placed on a separate machine. Currently we are using about 13.4TB, and that's about 13.4/16 = 83% of the available capacity. That's getting close to being a problem.

Because not all pg's are of equal size (it seems), some disks are also used a lot more than others:
[root@md010 ~]# sudo ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
18 1.84999  1.00000  1861G  1166G   695G 62.64 0.97 238
19 1.84999  1.00000  1861G  1075G   785G 57.78 0.89 223
20 1.84999  1.00000  1861G  1160G   701G 62.32 0.96 278
21 1.84999  1.00000  1861G   972G   888G 52.26 0.81 230
22 1.84999  1.00000  1861G  1073G   788G 57.65 0.89 235
23 1.84999  1.00000  1861G  1077G   784G 57.87 0.89 242
24 1.84999  1.00000  1861G  1135G   725G 61.00 0.94 226
25 1.84999  1.00000  1861G  1154G   707G 62.01 0.96 245
26 1.84999  1.00000  1861G  1096G   764G 58.91 0.91 243
27 1.84999  1.00000  1861G  1080G   781G 58.03 0.90 234
28 1.84999  1.00000  1861G  1036G   825G 55.66 0.86 237
29 1.84999  1.00000  1861G  1224G   637G 65.76 1.02 249
 0 1.84999  1.00000  1861G  1146G   714G 61.61 0.95 242
 3 1.84999  1.00000  1861G  1101G   760G 59.16 0.91 237
 5 1.84999  1.00000  1861G  1122G   739G 60.29 0.93 246
 7 1.84999  1.00000  1861G  1128G   732G 60.64 0.94 249
11 1.84999  1.00000  1861G  1115G   746G 59.92 0.93 226
10 1.84999  1.00000  1861G  1127G   733G 60.58 0.94 229
 1 1.84999  1.00000  1861G  1090G   771G 58.57 0.90 239
 6 1.84999  1.00000  1861G  1248G   612G 67.09 1.04 241
 8 1.84999  1.00000  1861G   902G   959G 48.47 0.75 243
 9 1.84999  1.00000  1861G   987G   874G 53.04 0.82 217
 2 1.84999  1.00000  1861G  1252G   608G 67.31 1.04 259
 4 1.84999  1.00000  1861G  1125G   735G 60.47 0.93 252
14 2.73000  0.79999  2792G  2321G   471G 83.13 1.28 489
15 2.73000  0.79999  2792G  2142G   649G 76.73 1.19 474
16 2.73000  0.79999  2792G  2081G   711G 74.52 1.15 460
12 2.73000  0.79999  2792G  2375G   416G 85.08 1.31 494
17 2.73000  0.79999  2792G  1947G   845G 69.72 1.08 462
13 2.73000  0.79999  2792G  2307G   484G 82.64 1.28 499
              TOTAL 61433G 39778G 21655G 64.75
MIN/MAX VAR: 0.75/1.31  STDDEV: 8.65

So indeed, some disks on md005 (the host that only has 16TB) are already 85% full.

Since the 3rd machine has some free disk slots, I decided to move one disk each from md002 and md008 to md005. That reduces the space on md002 and md008 and adds it to md005, so each node will then have about 20TB. That should increase the maximum capacity we can store from 16TB to 20TB and decrease the used percentage from 83% to 13.4/20 = 67%. Much better already.
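A quick sanity check of those percentages (just bc arithmetic on the rounded numbers above):

$ echo "scale=1; 13.4*100/16" | bc    # before: usable space limited by the 16TB host
83.7
$ echo "scale=1; 13.4*100/20" | bc    # after: every host has about 20TB
67.0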

Doing this was easy:
stop the osd
unmount it
identify the physical disk (map the osd nr to the device id to the virtual disk to the physical disk and set blinking on)
move it physically
import the foreign disk on md005
mount it
start the osd
and repeat for the other osd (a rough sketch of the commands follows below).
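This is roughly what that looked like on the command line, per osd. It's a sketch only: the device name is a placeholder, it assumes the systemd-managed osds we had on infernalis, and the RAID-controller steps (blinking the drive, importing the foreign config) are vendor-specific and omitted:

# on the source host, with noout set so the cluster doesn't start re-replicating
# while the disk is in transit
$ sudo ceph osd set noout
$ sudo systemctl stop ceph-osd@29
$ sudo umount /var/lib/ceph/osd/ceph-29

# after physically moving the disk and importing it on md005's controller:
$ sudo ceph-disk activate /dev/sdX1    # mounts the data partition and starts the osd
$ sudo ceph osd unset noout
$ sudo ceph osd tree                   # with the default 'osd crush update on start = true',
                                       # the osd should now appear under host md005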

The osd tree then looks like this:
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 60.77974 root default
-2 20.34990     host md002
18  1.84999         osd.18      up  1.00000          1.00000
19  1.84999         osd.19      up  1.00000          1.00000
20  1.84999         osd.20      up  1.00000          1.00000
21  1.84999         osd.21      up  1.00000          1.00000
22  1.84999         osd.22      up  1.00000          1.00000
23  1.84999         osd.23      up  1.00000          1.00000
24  1.84999         osd.24      up  1.00000          1.00000
25  1.84999         osd.25      up  1.00000          1.00000
26  1.84999         osd.26      up  1.00000          1.00000
27  1.84999         osd.27      up  1.00000          1.00000
28  1.84999         osd.28      up  1.00000          1.00000
-3 20.34990     host md008
 0  1.84999         osd.0       up  1.00000          1.00000
 3  1.84999         osd.3       up  1.00000          1.00000
 5  1.84999         osd.5       up  1.00000          1.00000
 7  1.84999         osd.7       up  1.00000          1.00000
10  1.84999         osd.10      up  1.00000          1.00000
 1  1.84999         osd.1       up  1.00000          1.00000
 6  1.84999         osd.6       up  1.00000          1.00000
 8  1.84999         osd.8       up  1.00000          1.00000
 9  1.84999         osd.9       up  1.00000          1.00000
 2  1.84999         osd.2       up  1.00000          1.00000
 4  1.84999         osd.4       up  1.00000          1.00000
-5 20.07994     host md005
14  2.73000         osd.14      up  0.79999          1.00000
15  2.73000         osd.15      up  0.79999          1.00000
16  2.73000         osd.16      up  0.79999          1.00000
12  2.73000         osd.12      up  0.79999          1.00000
17  2.73000         osd.17      up  0.79999          1.00000
13  2.73000         osd.13      up  0.79999          1.00000
11  1.84999         osd.11      up  1.00000          1.00000
29  1.84998         osd.29      up  1.00000          1.00000

Since two osd's have physically moved to md005, some pg's may now have two replicas on that host, so a lot of data needs to move and everything will rebalance. This is what that looks like:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            13 pgs backfill_toofull
            214 pgs backfill_wait
            71 pgs backfilling
            738 pgs degraded
            52 pgs recovering
            610 pgs recovery_wait
            876 pgs stuck unclean
            76 pgs undersized
            recovery 774247/11332909 objects degraded (6.832%)
            recovery 2100222/11332909 objects misplaced (18.532%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18834, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e105775: 30 osds: 30 up, 30 in; 292 remapped pgs
      pgmap v13024840: 2880 pgs, 5 pools, 13490 GB data, 3380 kobjects
            41515 GB used, 19918 GB / 61433 GB avail
            774247/11332909 objects degraded (6.832%)
            2100222/11332909 objects misplaced (18.532%)
                1933 active+clean
                 603 active+recovery_wait+degraded
                 171 active+remapped+wait_backfill
                  52 active+recovering+degraded
                  45 active+undersized+degraded+remapped+backfilling
                  30 active+undersized+degraded+remapped+wait_backfill
                  26 active+remapped+backfilling
                  12 active+remapped+wait_backfill+backfill_toofull
                   7 active+recovery_wait+degraded+remapped
                   1 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
recovery io 351 MB/s, 88 objects/s
  client io 1614 kB/s rd, 158 kB/s wr, 0 op/s rd, 0 op/s wr


during the recovery:
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
18 1.84999  1.00000  1861G  1311G   549G 70.47 1.04 257
19 1.84999  1.00000  1861G  1189G   671G 63.92 0.94 268
20 1.84999  1.00000  1861G  1360G   501G 73.06 1.08 307
21 1.84999  1.00000  1861G  1217G   644G 65.39 0.96 279
22 1.84999  1.00000  1861G  1254G   606G 67.41 0.99 263
23 1.84999  1.00000  1861G  1184G   676G 63.64 0.94 262
24 1.84999  1.00000  1861G  1245G   616G 66.90 0.99 264
25 1.84999  1.00000  1861G  1292G   568G 69.44 1.02 285
26 1.84999  1.00000  1861G  1239G   621G 66.61 0.98 271
27 1.84999  1.00000  1861G  1210G   651G 65.01 0.96 259
28 1.84999  1.00000  1861G  1204G   656G 64.73 0.95 259
 0 1.84999  1.00000  1861G  1218G   643G 65.43 0.96 272
 3 1.84999  1.00000  1861G  1177G   684G 63.25 0.93 262
 5 1.84999  1.00000  1861G  1175G   685G 63.17 0.93 265
 7 1.84999  1.00000  1861G  1270G   591G 68.24 1.01 307
10 1.84999  1.00000  1861G  1162G   699G 62.44 0.92 262
 1 1.84999  1.00000  1861G  1200G   660G 64.51 0.95 274
 6 1.84999  1.00000  1861G  1271G   590G 68.28 1.01 277
 8 1.84999  1.00000  1861G  1024G   836G 55.04 0.81 273
 9 1.84999  1.00000  1861G  1101G   760G 59.16 0.87 250
 2 1.84999  1.00000  1861G  1291G   570G 69.38 1.02 276
 4 1.84999  1.00000  1861G  1167G   694G 62.70 0.92 264
14 2.73000  0.79999  2792G  2363G   428G 84.64 1.25 429
15 2.73000  0.79999  2792G  2189G   603G 78.39 1.15 398
16 2.73000  0.79999  2792G  2169G   622G 77.70 1.14 399
12 2.73000  0.79999  2792G  2388G   404G 85.52 1.26 452
17 2.73000  0.79999  2792G  2072G   720G 74.21 1.09 385
13 2.73000  0.79999  2792G  2321G   471G 83.13 1.22 425
11 1.84999  1.00000  1861G  1071G   790G 57.54 0.85 348
29 1.84998  1.00000  1861G   354G  1506G 19.06 0.28 353
              TOTAL 61433G 41703G 19730G 67.88
MIN/MAX VAR: 0.28/1.26  STDDEV: 11.50


And once it was done, the result was rather surprising:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            13 pgs backfill_toofull
            6 pgs backfilling
            19 pgs stuck unclean
            recovery 120/10824993 objects degraded (0.001%)
            recovery 184018/10824993 objects misplaced (1.700%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107178: 30 osds: 30 up, 30 in; 19 remapped pgs
      pgmap v13145938: 2880 pgs, 5 pools, 13936 GB data, 3491 kobjects
            42471 GB used, 18962 GB / 61433 GB avail
            120/10824993 objects degraded (0.001%)
            184018/10824993 objects misplaced (1.700%)
                2857 active+clean
                  13 active+remapped+backfill_toofull
                   6 active+remapped+backfilling
                   4 active+clean+scrubbing

So apparently, despite the better balance, and despite the fact that we should now be able to store more, we actually have more pg's with problems than before.
When taking a closer look at the individual osd’s:

$ sudo ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42403G 19030G 69.02 1.00   0 root default
-2 20.34990        - 20477G 14093G  6383G 68.83 1.00   0     host md002
18  1.84999  1.00000  1861G  1250G   611G 67.16 0.97 247         osd.18
19  1.84999  1.00000  1861G  1248G   613G 67.04 0.97 259         osd.19
20  1.84999  1.00000  1861G  1446G   415G 77.68 1.13 301         osd.20
21  1.84999  1.00000  1861G  1246G   615G 66.96 0.97 271         osd.21
22  1.84999  1.00000  1861G  1197G   664G 64.32 0.93 253         osd.22
23  1.84999  1.00000  1861G  1221G   640G 65.61 0.95 256         osd.23
24  1.84999  1.00000  1861G  1312G   548G 70.51 1.02 257         osd.24
25  1.84999  1.00000  1861G  1434G   426G 77.08 1.12 279         osd.25
26  1.84999  1.00000  1861G  1233G   627G 66.27 0.96 261         osd.26
27  1.84999  1.00000  1861G  1215G   645G 65.31 0.95 248         osd.27
28  1.84999  1.00000  1861G  1287G   574G 69.15 1.00 255         osd.28
-3 20.34990        - 20477G 14215G  6261G 69.42 1.01   0     host md008
 0  1.84999  1.00000  1861G  1305G   555G 70.13 1.02 265         osd.0
 3  1.84999  1.00000  1861G  1324G   537G 71.13 1.03 257         osd.3
 5  1.84999  1.00000  1861G  1210G   650G 65.05 0.94 255         osd.5
 7  1.84999  1.00000  1861G  1402G   459G 75.34 1.09 295         osd.7
10  1.84999  1.00000  1861G  1365G   496G 73.33 1.06 254         osd.10
 1  1.84999  1.00000  1861G  1260G   600G 67.73 0.98 267         osd.1
 6  1.84999  1.00000  1861G  1502G   358G 80.73 1.17 272         osd.6
 8  1.84999  1.00000  1861G  1144G   717G 61.45 0.89 269         osd.8
 9  1.84999  1.00000  1861G  1176G   685G 63.18 0.92 240         osd.9
 2  1.84999  1.00000  1861G  1328G   532G 71.38 1.03 266         osd.2
 4  1.84999  1.00000  1861G  1194G   667G 64.17 0.93 251         osd.4
-5 20.07994        - 20478G 14093G  6385G 68.82 1.00   0     host md005
14  2.73000  0.84999  2792G  1861G   931G 66.64 0.97 392         osd.14
15  2.73000  0.79999  2792G  1769G  1022G 63.37 0.92 354         osd.15
16  2.73000  0.79999  2792G  1722G  1070G 61.68 0.89 353         osd.16
12  2.73000  0.79999  2792G  2016G   775G 72.22 1.05 405         osd.12
17  2.73000  0.79999  2792G  1647G  1145G 59.00 0.85 349         osd.17
13  2.73000  0.79999  2792G  1832G   960G 65.60 0.95 378         osd.13
11  1.84999  1.00000  1861G  1627G   233G 87.44 1.27 326         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.83 1.26 345         osd.29
               TOTAL 61433G 42403G 19030G 69.02
MIN/MAX VAR: 0.85/1.27  STDDEV: 6.91

We can see that the disks that were moved to md005 (osd's 11 and 29) are now almost full, while the other disks in md005, which are much larger, aren't used nearly as much. We can also see that the larger disks have a lower reweight, so less data goes to them; that is probably no longer needed. For the least used osd's in md005 I'll increase the reweight so more data will go there.

There are 2 ways to reweight:
ceph osd crush reweight
ceph osd reweight
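Their syntax differs roughly like this (the values here are only examples, not commands I ran):

$ sudo ceph osd crush reweight osd.14 2.73    # changes the crush weight (by convention roughly the disk size in TB)
$ sudo ceph osd reweight 14 0.85              # sets a 0-1 override on top of the crush weight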

Since our osd reweight values are not at the default '1', I'll bring those closer to 1. Also because that setting is apparently not persistent: when an osd goes 'out', the setting is lost. More information on that can be found here: http://ceph.com/planet/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight/

$ sudo ceph osd reweight 15 0.85
reweighted osd.15 to 0.85 (d999)
$ sudo ceph osd reweight 16 0.85
reweighted osd.16 to 0.85 (d999)
$ sudo ceph osd reweight 17 0.85
reweighted osd.17 to 0.85 (d999)

I only reweighted the 3 osd's that had the least data on them, and only increased the weight by 0.05. I did this because I wanted to see the effects before taking the next step, and increasing by only 0.05 shouldn't take too many hours to re-place all the pg's that now map somewhere else.

    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            clock skew detected on mon.md010
            16 pgs backfill_toofull
            15 pgs backfill_wait
            22 pgs backfilling
            51 pgs degraded
            23 pgs recovering
            25 pgs recovery_wait
            56 pgs stuck unclean
            recovery 48630/10962135 objects degraded (0.444%)
            recovery 458174/10962135 objects misplaced (4.180%)
            2 near full osd(s)
            Monitor clock skew detected
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107399: 30 osds: 30 up, 30 in; 51 remapped pgs
      pgmap v13147453: 2880 pgs, 5 pools, 13937 GB data, 3491 kobjects
            42445 GB used, 18988 GB / 61433 GB avail
            48630/10962135 objects degraded (0.444%)
            458174/10962135 objects misplaced (4.180%)
                2778 active+clean
                  25 active+recovery_wait+degraded
                  23 active+recovering+degraded
                  22 active+remapped+backfilling
                  13 active+remapped+backfill_toofull
                  12 active+remapped+wait_backfill
                   3 active+degraded
                   3 active+remapped+wait_backfill+backfill_toofull
                   1 active+remapped
recovery io 361 MB/s, 90 objects/s
  client io 3611 kB/s wr, 0 op/s rd, 3 op/s wr

See? Only a small number of pg's is actually involved. Hopefully this change will result in osd's 11 and 29 becoming a bit less utilised and numbers 15 through 17 a bit more.

<some time later>

Indeed it made the situation slightly less bad:
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42538G 18895G 69.24 1.00   0 root default
-2 20.34990        - 20477G 14123G  6353G 68.97 1.00   0     host md002
18  1.84999  1.00000  1861G  1237G   624G 66.47 0.96 246         osd.18
19  1.84999  1.00000  1861G  1243G   617G 66.81 0.96 261         osd.19
20  1.84999  1.00000  1861G  1472G   389G 79.09 1.14 303         osd.20
21  1.84999  1.00000  1861G  1241G   619G 66.70 0.96 268         osd.21
22  1.84999  1.00000  1861G  1220G   640G 65.58 0.95 254         osd.22
23  1.84999  1.00000  1861G  1207G   654G 64.85 0.94 252         osd.23
24  1.84999  1.00000  1861G  1327G   533G 71.32 1.03 258         osd.24
25  1.84999  1.00000  1861G  1434G   426G 77.08 1.11 279         osd.25
26  1.84999  1.00000  1861G  1221G   639G 65.64 0.95 262         osd.26
27  1.84999  1.00000  1861G  1228G   632G 66.01 0.95 249         osd.27
28  1.84999  1.00000  1861G  1287G   574G 69.14 1.00 255         osd.28
-3 20.34990        - 20477G 14247G  6230G 69.58 1.00   0     host md008
 0  1.84999  1.00000  1861G  1293G   568G 69.48 1.00 262         osd.0
 3  1.84999  1.00000  1861G  1326G   535G 71.25 1.03 258         osd.3
 5  1.84999  1.00000  1861G  1234G   627G 66.29 0.96 256         osd.5
 7  1.84999  1.00000  1861G  1406G   455G 75.53 1.09 296         osd.7
10  1.84999  1.00000  1861G  1353G   507G 72.73 1.05 254         osd.10
 1  1.84999  1.00000  1861G  1274G   587G 68.46 0.99 267         osd.1
 6  1.84999  1.00000  1861G  1485G   375G 79.80 1.15 269         osd.6
 8  1.84999  1.00000  1861G  1144G   717G 61.46 0.89 269         osd.8
 9  1.84999  1.00000  1861G  1204G   657G 64.69 0.93 242         osd.9
 2  1.84999  1.00000  1861G  1331G   529G 71.54 1.03 266         osd.2
 4  1.84999  1.00000  1861G  1193G   668G 64.09 0.93 250         osd.4
-5 20.07994        - 20478G 14167G  6311G 69.18 1.00   0     host md005
14  2.73000  0.84999  2792G  1961G   831G 70.23 1.01 398         osd.14
15  2.73000  0.84999  2792G  1850G   942G 66.25 0.96 372         osd.15
16  2.73000  0.84999  2792G  1780G  1012G 63.76 0.92 366         osd.16
12  2.73000  0.79999  2792G  1941G   851G 69.51 1.00 392         osd.12
17  2.73000  0.84999  2792G  1669G  1122G 59.79 0.86 354         osd.17
13  2.73000  0.79999  2792G  1769G  1023G 63.36 0.92 369         osd.13
11  1.84999  1.00000  1861G  1601G   260G 86.03 1.24 315         osd.11
29  1.84998  1.00000  1861G  1593G   268G 85.59 1.24 332         osd.29
               TOTAL 61433G 42538G 18895G 69.24
MIN/MAX VAR: 0.86/1.24  STDDEV: 6.47

The usage on 11 and 29 decreased and on 15, 16 and 17 it increased. Only by tiny amounts, but the weighting change was tiny as well. Let's make a bigger change:
$ sudo ceph osd reweight 16 0.90
$ sudo ceph osd reweight 17 0.95
$ sudo ceph osd reweight 13 0.90

and the result:

ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42434G 18999G 69.07 1.00   0 root default
-2 20.34990        - 20477G 14098G  6379G 68.85 1.00   0     host md002
18  1.84999  1.00000  1861G  1207G   654G 64.85 0.94 244         osd.18
19  1.84999  1.00000  1861G  1270G   591G 68.25 0.99 265         osd.19
20  1.84999  1.00000  1861G  1459G   401G 78.42 1.14 300         osd.20
21  1.84999  1.00000  1861G  1213G   648G 65.19 0.94 266         osd.21
22  1.84999  1.00000  1861G  1220G   640G 65.59 0.95 252         osd.22
23  1.84999  1.00000  1861G  1213G   647G 65.20 0.94 253         osd.23
24  1.84999  1.00000  1861G  1314G   546G 70.63 1.02 257         osd.24
25  1.84999  1.00000  1861G  1433G   428G 77.00 1.11 278         osd.25
26  1.84999  1.00000  1861G  1225G   636G 65.83 0.95 262         osd.26
27  1.84999  1.00000  1861G  1243G   618G 66.78 0.97 251         osd.27
28  1.84999  1.00000  1861G  1295G   566G 69.60 1.01 254         osd.28
-3 20.34990        - 20477G 14230G  6246G 69.50 1.01   0     host md008
 0  1.84999  1.00000  1861G  1295G   565G 69.61 1.01 260         osd.0
 3  1.84999  1.00000  1861G  1343G   518G 72.17 1.04 257         osd.3
 5  1.84999  1.00000  1861G  1225G   636G 65.82 0.95 257         osd.5
 7  1.84999  1.00000  1861G  1409G   452G 75.71 1.10 298         osd.7
10  1.84999  1.00000  1861G  1339G   522G 71.93 1.04 252         osd.10
 1  1.84999  1.00000  1861G  1269G   592G 68.19 0.99 264         osd.1
 6  1.84999  1.00000  1861G  1496G   364G 80.39 1.16 269         osd.6
 8  1.84999  1.00000  1861G  1145G   715G 61.55 0.89 270         osd.8
 9  1.84999  1.00000  1861G  1204G   657G 64.68 0.94 242         osd.9
 2  1.84999  1.00000  1861G  1330G   531G 71.45 1.03 268         osd.2
 4  1.84999  1.00000  1861G  1171G   690G 62.93 0.91 245         osd.4
-5 20.07994        - 20478G 14105G  6373G 68.88 1.00   0     host md005
14  2.73000  0.84999  2792G  1864G   928G 66.76 0.97 378         osd.14
15  2.73000  0.84999  2792G  1720G  1071G 61.61 0.89 354         osd.15
16  2.73000  0.89999  2792G  1712G  1079G 61.33 0.89 368         osd.16
12  2.73000  0.79999  2792G  1890G   902G 67.69 0.98 376         osd.12
17  2.73000  0.95000  2792G  1870G   921G 66.99 0.97 395         osd.17
13  2.73000  0.89999  2792G  1942G   849G 69.57 1.01 400         osd.13
11  1.84999  1.00000  1861G  1511G   350G 81.18 1.18 301         osd.11
29  1.84998  1.00000  1861G  1592G   269G 85.55 1.24 318         osd.29
               TOTAL 61433G 42434G 18999G 69.07
MIN/MAX VAR: 0.89/1.24  STDDEV: 6.08

So, very slowly, we're getting there. Some more tweaking and…

    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            1 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18842, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107725: 30 osds: 30 up, 30 in
      pgmap v13170442: 2880 pgs, 5 pools, 13944 GB data, 3493 kobjects
            42348 GB used, 19085 GB / 61433 GB avail
                2879 active+clean
                   1 active+clean+scrubbing+deep
  client io 50325 kB/s rd, 13882 kB/s wr, 24 op/s rd, 13 op/s wr

$ sudo ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42348G 19085G 68.93 1.00   0 root default
-2 20.34990        - 20477G 14096G  6381G 68.84 1.00   0     host md002
18  1.84999  1.00000  1861G  1191G   670G 64.00 0.93 242         osd.18
19  1.84999  1.00000  1861G  1287G   573G 69.18 1.00 266         osd.19
20  1.84999  1.00000  1861G  1433G   428G 77.00 1.12 298         osd.20
21  1.84999  1.00000  1861G  1210G   650G 65.04 0.94 264         osd.21
22  1.84999  1.00000  1861G  1211G   649G 65.10 0.94 253         osd.22
23  1.84999  1.00000  1861G  1242G   619G 66.72 0.97 255         osd.23
24  1.84999  1.00000  1861G  1316G   544G 70.74 1.03 258         osd.24
25  1.84999  1.00000  1861G  1411G   450G 75.82 1.10 277         osd.25
26  1.84999  1.00000  1861G  1235G   626G 66.35 0.96 261         osd.26
27  1.84999  1.00000  1861G  1258G   602G 67.62 0.98 252         osd.27
28  1.84999  1.00000  1861G  1296G   565G 69.63 1.01 254         osd.28
-3 20.34990        - 20477G 14193G  6284G 69.31 1.01   0     host md008
 0  1.84999  1.00000  1861G  1298G   563G 69.76 1.01 260         osd.0
 3  1.84999  1.00000  1861G  1333G   528G 71.61 1.04 256         osd.3
 5  1.84999  1.00000  1861G  1224G   637G 65.77 0.95 256         osd.5
 7  1.84999  1.00000  1861G  1395G   466G 74.96 1.09 297         osd.7
10  1.84999  1.00000  1861G  1344G   517G 72.20 1.05 254         osd.10
 1  1.84999  1.00000  1861G  1269G   591G 68.22 0.99 264         osd.1
 6  1.84999  1.00000  1861G  1484G   376G 79.76 1.16 268         osd.6
 8  1.84999  1.00000  1861G  1136G   725G 61.02 0.89 270         osd.8
 9  1.84999  1.00000  1861G  1204G   656G 64.71 0.94 242         osd.9
 2  1.84999  1.00000  1861G  1330G   530G 71.48 1.04 268         osd.2
 4  1.84999  1.00000  1861G  1171G   689G 62.95 0.91 245         osd.4
-5 20.07994        - 20478G 14059G  6419G 68.65 1.00   0     host md005
14  2.73000  0.84999  2792G  1826G   966G 65.39 0.95 374         osd.14
15  2.73000  0.89999  2792G  1841G   951G 65.94 0.96 371         osd.15
16  2.73000  0.95000  2792G  1783G  1008G 63.87 0.93 380         osd.16
12  2.73000  0.79999  2792G  1833G   958G 65.66 0.95 368         osd.12
17  2.73000  0.95000  2792G  1769G  1023G 63.36 0.92 381         osd.17
13  2.73000  0.89999  2792G  1903G   889G 68.15 0.99 395         osd.13
11  1.84999  1.00000  1861G  1485G   375G 79.81 1.16 298         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.81 1.26 313         osd.29
               TOTAL 61433G 42348G 19085G 68.93
MIN/MAX VAR: 0.89/1.26  STDDEV: 5.86

So the problem of the 2 pg's that were active+undersized+degraded is now fixed. Still, there is 1 osd that's 'near full'.
The manual tuning has been fun, but it can also be done automatically:
 sudo ceph osd test-reweight-by-utilization
no change
moved 146 / 8640 (1.68981%)
avg 288
stddev 48.1311 -> 50.5021 (expected baseline 16.6853)
min osd.9 with 242 -> 245 pgs (0.840278 -> 0.850694 * mean)
max osd.13 with 395 -> 383 pgs (1.37153 -> 1.32986 * mean)

oload 120
max_change 0.05
max_change_osds 4
average 0.689340
overload 0.827209
osd.29 weight 1.000000 -> 0.950012
osd.17 weight 0.949997 -> 0.999985
osd.16 weight 0.949997 -> 0.999985
osd.14 weight 0.849991 -> 0.896011

That looks like a sensible change. Let's apply it.
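Applying it should be the same command without the test prefix (120 being the same oload threshold shown in the test output above):

$ sudo ceph osd reweight-by-utilization 120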
And then the result is…
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_OK
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18870, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e108690: 30 osds: 30 up, 30 in
            flags sortbitwise
      pgmap v13455986: 2880 pgs, 5 pools, 14064 GB data, 3523 kobjects
            42709 GB used, 18724 GB / 61433 GB avail
                2879 active+clean
                   1 active+clean+scrubbing+deep


Looking good, all done!

Sunday, April 24, 2016

Why I moved away from BackupPC for large data sets

In a previous post (http://shootingbits.blogspot.nl/2014/07/migrating-backuppc-from-ext4-to.html) I explained how our BackupPC storage was becoming too small and was having problems. We decided to switch to GlusterFS so we could scale more easily.
We did all that, and it worked. Or so it seemed. GlusterFS is a filesystem on top of a filesystem: an additional layer over the normal filesystem that decides where to place files (whole files, not parts of them) and that can rebalance the placement of those files and check for consistency. But because you (and therefore also your scripts, the OS, etc.) still have access to the underlying filesystem, things can change behind GlusterFS's back. That means the consistency of a GlusterFS volume can't be guaranteed at any point in time. As long as you're very careful to always access the storage through Gluster, you should be fine. Even though we did that, we still ended up with some weird situations:
- Having BackupPC create hard links on Gluster was not working properly. The Perl call used by BackupPC seemed to be incompatible with GlusterFS at the time, and I had to change the way BackupPC creates hard links to get it to work. I've created a bug report at Gluster (https://bugzilla.redhat.com/show_bug.cgi?id=1156022) but, as you'll read later, it's no longer relevant for us.
- After deleting data on GlusterFS, no space was freed. Even when the last file of a set of hard links was deleted, the space did not become available. That meant we would need to keep adding storage, forever.
- Moving to GlusterFS was hell. Not because of Gluster, but because of the way BackupPC stores data: every file is stored as a file, and if an identical file already exists a hardlink is created instead. In theory this is very efficient, especially for large files. But since we also have tens of millions of small files, transferring the data from our old storage to our new storage was very heavily bound by iops: each file essentially becomes at least one iop. And with tens of millions of files stored per backup run, and many backups, that means hundreds of millions of files or links. With the data being read from a 12-disk raid-5 or raid-6 set of 7200rpm disks (which does about 500 iops max for random io) it took weeks to transfer the data, even though it was only about 15TB; a rough calculation of why follows below.
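A back-of-the-envelope check (the 300 million figure is just an assumption in the ballpark of 'hundreds of millions'; a real transfer needs several iops per file, so this is optimistic):

$ echo "300000000 / 500 / 3600 / 24" | bc    # files / iops -> days, at one iop per file
6

So about a week in the absolute best case, and with multiple iops per file (read, create, link) it quickly turns into weeks.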

I wanted to replace BackupPC with a block-based backup solution that stores the backup not as one file per file but in archives. We ended up with Bareos (www.bareos.com) and it is working well for us. Of course it has its own weird things and limitations, but they are orders of magnitude less bad than BackupPC was for us. Mind you, BackupPC 3 is still a great tool and may work fine for your situation if you don't throw millions of files at it.

And we decided to ditch GlusterFS and move to Ceph instead. Ceph is a scalable, distributed storage solution that can be used for OpenStack but runs fine on its own. It is in active development and was acquired by Red Hat, which is also the owner of GlusterFS. My suspicion is that Ceph will replace GlusterFS. It works much better for us, although we had large performance problems running rados block devices with version 0.94 (hammer) on btrfs osd's. Once we replaced btrfs with xfs it became much more stable (osd's were no longer dropping out that often), and once we moved to 9.2.0 (the next release, infernalis) things became very stable.
We're not getting 10gbit throughput even though the infrastructure is there, the network is tuned for it and the disks and cpu's are not maxed out, but it's good enough for our needs. We'd like to achieve better throughput, but it's not worth the effort currently to dive in very deeply to figure out why we're not getting it. I suspect Ceph is designed for many parallel readers/writers and therefore not optimised for a single user wanting to suck 10gbit/sec out of it. We'll figure it out in the future.

And because we're now running Bareos as our main backup tool, which stores all the files in archives, syncing to our offsite backup solution has become much easier. We no longer have to transfer hundreds of millions of files there, only a number of 'large' archive files. That allows us to achieve a much higher throughput and places much lower iops requirements on that side as well.

It's been a loooong journey to get here, with many lessons learned. But we now have a stable backup solution, hopefully for years to come.

Friday, December 18, 2015

Fiio Q1 review

This is just a very quick first impression of what I think of the Fiio Q1 after receiving it and testing it to make sure it's not broken before it goes under the Christmas tree.
The package is a bit small and decent but not great. It is definitely not giving me the feeling I just spent €90 on this thing. Unpacking shows the device is a bit smaller than I expected and very much lighter. It's so light that it feels cheap, despite having a good touch to it. The assembly also looks a bit cheap. Again, not a great start.
I bought it because the sound from my Dell 980sff is total crap, and my airbook isn't really giving the sound I'm hoping for either. I was using an Asus Xonar U1, but my headphones are way too loud with it: I need to turn the volume on the computer down to the lowest level in software first, and then also turn the U1 all the way down. This means I lose a lot of fidelity and it's still slightly too loud. It seems the Fiio Q1 doesn't do much better, because I also need to keep the volume knob all the way down (even in low gain mode), although the software volume on the machine can stay at the maximum (which is good for fidelity). So the Q1 is an improvement for me over the U1, but whether it's enough is something I need to spend more time on. And I haven't really listened to the sound yet, so I have no idea about sound quality either.
That will all happen soon after Christmas, I hope.

Update 2015-12-25
It's Christmas and I found myself with half an hour to do some comparison listening. I used FLACs at 44.1kHz or 48kHz, 16 bits, played with foobar2000 in WASAPI event mode to get the data to the Fiio. I listened with a Sennheiser HD558 and an AKG K550. I'm comparing the sound to a Realtek onboard audio chip, the ALC892 (http://www.realtek.com.tw/products/productsView.aspx?Langid=1&PFid=28&Level=5&Conn=4&ProdID=284), and to the O2 dac-amp. Although I prefer to have a look at measurements when deciding on a new audio device, doing proper measurements is hard, and interpreting them is at least as difficult. Have a read at http://nwavguy.blogspot.nl/2011/02/testing-methods.html to see what I mean.
Because I don't have expensive measurement gear at my disposal, all I can go by is my ears. For me that's in the end the most important thing anyway, because that's what I'm using to listen to and enjoy the music. If I happen to like a device with very bad measurements, so be it. But at least I'll be honest about it :)
Anyway, back to the listening test. I don't like to use vague words to describe what I'm hearing, but since there isn't a clear standard for describing differences in this area, I think I'll have to. I'll try to keep it 'real'. The Fiio Q1 has more air in the reproduced sound than the ALC892. It's slightly easier to discern the separate instruments from each other, but the placement remains the same; there is just a bit more room between the instruments. It sounds less smudged or compacted. Because of this it sounds a bit more 'studio' or 'analytical'. I'm not hearing much more detail in the music, maybe a tiny little bit. There is also a little bit less depth and fun in the music; it's like the improved clarity has removed some of that.
In the end, the most pronounced difference I heard was with Michael Nyman's 'The Heart Asks Pleasure First', where the Fiio makes the music come more alive.

All of this is much less of a difference than I heard with the O2 dac-amp, which gave me a real 'wow' effect compared to the ALC892. The Fiio Q1 doesn't do that, but it's an improvement over 'standard' audio, although not by much. The O2 dac-amp is the real winner here, and only slightly more expensive when you build your own. The Q1 however is much smaller, lighter and less finicky (I had to repair my O2 twice already, but that's DIY). And the Q1 is nicer to look at. If you go for sound though, get the O2. It gives the feeling of really being there when listening to the recording. The Q1 really sounds like a recording. That's great for listening on the go or as background music, but when actively listening, the O2 has better sound to my ears.

I was unable to find any 'real' measurements on the internet for the Q1 so far (I'm not talking about the data presented by Fiio, nor about any home user using RMAA). I'd be very interested to check whether what I think I'm hearing is backed up by data, or whether I'm simply affected by subjective bias (http://nwavguy.blogspot.nl/2011/03/dac-listening-challenge-results.html). Please let me know.

Wednesday, October 1, 2014

Measurements while building the O2 objective headphone amp

While building the (amazing) O2 objective amp and dac I did some measurements before powering up the unit. This is strongly advised by the build guides because it lets you detect problems and mistakes before applying power, hopefully preventing costly burnouts or, worse, explosions.
Unfortunately most of the measurements I made were way off from the ones listed in the guides. Here is what I measured:
R1&R2 476 instead of 100-220
R5 191k instead of 100k
R15/18/10/11 240 instead of 1
R16/22 1.55k instead of 1.3k
R7/14/3/20 500 instead of 100-300
(all values are in Ohms)

The measurement device I use is a somewhat expensive Voltcraft digital multimeter with a high sampling rate. My suspicion is that this specific multimeter injects a current while measuring ohms without letting the circuit discharge before taking the next sample; combined with the capacitance in the circuit, that could result in resistance readings that are too high. Unfortunately I did not have a different (digital) multimeter available to cross-check with. After checking and triple-checking the placement of the components and the solder joints I decided to try the unit.
And it works beautifully!

I compared it to all the amplifiers and dacs I have and for its cheap price it could only be matched by equipment 4 times its price or above. I find that to be very good value for money.

I hope this post helps if ever you build your own O2.