Wednesday, May 25, 2016

Fixing a ceph imbalance

At 50-60% full, our ceph cluster was already reporting 1 (and sometimes 2) osd's as "near full". That's odd for a cluster that's only 50-60% used. Worse, there were 2 pg's in a degraded state because the cluster could not write the 3rd replica. Maybe those problems are related? This situation started after we created a new pool with two images in it, back when we were still on infernalis.
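To see exactly which osd's and pg's are involved, ceph can list the offenders directly; for example (output omitted here):

$ sudo ceph health detail            # names the near-full osd's and the degraded pg's
$ sudo ceph pg dump_stuck unclean    # lists the pg's that are stuck unclean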

The details (ceph status output):
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            2 pgs degraded
            2 pgs stuck unclean
            2 pgs undersized
            recovery 696/10319262 objects degraded (0.007%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18824, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e105351: 30 osds: 30 up, 30 in
      pgmap v13008499: 2880 pgs, 5 pools, 13406 GB data, 3359 kobjects
            40727 GB used, 20706 GB / 61433 GB avail
            696/10319262 objects degraded (0.007%)
                2878 active+clean
                   2 active+undersized+degraded


ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.92989 root default
-2 22.20000     host md002
18  1.84999         osd.18      up  1.00000          1.00000
19  1.84999         osd.19      up  1.00000          1.00000
20  1.84999         osd.20      up  1.00000          1.00000
21  1.84999         osd.21      up  1.00000          1.00000
22  1.84999         osd.22      up  1.00000          1.00000
23  1.84999         osd.23      up  1.00000          1.00000
24  1.84999         osd.24      up  1.00000          1.00000
25  1.84999         osd.25      up  1.00000          1.00000
26  1.84999         osd.26      up  1.00000          1.00000
27  1.84999         osd.27      up  1.00000          1.00000
28  1.84999         osd.28      up  1.00000          1.00000
29   1.84999        osd.29      up  1.00000          1.00000
-3 22.20000     host md008
 0  1.84999         osd.0       up  1.00000          1.00000
 3  1.84999         osd.3       up  1.00000          1.00000
 5  1.84999         osd.5       up  1.00000          1.00000
 7  1.84999         osd.7       up  1.00000          1.00000
11  1.84999         osd.11      up  1.00000          1.00000
10  1.84999         osd.10      up  1.00000          1.00000
 1  1.84999         osd.1       up  1.00000          1.00000
 6  1.84999         osd.6       up  1.00000          1.00000
 8  1.84999         osd.8       up  1.00000          1.00000
 9  1.84999         osd.9       up  1.00000          1.00000
 2  1.84999         osd.2       up  1.00000          1.00000
 4  1.84999         osd.4       up  1.00000          1.00000
-5 16.37999     host md005
14  2.73000         osd.14      up  0.79999          1.00000
15  2.73000         osd.15      up  0.79999          1.00000
16  2.73000         osd.16      up  0.79999          1.00000
12  2.73000         osd.12      up  0.79999          1.00000
17  2.73000         osd.17      up  0.79999          1.00000
13  2.73000         osd.13      up  0.79999          1.00000

The osd tree shows that we have 3 machines, two of them with about 22TB and one with about 16TB. Of course that means that with a replication count of three we can only store about 16TB if we want each pg to be placed on a separate machine. Currently we are using about 13.4TB, and that's about 13.4/16 = 83% of the available capacity. That's getting close to being a problem.

Because not all pg's are of equal size (it seems), some disks are also used a lot more than others:
[root@md010 ~]# sudo ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
18 1.84999  1.00000  1861G  1166G   695G 62.64 0.97 238
19 1.84999  1.00000  1861G  1075G   785G 57.78 0.89 223
20 1.84999  1.00000  1861G  1160G   701G 62.32 0.96 278
21 1.84999  1.00000  1861G   972G   888G 52.26 0.81 230
22 1.84999  1.00000  1861G  1073G   788G 57.65 0.89 235
23 1.84999  1.00000  1861G  1077G   784G 57.87 0.89 242
24 1.84999  1.00000  1861G  1135G   725G 61.00 0.94 226
25 1.84999  1.00000  1861G  1154G   707G 62.01 0.96 245
26 1.84999  1.00000  1861G  1096G   764G 58.91 0.91 243
27 1.84999  1.00000  1861G  1080G   781G 58.03 0.90 234
28 1.84999  1.00000  1861G  1036G   825G 55.66 0.86 237
29 1.84999  1.00000  1861G  1224G   637G 65.76 1.02 249
 0 1.84999  1.00000  1861G  1146G   714G 61.61 0.95 242
 3 1.84999  1.00000  1861G  1101G   760G 59.16 0.91 237
 5 1.84999  1.00000  1861G  1122G   739G 60.29 0.93 246
 7 1.84999  1.00000  1861G  1128G   732G 60.64 0.94 249
11 1.84999  1.00000  1861G  1115G   746G 59.92 0.93 226
10 1.84999  1.00000  1861G  1127G   733G 60.58 0.94 229
 1 1.84999  1.00000  1861G  1090G   771G 58.57 0.90 239
 6 1.84999  1.00000  1861G  1248G   612G 67.09 1.04 241
 8 1.84999  1.00000  1861G   902G   959G 48.47 0.75 243
 9 1.84999  1.00000  1861G   987G   874G 53.04 0.82 217
 2 1.84999  1.00000  1861G  1252G   608G 67.31 1.04 259
 4 1.84999  1.00000  1861G  1125G   735G 60.47 0.93 252
14 2.73000  0.79999  2792G  2321G   471G 83.13 1.28 489
15 2.73000  0.79999  2792G  2142G   649G 76.73 1.19 474
16 2.73000  0.79999  2792G  2081G   711G 74.52 1.15 460
12 2.73000  0.79999  2792G  2375G   416G 85.08 1.31 494
17 2.73000  0.79999  2792G  1947G   845G 69.72 1.08 462
13 2.73000  0.79999  2792G  2307G   484G 82.64 1.28 499
              TOTAL 61433G 39778G 21655G 64.75
MIN/MAX VAR: 0.75/1.31  STDDEV: 8.65

So indeed, some disks on md005 (the host that only has 16TB) are already 85% full.

Since the 3rd machine has some free disk slots, I decided to move one disk each from md002 and md008 to md005. That reduces the space on md002 and md008 and adds it to md005, so each node will then have about 20TB. That should increase the maximum capacity we can store from 16TB to 20TB and decrease the used percentage from 83% to 13.4/20 = 67%. Much better already.
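A quick sanity check of those percentages (just bc arithmetic on the rounded numbers above):

$ echo "scale=1; 13.4*100/16" | bc    # before: usable space limited by the 16TB host
83.7
$ echo "scale=1; 13.4*100/20" | bc    # after: every host has about 20TB
67.0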

Doing this was easy:
stop the osd
unmount it
identify the physical disk (map the osd nr to the device id to the virtual disk to the physical disk and set blinking on)
move it physically
import the foreign disk on md005
mount it
start the osd
and repeat for the other osd (a rough sketch of the commands follows below).
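This is roughly what that looked like on the command line, per osd. It's a sketch only: the device name is a placeholder, it assumes the systemd-managed osds we had on infernalis, and the RAID-controller steps (blinking the drive, importing the foreign config) are vendor-specific and omitted:

# on the source host, with noout set so the cluster doesn't start re-replicating
# while the disk is in transit
$ sudo ceph osd set noout
$ sudo systemctl stop ceph-osd@29
$ sudo umount /var/lib/ceph/osd/ceph-29

# after physically moving the disk and importing it on md005's controller:
$ sudo ceph-disk activate /dev/sdX1    # mounts the data partition and starts the osd
$ sudo ceph osd unset noout
$ sudo ceph osd tree                   # with the default 'osd crush update on start = true',
                                       # the osd should now appear under host md005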

The osd tree then looks like this:
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 60.77974 root default
-2 20.34990     host md002
18  1.84999         osd.18      up  1.00000          1.00000
19  1.84999         osd.19      up  1.00000          1.00000
20  1.84999         osd.20      up  1.00000          1.00000
21  1.84999         osd.21      up  1.00000          1.00000
22  1.84999         osd.22      up  1.00000          1.00000
23  1.84999         osd.23      up  1.00000          1.00000
24  1.84999         osd.24      up  1.00000          1.00000
25  1.84999         osd.25      up  1.00000          1.00000
26  1.84999         osd.26      up  1.00000          1.00000
27  1.84999         osd.27      up  1.00000          1.00000
28  1.84999         osd.28      up  1.00000          1.00000
-3 20.34990     host md008
 0  1.84999         osd.0       up  1.00000          1.00000
 3  1.84999         osd.3       up  1.00000          1.00000
 5  1.84999         osd.5       up  1.00000          1.00000
 7  1.84999         osd.7       up  1.00000          1.00000
10  1.84999         osd.10      up  1.00000          1.00000
 1  1.84999         osd.1       up  1.00000          1.00000
 6  1.84999         osd.6       up  1.00000          1.00000
 8  1.84999         osd.8       up  1.00000          1.00000
 9  1.84999         osd.9       up  1.00000          1.00000
 2  1.84999         osd.2       up  1.00000          1.00000
 4  1.84999         osd.4       up  1.00000          1.00000
-5 20.07994     host md005
14  2.73000         osd.14      up  0.79999          1.00000
15  2.73000         osd.15      up  0.79999          1.00000
16  2.73000         osd.16      up  0.79999          1.00000
12  2.73000         osd.12      up  0.79999          1.00000
17  2.73000         osd.17      up  0.79999          1.00000
13  2.73000         osd.13      up  0.79999          1.00000
11  1.84999         osd.11      up  1.00000          1.00000
29  1.84998         osd.29      up  1.00000          1.00000

Since two osd's have physically moved to md005, some pg's may now have two replicas on that host, so a lot of data needs to move and everything will rebalance. This is what that looks like:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            13 pgs backfill_toofull
            214 pgs backfill_wait
            71 pgs backfilling
            738 pgs degraded
            52 pgs recovering
            610 pgs recovery_wait
            876 pgs stuck unclean
            76 pgs undersized
            recovery 774247/11332909 objects degraded (6.832%)
            recovery 2100222/11332909 objects misplaced (18.532%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18834, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e105775: 30 osds: 30 up, 30 in; 292 remapped pgs
      pgmap v13024840: 2880 pgs, 5 pools, 13490 GB data, 3380 kobjects
            41515 GB used, 19918 GB / 61433 GB avail
            774247/11332909 objects degraded (6.832%)
            2100222/11332909 objects misplaced (18.532%)
                1933 active+clean
                 603 active+recovery_wait+degraded
                 171 active+remapped+wait_backfill
                  52 active+recovering+degraded
                  45 active+undersized+degraded+remapped+backfilling
                  30 active+undersized+degraded+remapped+wait_backfill
                  26 active+remapped+backfilling
                  12 active+remapped+wait_backfill+backfill_toofull
                   7 active+recovery_wait+degraded+remapped
                   1 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
recovery io 351 MB/s, 88 objects/s
  client io 1614 kB/s rd, 158 kB/s wr, 0 op/s rd, 0 op/s wr


during the recovery:
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
18 1.84999  1.00000  1861G  1311G   549G 70.47 1.04 257
19 1.84999  1.00000  1861G  1189G   671G 63.92 0.94 268
20 1.84999  1.00000  1861G  1360G   501G 73.06 1.08 307
21 1.84999  1.00000  1861G  1217G   644G 65.39 0.96 279
22 1.84999  1.00000  1861G  1254G   606G 67.41 0.99 263
23 1.84999  1.00000  1861G  1184G   676G 63.64 0.94 262
24 1.84999  1.00000  1861G  1245G   616G 66.90 0.99 264
25 1.84999  1.00000  1861G  1292G   568G 69.44 1.02 285
26 1.84999  1.00000  1861G  1239G   621G 66.61 0.98 271
27 1.84999  1.00000  1861G  1210G   651G 65.01 0.96 259
28 1.84999  1.00000  1861G  1204G   656G 64.73 0.95 259
 0 1.84999  1.00000  1861G  1218G   643G 65.43 0.96 272
 3 1.84999  1.00000  1861G  1177G   684G 63.25 0.93 262
 5 1.84999  1.00000  1861G  1175G   685G 63.17 0.93 265
 7 1.84999  1.00000  1861G  1270G   591G 68.24 1.01 307
10 1.84999  1.00000  1861G  1162G   699G 62.44 0.92 262
 1 1.84999  1.00000  1861G  1200G   660G 64.51 0.95 274
 6 1.84999  1.00000  1861G  1271G   590G 68.28 1.01 277
 8 1.84999  1.00000  1861G  1024G   836G 55.04 0.81 273
 9 1.84999  1.00000  1861G  1101G   760G 59.16 0.87 250
 2 1.84999  1.00000  1861G  1291G   570G 69.38 1.02 276
 4 1.84999  1.00000  1861G  1167G   694G 62.70 0.92 264
14 2.73000  0.79999  2792G  2363G   428G 84.64 1.25 429
15 2.73000  0.79999  2792G  2189G   603G 78.39 1.15 398
16 2.73000  0.79999  2792G  2169G   622G 77.70 1.14 399
12 2.73000  0.79999  2792G  2388G   404G 85.52 1.26 452
17 2.73000  0.79999  2792G  2072G   720G 74.21 1.09 385
13 2.73000  0.79999  2792G  2321G   471G 83.13 1.22 425
11 1.84999  1.00000  1861G  1071G   790G 57.54 0.85 348
29 1.84998  1.00000  1861G   354G  1506G 19.06 0.28 353
              TOTAL 61433G 41703G 19730G 67.88
MIN/MAX VAR: 0.28/1.26  STDDEV: 11.50


And once it was done, the result was rather surprising:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            13 pgs backfill_toofull
            6 pgs backfilling
            19 pgs stuck unclean
            recovery 120/10824993 objects degraded (0.001%)
            recovery 184018/10824993 objects misplaced (1.700%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107178: 30 osds: 30 up, 30 in; 19 remapped pgs
      pgmap v13145938: 2880 pgs, 5 pools, 13936 GB data, 3491 kobjects
            42471 GB used, 18962 GB / 61433 GB avail
            120/10824993 objects degraded (0.001%)
            184018/10824993 objects misplaced (1.700%)
                2857 active+clean
                  13 active+remapped+backfill_toofull
                   6 active+remapped+backfilling
                   4 active+clean+scrubbing

So apparently, despite the better balance, and despite the fact that we should now be able to store more, we actually have more pg's with problems than before.
When taking a closer look at the individual osd’s:

$ sudo ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42403G 19030G 69.02 1.00   0 root default
-2 20.34990        - 20477G 14093G  6383G 68.83 1.00   0     host md002
18  1.84999  1.00000  1861G  1250G   611G 67.16 0.97 247         osd.18
19  1.84999  1.00000  1861G  1248G   613G 67.04 0.97 259         osd.19
20  1.84999  1.00000  1861G  1446G   415G 77.68 1.13 301         osd.20
21  1.84999  1.00000  1861G  1246G   615G 66.96 0.97 271         osd.21
22  1.84999  1.00000  1861G  1197G   664G 64.32 0.93 253         osd.22
23  1.84999  1.00000  1861G  1221G   640G 65.61 0.95 256         osd.23
24  1.84999  1.00000  1861G  1312G   548G 70.51 1.02 257         osd.24
25  1.84999  1.00000  1861G  1434G   426G 77.08 1.12 279         osd.25
26  1.84999  1.00000  1861G  1233G   627G 66.27 0.96 261         osd.26
27  1.84999  1.00000  1861G  1215G   645G 65.31 0.95 248         osd.27
28  1.84999  1.00000  1861G  1287G   574G 69.15 1.00 255         osd.28
-3 20.34990        - 20477G 14215G  6261G 69.42 1.01   0     host md008
 0  1.84999  1.00000  1861G  1305G   555G 70.13 1.02 265         osd.0
 3  1.84999  1.00000  1861G  1324G   537G 71.13 1.03 257         osd.3
 5  1.84999  1.00000  1861G  1210G   650G 65.05 0.94 255         osd.5
 7  1.84999  1.00000  1861G  1402G   459G 75.34 1.09 295         osd.7
10  1.84999  1.00000  1861G  1365G   496G 73.33 1.06 254         osd.10
 1  1.84999  1.00000  1861G  1260G   600G 67.73 0.98 267         osd.1
 6  1.84999  1.00000  1861G  1502G   358G 80.73 1.17 272         osd.6
 8  1.84999  1.00000  1861G  1144G   717G 61.45 0.89 269         osd.8
 9  1.84999  1.00000  1861G  1176G   685G 63.18 0.92 240         osd.9
 2  1.84999  1.00000  1861G  1328G   532G 71.38 1.03 266         osd.2
 4  1.84999  1.00000  1861G  1194G   667G 64.17 0.93 251         osd.4
-5 20.07994        - 20478G 14093G  6385G 68.82 1.00   0     host md005
14  2.73000  0.84999  2792G  1861G   931G 66.64 0.97 392         osd.14
15  2.73000  0.79999  2792G  1769G  1022G 63.37 0.92 354         osd.15
16  2.73000  0.79999  2792G  1722G  1070G 61.68 0.89 353         osd.16
12  2.73000  0.79999  2792G  2016G   775G 72.22 1.05 405         osd.12
17  2.73000  0.79999  2792G  1647G  1145G 59.00 0.85 349         osd.17
13  2.73000  0.79999  2792G  1832G   960G 65.60 0.95 378         osd.13
11  1.84999  1.00000  1861G  1627G   233G 87.44 1.27 326         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.83 1.26 345         osd.29
               TOTAL 61433G 42403G 19030G 69.02
MIN/MAX VAR: 0.85/1.27  STDDEV: 6.91

We can see that the disks that were moved to md005 (osd's 11 and 29) are now almost full, while the other disks in md005, which are much larger, aren't used nearly as much. We can also see that the larger disks have a lower reweight, so less data goes to them; that is probably no longer needed. For the least used osd's in md005 I'll increase the reweight so more data will go there.

There are 2 ways to reweight:
ceph osd crush reweight
ceph osd reweight
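Their syntax differs roughly like this (the values here are only examples, not commands I ran):

$ sudo ceph osd crush reweight osd.14 2.73    # changes the crush weight (by convention roughly the disk size in TB)
$ sudo ceph osd reweight 14 0.85              # sets a 0-1 override on top of the crush weight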

Since our osd reweight values are not at the default '1', I'll bring those closer to 1. Also because that setting is apparently not persistent: when an osd goes 'out', the setting is lost. More information on that can be found here: http://ceph.com/planet/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight/

$ sudo ceph osd reweight 15 0.85
reweighted osd.15 to 0.85 (d999)
$ sudo ceph osd reweight 16 0.85
reweighted osd.16 to 0.85 (d999)
$ sudo ceph osd reweight 17 0.85
reweighted osd.17 to 0.85 (d999)

I only reweighted the 3 osd's that had the least data on them, and only increased the weight by 0.05. I did this because I wanted to see the effects before taking the next step, and increasing by only 0.05 shouldn't take too many hours to re-place all the pg's that now map somewhere else.

    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            clock skew detected on mon.md010
            16 pgs backfill_toofull
            15 pgs backfill_wait
            22 pgs backfilling
            51 pgs degraded
            23 pgs recovering
            25 pgs recovery_wait
            56 pgs stuck unclean
            recovery 48630/10962135 objects degraded (0.444%)
            recovery 458174/10962135 objects misplaced (4.180%)
            2 near full osd(s)
            Monitor clock skew detected
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107399: 30 osds: 30 up, 30 in; 51 remapped pgs
      pgmap v13147453: 2880 pgs, 5 pools, 13937 GB data, 3491 kobjects
            42445 GB used, 18988 GB / 61433 GB avail
            48630/10962135 objects degraded (0.444%)
            458174/10962135 objects misplaced (4.180%)
                2778 active+clean
                  25 active+recovery_wait+degraded
                  23 active+recovering+degraded
                  22 active+remapped+backfilling
                  13 active+remapped+backfill_toofull
                  12 active+remapped+wait_backfill
                   3 active+degraded
                   3 active+remapped+wait_backfill+backfill_toofull
                   1 active+remapped
recovery io 361 MB/s, 90 objects/s
  client io 3611 kB/s wr, 0 op/s rd, 3 op/s wr

See? Only a small number of pg's is actually involved. Hopefully this change will result in osd's 11 and 29 becoming a bit less utilised and numbers 15 through 17 a bit more.

<some time later>

Indeed it made the situation slightly less bad:
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42538G 18895G 69.24 1.00   0 root default
-2 20.34990        - 20477G 14123G  6353G 68.97 1.00   0     host md002
18  1.84999  1.00000  1861G  1237G   624G 66.47 0.96 246         osd.18
19  1.84999  1.00000  1861G  1243G   617G 66.81 0.96 261         osd.19
20  1.84999  1.00000  1861G  1472G   389G 79.09 1.14 303         osd.20
21  1.84999  1.00000  1861G  1241G   619G 66.70 0.96 268         osd.21
22  1.84999  1.00000  1861G  1220G   640G 65.58 0.95 254         osd.22
23  1.84999  1.00000  1861G  1207G   654G 64.85 0.94 252         osd.23
24  1.84999  1.00000  1861G  1327G   533G 71.32 1.03 258         osd.24
25  1.84999  1.00000  1861G  1434G   426G 77.08 1.11 279         osd.25
26  1.84999  1.00000  1861G  1221G   639G 65.64 0.95 262         osd.26
27  1.84999  1.00000  1861G  1228G   632G 66.01 0.95 249         osd.27
28  1.84999  1.00000  1861G  1287G   574G 69.14 1.00 255         osd.28
-3 20.34990        - 20477G 14247G  6230G 69.58 1.00   0     host md008
 0  1.84999  1.00000  1861G  1293G   568G 69.48 1.00 262         osd.0
 3  1.84999  1.00000  1861G  1326G   535G 71.25 1.03 258         osd.3
 5  1.84999  1.00000  1861G  1234G   627G 66.29 0.96 256         osd.5
 7  1.84999  1.00000  1861G  1406G   455G 75.53 1.09 296         osd.7
10  1.84999  1.00000  1861G  1353G   507G 72.73 1.05 254         osd.10
 1  1.84999  1.00000  1861G  1274G   587G 68.46 0.99 267         osd.1
 6  1.84999  1.00000  1861G  1485G   375G 79.80 1.15 269         osd.6
 8  1.84999  1.00000  1861G  1144G   717G 61.46 0.89 269         osd.8
 9  1.84999  1.00000  1861G  1204G   657G 64.69 0.93 242         osd.9
 2  1.84999  1.00000  1861G  1331G   529G 71.54 1.03 266         osd.2
 4  1.84999  1.00000  1861G  1193G   668G 64.09 0.93 250         osd.4
-5 20.07994        - 20478G 14167G  6311G 69.18 1.00   0     host md005
14  2.73000  0.84999  2792G  1961G   831G 70.23 1.01 398         osd.14
15  2.73000  0.84999  2792G  1850G   942G 66.25 0.96 372         osd.15
16  2.73000  0.84999  2792G  1780G  1012G 63.76 0.92 366         osd.16
12  2.73000  0.79999  2792G  1941G   851G 69.51 1.00 392         osd.12
17  2.73000  0.84999  2792G  1669G  1122G 59.79 0.86 354         osd.17
13  2.73000  0.79999  2792G  1769G  1023G 63.36 0.92 369         osd.13
11  1.84999  1.00000  1861G  1601G   260G 86.03 1.24 315         osd.11
29  1.84998  1.00000  1861G  1593G   268G 85.59 1.24 332         osd.29
               TOTAL 61433G 42538G 18895G 69.24
MIN/MAX VAR: 0.86/1.24  STDDEV: 6.47

The usage on 11 and 29 decreased and on 15, 16 and 17 it increased. Only by tiny amounts, but the weighting change was tiny as well. Let's make a bigger change:
$ sudo ceph osd reweight 16 0.90
$ sudo ceph osd reweight 17 0.95
$ sudo ceph osd reweight 13 0.90

and the result:

ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42434G 18999G 69.07 1.00   0 root default
-2 20.34990        - 20477G 14098G  6379G 68.85 1.00   0     host md002
18  1.84999  1.00000  1861G  1207G   654G 64.85 0.94 244         osd.18
19  1.84999  1.00000  1861G  1270G   591G 68.25 0.99 265         osd.19
20  1.84999  1.00000  1861G  1459G   401G 78.42 1.14 300         osd.20
21  1.84999  1.00000  1861G  1213G   648G 65.19 0.94 266         osd.21
22  1.84999  1.00000  1861G  1220G   640G 65.59 0.95 252         osd.22
23  1.84999  1.00000  1861G  1213G   647G 65.20 0.94 253         osd.23
24  1.84999  1.00000  1861G  1314G   546G 70.63 1.02 257         osd.24
25  1.84999  1.00000  1861G  1433G   428G 77.00 1.11 278         osd.25
26  1.84999  1.00000  1861G  1225G   636G 65.83 0.95 262         osd.26
27  1.84999  1.00000  1861G  1243G   618G 66.78 0.97 251         osd.27
28  1.84999  1.00000  1861G  1295G   566G 69.60 1.01 254         osd.28
-3 20.34990        - 20477G 14230G  6246G 69.50 1.01   0     host md008
 0  1.84999  1.00000  1861G  1295G   565G 69.61 1.01 260         osd.0
 3  1.84999  1.00000  1861G  1343G   518G 72.17 1.04 257         osd.3
 5  1.84999  1.00000  1861G  1225G   636G 65.82 0.95 257         osd.5
 7  1.84999  1.00000  1861G  1409G   452G 75.71 1.10 298         osd.7
10  1.84999  1.00000  1861G  1339G   522G 71.93 1.04 252         osd.10
 1  1.84999  1.00000  1861G  1269G   592G 68.19 0.99 264         osd.1
 6  1.84999  1.00000  1861G  1496G   364G 80.39 1.16 269         osd.6
 8  1.84999  1.00000  1861G  1145G   715G 61.55 0.89 270         osd.8
 9  1.84999  1.00000  1861G  1204G   657G 64.68 0.94 242         osd.9
 2  1.84999  1.00000  1861G  1330G   531G 71.45 1.03 268         osd.2
 4  1.84999  1.00000  1861G  1171G   690G 62.93 0.91 245         osd.4
-5 20.07994        - 20478G 14105G  6373G 68.88 1.00   0     host md005
14  2.73000  0.84999  2792G  1864G   928G 66.76 0.97 378         osd.14
15  2.73000  0.84999  2792G  1720G  1071G 61.61 0.89 354         osd.15
16  2.73000  0.89999  2792G  1712G  1079G 61.33 0.89 368         osd.16
12  2.73000  0.79999  2792G  1890G   902G 67.69 0.98 376         osd.12
17  2.73000  0.95000  2792G  1870G   921G 66.99 0.97 395         osd.17
13  2.73000  0.89999  2792G  1942G   849G 69.57 1.01 400         osd.13
11  1.84999  1.00000  1861G  1511G   350G 81.18 1.18 301         osd.11
29  1.84998  1.00000  1861G  1592G   269G 85.55 1.24 318         osd.29
               TOTAL 61433G 42434G 18999G 69.07
MIN/MAX VAR: 0.89/1.24  STDDEV: 6.08

So, very slowly, we're getting there. Some more tweaking and…

    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            1 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18842, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107725: 30 osds: 30 up, 30 in
      pgmap v13170442: 2880 pgs, 5 pools, 13944 GB data, 3493 kobjects
            42348 GB used, 19085 GB / 61433 GB avail
                2879 active+clean
                   1 active+clean+scrubbing+deep
  client io 50325 kB/s rd, 13882 kB/s wr, 24 op/s rd, 13 op/s wr

$ sudo ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42348G 19085G 68.93 1.00   0 root default
-2 20.34990        - 20477G 14096G  6381G 68.84 1.00   0     host md002
18  1.84999  1.00000  1861G  1191G   670G 64.00 0.93 242         osd.18
19  1.84999  1.00000  1861G  1287G   573G 69.18 1.00 266         osd.19
20  1.84999  1.00000  1861G  1433G   428G 77.00 1.12 298         osd.20
21  1.84999  1.00000  1861G  1210G   650G 65.04 0.94 264         osd.21
22  1.84999  1.00000  1861G  1211G   649G 65.10 0.94 253         osd.22
23  1.84999  1.00000  1861G  1242G   619G 66.72 0.97 255         osd.23
24  1.84999  1.00000  1861G  1316G   544G 70.74 1.03 258         osd.24
25  1.84999  1.00000  1861G  1411G   450G 75.82 1.10 277         osd.25
26  1.84999  1.00000  1861G  1235G   626G 66.35 0.96 261         osd.26
27  1.84999  1.00000  1861G  1258G   602G 67.62 0.98 252         osd.27
28  1.84999  1.00000  1861G  1296G   565G 69.63 1.01 254         osd.28
-3 20.34990        - 20477G 14193G  6284G 69.31 1.01   0     host md008
 0  1.84999  1.00000  1861G  1298G   563G 69.76 1.01 260         osd.0
 3  1.84999  1.00000  1861G  1333G   528G 71.61 1.04 256         osd.3
 5  1.84999  1.00000  1861G  1224G   637G 65.77 0.95 256         osd.5
 7  1.84999  1.00000  1861G  1395G   466G 74.96 1.09 297         osd.7
10  1.84999  1.00000  1861G  1344G   517G 72.20 1.05 254         osd.10
 1  1.84999  1.00000  1861G  1269G   591G 68.22 0.99 264         osd.1
 6  1.84999  1.00000  1861G  1484G   376G 79.76 1.16 268         osd.6
 8  1.84999  1.00000  1861G  1136G   725G 61.02 0.89 270         osd.8
 9  1.84999  1.00000  1861G  1204G   656G 64.71 0.94 242         osd.9
 2  1.84999  1.00000  1861G  1330G   530G 71.48 1.04 268         osd.2
 4  1.84999  1.00000  1861G  1171G   689G 62.95 0.91 245         osd.4
-5 20.07994        - 20478G 14059G  6419G 68.65 1.00   0     host md005
14  2.73000  0.84999  2792G  1826G   966G 65.39 0.95 374         osd.14
15  2.73000  0.89999  2792G  1841G   951G 65.94 0.96 371         osd.15
16  2.73000  0.95000  2792G  1783G  1008G 63.87 0.93 380         osd.16
12  2.73000  0.79999  2792G  1833G   958G 65.66 0.95 368         osd.12
17  2.73000  0.95000  2792G  1769G  1023G 63.36 0.92 381         osd.17
13  2.73000  0.89999  2792G  1903G   889G 68.15 0.99 395         osd.13
11  1.84999  1.00000  1861G  1485G   375G 79.81 1.16 298         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.81 1.26 313         osd.29
               TOTAL 61433G 42348G 19085G 68.93
MIN/MAX VAR: 0.89/1.26  STDDEV: 5.86

So the problem of the 2 pg's that were active+undersized+degraded is now fixed. Still, there is 1 osd that's 'near full'.
The manual tuning has been fun, but it can also be done automatically:
 sudo ceph osd test-reweight-by-utilization
no change
moved 146 / 8640 (1.68981%)
avg 288
stddev 48.1311 -> 50.5021 (expected baseline 16.6853)
min osd.9 with 242 -> 245 pgs (0.840278 -> 0.850694 * mean)
max osd.13 with 395 -> 383 pgs (1.37153 -> 1.32986 * mean)

oload 120
max_change 0.05
max_change_osds 4
average 0.689340
overload 0.827209
osd.29 weight 1.000000 -> 0.950012
osd.17 weight 0.949997 -> 0.999985
osd.16 weight 0.949997 -> 0.999985
osd.14 weight 0.849991 -> 0.896011

That looks like a sensible change. Let's apply it.
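Applying it should be the same command without the test prefix (120 being the same oload threshold shown in the test output above):

$ sudo ceph osd reweight-by-utilization 120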
And then the result is…
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_OK
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18870, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e108690: 30 osds: 30 up, 30 in
            flags sortbitwise
      pgmap v13455986: 2880 pgs, 5 pools, 14064 GB data, 3523 kobjects
            42709 GB used, 18724 GB / 61433 GB avail
                2879 active+clean
                   1 active+clean+scrubbing+deep


Looking good, all done!

Sunday, April 24, 2016

Why I moved away from BackupPC for large data sets

In a previous post (http://shootingbits.blogspot.nl/2014/07/migrating-backuppc-from-ext4-to.html) I explained how our BackupPC storage was becoming too small and was having problems. We decided to switch to GlusterFS so we could scale more easily.
We did all that, and it worked. Or so it seemed. GlusterFS is a filesystem on top of a filesystem: an additional layer over the normal filesystem that decides where to place files (whole files, not parts of them) and that can rebalance the placement of those files and check for consistency. But because you (and therefore also your scripts, the OS, etc.) still have access to the underlying filesystem, things can change behind GlusterFS's back. That means the consistency of a GlusterFS volume can't be guaranteed at any point in time. As long as you're very careful to always access the storage through Gluster, you should be fine. Even though we did that, we still ended up with some weird situations:
- Having BackupPC create hard links on Gluster was not working properly. The Perl call used by BackupPC seemed to be incompatible with GlusterFS at the time, and I had to change the way BackupPC creates hard links to get it to work. I've created a bug report at Gluster (https://bugzilla.redhat.com/show_bug.cgi?id=1156022) but, as you'll read later, it's no longer relevant for us.
- After deleting data on GlusterFS, no space was freed. Even when the last file of a set of hard links was deleted, the space did not become available. That meant we would need to keep adding storage, forever.
- Moving to GlusterFS was hell. Not because of Gluster, but because of the way BackupPC stores data: every file is stored as a file, and if an identical file already exists a hardlink is created instead. In theory this is very efficient, especially for large files. But since we also have tens of millions of small files, transferring the data from our old storage to our new storage was very heavily bound by iops: each file essentially becomes at least one iop. And with tens of millions of files stored per backup run, and many backups, that means hundreds of millions of files or links. With the data being read from a 12-disk raid-5 or raid-6 set of 7200rpm disks (which does about 500 iops max for random io) it took weeks to transfer the data, even though it was only about 15TB; a rough calculation of why follows below.
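A back-of-the-envelope check (the 300 million figure is just an assumption in the ballpark of 'hundreds of millions'; a real transfer needs several iops per file, so this is optimistic):

$ echo "300000000 / 500 / 3600 / 24" | bc    # files / iops -> days, at one iop per file
6

So about a week in the absolute best case, and with multiple iops per file (read, create, link) it quickly turns into weeks.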

I wanted to replace BackupPC with a block-based backup solution that stores the backup not as one file per file but in archives. We ended up with Bareos (www.bareos.com) and it is working well for us. Of course it has its own weird things and limitations, but they are orders of magnitude less bad than BackupPC was for us. Mind you, BackupPC 3 is still a great tool and may work fine for your situation if you don't throw millions of files at it.

And we decided to ditch GlusterFS and move to Ceph instead. Ceph is a scalable, distributed storage solution that can be used for OpenStack but runs fine on its own. It is in active development and was acquired by Red Hat, which is also the owner of GlusterFS. My suspicion is that Ceph will replace GlusterFS. It works much better for us, although we had large performance problems running rados block devices with version 0.94 (hammer) on btrfs osd's. Once we replaced btrfs with xfs it became much more stable (osd's were no longer dropping out that often), and once we moved to 9.2.0 (the next release, infernalis) things became very stable.
We're not getting 10gbit throughput even though the infrastructure is there, the network is tuned for it and the disks and cpu's are not maxed out, but it's good enough for our needs. We'd like to achieve better throughput, but it's not worth the effort currently to dive in very deeply to figure out why we're not getting it. I suspect Ceph is designed for many parallel readers/writers and therefore not optimised for a single user wanting to suck 10gbit/sec out of it. We'll figure it out in the future.

And because we're now running Bareos as our main backup tool, which stores all the files in archives, syncing to our offsite backup solution has become much easier. We no longer have to transfer hundreds of millions of files there, only a number of 'large' archive files. That allows us to achieve a much higher throughput and places much lower iops requirements on that side as well.

It's been a loooong journey to get here, with many lessons learned. But we now have a stable backup solution, hopefully for years to come.

Friday, December 18, 2015

Fiio Q1 review

This is just a very quick first impression of what I think of the Fiio Q1 after receiving it and testing it to make sure it's not broken before it goes under the Christmas tree.
The package is a bit small and decent but not great. It is definitely not giving me the feeling I just spent €90 on this thing. Unpacking shows the device is a bit smaller than I expected and very much lighter. It's so light that it feels cheap, despite having a good touch to it. The assembly also looks a bit cheap. Again, not a great start.
I bought it because the sound from my Dell 980sff is total crap, and my airbook isn't really giving the sound I'm hoping for either. I was using an Asus Xonar U1, but my headphones are way too loud with it: I need to turn the volume on the computer down to the lowest level in software first, and then also turn the U1 all the way down. This means I lose a lot of fidelity and it's still slightly too loud. It seems the Fiio Q1 doesn't do much better, because I also need to keep the volume knob all the way down (even in low gain mode), although the software volume on the machine can stay at the maximum (which is good for fidelity). So the Q1 is an improvement for me over the U1, but whether it's enough is something I need to spend more time on. And I haven't really listened to the sound yet, so I have no idea about sound quality either.
That will all happen soon after Christmas, I hope.

Update 2015-12-25
It's Christmas and I found myself with half an hour to do some comparison listening. I used FLACs at 44.1kHz or 48kHz, 16 bits, played with foobar2000 in WASAPI event mode to get the data to the Fiio. I listened with a Sennheiser HD558 and an AKG K550. I'm comparing the sound to a Realtek onboard audio chip, the ALC892 (http://www.realtek.com.tw/products/productsView.aspx?Langid=1&PFid=28&Level=5&Conn=4&ProdID=284), and to the O2 dac-amp. Although I prefer to have a look at measurements when deciding on a new audio device, doing proper measurements is hard, and interpreting them is at least as difficult. Have a read at http://nwavguy.blogspot.nl/2011/02/testing-methods.html to see what I mean.
Because I don't have expensive measurement gear at my disposal, all I can go by is my ears. For me that's in the end the most important thing anyway, because that's what I'm using to listen to and enjoy the music. If I happen to like a device with very bad measurements, so be it. But at least I'll be honest about it :)
Anyway, back to the listening test. I don't like to use vague words to describe what I'm hearing, but since there isn't a clear standard for describing differences in this area, I think I'll have to. I'll try to keep it 'real'. The Fiio Q1 has more air in the reproduced sound than the ALC892. It's slightly easier to discern the separate instruments from each other, but the placement remains the same; there is just a bit more room between the instruments. It sounds less smudged or compacted. Because of this it sounds a bit more 'studio' or 'analytical'. I'm not hearing much more detail in the music, maybe a tiny little bit. There is also a little bit less depth and fun in the music; it's like the improved clarity has removed some of that.
In the end, the most pronounced difference I heard was with Michael Nyman's 'The Heart Asks Pleasure First', where the Fiio makes the music come more alive.

All of this is much less of a difference than I heard with the O2 dac-amp, which gave me a real 'wow' effect compared to the ALC892. The Fiio Q1 doesn't do that, but it's an improvement over 'standard' audio, although not by much. The O2 dac-amp is the real winner here, and only slightly more expensive when you build your own. The Q1 however is much smaller, lighter and less finicky (I had to repair my O2 twice already, but that's DIY). And the Q1 is nicer to look at. If you go for sound though, get the O2. It gives the feeling of really being there when listening to the recording. The Q1 really sounds like a recording. That's great for listening on the go or as background music, but when actively listening, the O2 has better sound to my ears.

I was unable to find any 'real' measurements on the internet for the Q1 so far (I'm not talking about the data presented by Fiio, nor about any home user using RMAA). I'd be very interested to check whether what I think I'm hearing is backed up by data, or whether I'm simply affected by subjective bias (http://nwavguy.blogspot.nl/2011/03/dac-listening-challenge-results.html). Please let me know.

Wednesday, October 1, 2014

Measurements while building the O2 objective headphone amp

While building the (amazing) O2 objective amp and dac I did some measurements before powering up the unit. This is strongly advised by the build guides because it lets you detect problems and mistakes before applying power, hopefully preventing costly burnouts or, worse, explosions.
Unfortunately most of the measurements I made were way off from the ones listed in the guides. Here is what I measured:
R1&R2 476 instead of 100-220
R5 191k instead of 100k
R15/18/10/11 240 instead of 1
R16/22 1.55k instead of 1.3k
R7/14/3/20 500 instead of 100-300
(all values are in Ohms)

The measurement device I use is a somewhat expensive Voltcraft digital multimeter with a high sampling rate. My suspicion is that this specific multimeter injects a current while measuring ohms without letting the circuit discharge before taking the next sample; combined with the capacitance in the circuit, that could result in resistance readings that are too high. Unfortunately I did not have a different (digital) multimeter available to cross-check with. After checking and triple-checking the placement of the components and the solder joints I decided to try the unit.
And it works beautifully!

I compared it to all the amplifiers and dacs I have and for its cheap price it could only be matched by equipment 4 times its price or above. I find that to be very good value for money.

I hope this post helps if ever you build your own O2.