At 50-60% full, our Ceph cluster was already reporting that 1 (and sometimes 2) osd's were "near full". That's odd for a cluster that's only 50-60% used. Worse, there were 2 pg's in a degraded state because the 3rd replica could not be written. Maybe those problems are related? This situation started after we created a new pool with two images in it, back when we were still on Infernalis.
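To see exactly which pg's and osd's the warnings refer to, something like this helps (output not shown here):
$ sudo ceph health detail
$ sudo ceph pg dump_stuck unclean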
Details:
cluster 6318a6a2-808b-45a1-9c89-31575c58de49
health HEALTH_WARN
2 pgs degraded
2 pgs stuck unclean
2 pgs undersized
recovery 696/10319262 objects degraded (0.007%)
2 near full osd(s)
monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
election epoch 18824, quorum 0,1,2,3 md002,md005,md008,md010
osdmap e105351: 30 osds: 30 up, 30 in
pgmap v13008499: 2880 pgs, 5 pools, 13406 GB data, 3359 kobjects
40727 GB used, 20706 GB / 61433 GB avail
696/10319262 objects degraded (0.007%)
2878 active+clean
2 active+undersized+degraded
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.92989 root default
-2 22.20000 host md002
18 1.84999 osd.18 up 1.00000 1.00000
19 1.84999 osd.19 up 1.00000 1.00000
20 1.84999 osd.20 up 1.00000 1.00000
21 1.84999 osd.21 up 1.00000 1.00000
22 1.84999 osd.22 up 1.00000 1.00000
23 1.84999 osd.23 up 1.00000 1.00000
24 1.84999 osd.24 up 1.00000 1.00000
25 1.84999 osd.25 up 1.00000 1.00000
26 1.84999 osd.26 up 1.00000 1.00000
27 1.84999 osd.27 up 1.00000 1.00000
28 1.84999 osd.28 up 1.00000 1.00000
29 1.84999 osd.29 up 1.00000 1.00000
-3 22.20000 host md008
0 1.84999 osd.0 up 1.00000 1.00000
3 1.84999 osd.3 up 1.00000 1.00000
5 1.84999 osd.5 up 1.00000 1.00000
7 1.84999 osd.7 up 1.00000 1.00000
11 1.84999 osd.11 up 1.00000 1.00000
10 1.84999 osd.10 up 1.00000 1.00000
1 1.84999 osd.1 up 1.00000 1.00000
6 1.84999 osd.6 up 1.00000 1.00000
8 1.84999 osd.8 up 1.00000 1.00000
9 1.84999 osd.9 up 1.00000 1.00000
2 1.84999 osd.2 up 1.00000 1.00000
4 1.84999 osd.4 up 1.00000 1.00000
-5 16.37999 host md005
14 2.73000 osd.14 up 0.79999 1.00000
15 2.73000 osd.15 up 0.79999 1.00000
16 2.73000 osd.16 up 0.79999 1.00000
12 2.73000 osd.12 up 0.79999 1.00000
17 2.73000 osd.17 up 0.79999 1.00000
13 2.73000 osd.13 up 0.79999 1.00000
The osd tree shows that we have 3 machines, two of them with about 22 TB each and one with about 16 TB. Of course that means we can only store about 16 TB with a replication count of three if we want each pg to be placed on a separate machine. Currently we are using about 13.4 TB, and that's about 13.4/16 ≈ 83% of the available capacity. That's getting close to being a problem.
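A quick way to see those per-host totals without adding up the weights by hand is the per-host summary line that ceph osd df tree prints (the same command used further down):
$ sudo ceph osd df tree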
Because not all pg's are of equal size (it seems), some disks are also used a lot more than others:
[root@md010 ~]# sudo ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
18 1.84999 1.00000 1861G 1166G 695G 62.64 0.97 238
19 1.84999 1.00000 1861G 1075G 785G 57.78 0.89 223
20 1.84999 1.00000 1861G 1160G 701G 62.32 0.96 278
21 1.84999 1.00000 1861G 972G 888G 52.26 0.81 230
22 1.84999 1.00000 1861G 1073G 788G 57.65 0.89 235
23 1.84999 1.00000 1861G 1077G 784G 57.87 0.89 242
24 1.84999 1.00000 1861G 1135G 725G 61.00 0.94 226
25 1.84999 1.00000 1861G 1154G 707G 62.01 0.96 245
26 1.84999 1.00000 1861G 1096G 764G 58.91 0.91 243
27 1.84999 1.00000 1861G 1080G 781G 58.03 0.90 234
28 1.84999 1.00000 1861G 1036G 825G 55.66 0.86 237
29 1.84999 1.00000 1861G 1224G 637G 65.76 1.02 249
0 1.84999 1.00000 1861G 1146G 714G 61.61 0.95 242
3 1.84999 1.00000 1861G 1101G 760G 59.16 0.91 237
5 1.84999 1.00000 1861G 1122G 739G 60.29 0.93 246
7 1.84999 1.00000 1861G 1128G 732G 60.64 0.94 249
11 1.84999 1.00000 1861G 1115G 746G 59.92 0.93 226
10 1.84999 1.00000 1861G 1127G 733G 60.58 0.94 229
1 1.84999 1.00000 1861G 1090G 771G 58.57 0.90 239
6 1.84999 1.00000 1861G 1248G 612G 67.09 1.04 241
8 1.84999 1.00000 1861G 902G 959G 48.47 0.75 243
9 1.84999 1.00000 1861G 987G 874G 53.04 0.82 217
2 1.84999 1.00000 1861G 1252G 608G 67.31 1.04 259
4 1.84999 1.00000 1861G 1125G 735G 60.47 0.93 252
14 2.73000 0.79999 2792G 2321G 471G 83.13 1.28 489
15 2.73000 0.79999 2792G 2142G 649G 76.73 1.19 474
16 2.73000 0.79999 2792G 2081G 711G 74.52 1.15 460
12 2.73000 0.79999 2792G 2375G 416G 85.08 1.31 494
17 2.73000 0.79999 2792G 1947G 845G 69.72 1.08 462
13 2.73000 0.79999 2792G 2307G 484G 82.64 1.28 499
TOTAL 61433G 39778G 21655G 64.75
MIN/MAX VAR: 0.75/1.31 STDDEV: 8.65
So indeed, some disks on md005 (which only has 16 TB) are already 85% full.
Since the 3rd machine (md005) has some free disk slots, I decided to move one disk each from md002 and md008 to md005. That reduces the capacity of md002 and md008 and adds it to md005, so each node ends up with about 20 TB. That should increase the maximum capacity we can store from 16 TB to 20 TB, and it should decrease the used percentage from 83% to 13.4/20 ≈ 67%. Much better already.
Doing this was easy (a rough sketch of the commands follows after the list):
stop the osd
unmount it
identify the physical disk (map the osd number to the device id to the virtual disk to the physical disk, and turn the locate LED on)
move it physically
import the foreign disk on md005
mount it
start the osd
and repeat for the other osd.
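Roughly, per moved disk, it looks something like the sketch below. Everything here is an example: osd.29 and /dev/sdX are placeholder names, the systemd unit and mount point assume a standard ceph-disk layout, I've added a noout flag as a precaution that wasn't in the list above, and the locate/foreign-import step depends on the RAID controller (ledctl is from the ledmon package; a PERC-style controller has its own CLI):
$ sudo ceph osd set noout                          # don't mark the osd out while it is down
$ df /var/lib/ceph/osd/ceph-29                     # see which device backs this osd
$ sudo systemctl stop ceph-osd@29                  # stop the osd
$ sudo umount /var/lib/ceph/osd/ceph-29            # unmount it
$ sudo ledctl locate=/dev/sdX                      # make the drive bay blink
... physically move the disk and import the foreign config on md005 ...
$ sudo mount /dev/sdX1 /var/lib/ceph/osd/ceph-29   # mount it on the new host
$ sudo systemctl start ceph-osd@29                 # start the osd
$ sudo ceph osd unset noout
With the default 'osd crush update on start = true', the osd registers itself under its new host in the crush map when it starts, which is what the osd tree below reflects.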
The osd tree then looks like this:
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 60.77974 root default
-2 20.34990 host md002
18 1.84999 osd.18 up 1.00000 1.00000
19 1.84999 osd.19 up 1.00000 1.00000
20 1.84999 osd.20 up 1.00000 1.00000
21 1.84999 osd.21 up 1.00000 1.00000
22 1.84999 osd.22 up 1.00000 1.00000
23 1.84999 osd.23 up 1.00000 1.00000
24 1.84999 osd.24 up 1.00000 1.00000
25 1.84999 osd.25 up 1.00000 1.00000
26 1.84999 osd.26 up 1.00000 1.00000
27 1.84999 osd.27 up 1.00000 1.00000
28 1.84999 osd.28 up 1.00000 1.00000
-3 20.34990 host md008
0 1.84999 osd.0 up 1.00000 1.00000
3 1.84999 osd.3 up 1.00000 1.00000
5 1.84999 osd.5 up 1.00000 1.00000
7 1.84999 osd.7 up 1.00000 1.00000
10 1.84999 osd.10 up 1.00000 1.00000
1 1.84999 osd.1 up 1.00000 1.00000
6 1.84999 osd.6 up 1.00000 1.00000
8 1.84999 osd.8 up 1.00000 1.00000
9 1.84999 osd.9 up 1.00000 1.00000
2 1.84999 osd.2 up 1.00000 1.00000
4 1.84999 osd.4 up 1.00000 1.00000
-5 20.07994 host md005
14 2.73000 osd.14 up 0.79999 1.00000
15 2.73000 osd.15 up 0.79999 1.00000
16 2.73000 osd.16 up 0.79999 1.00000
12 2.73000 osd.12 up 0.79999 1.00000
17 2.73000 osd.17 up 0.79999 1.00000
13 2.73000 osd.13 up 0.79999 1.00000
11 1.84999 osd.11 up 1.00000 1.00000
29 1.84998 osd.29 up 1.00000 1.00000
Since the disks have physically moved, and some pg's may temporarily have two copies on md005, a lot of data will have to move before everything is rebalanced again. This is what that looks like:
cluster 6318a6a2-808b-45a1-9c89-31575c58de49
health HEALTH_WARN
13 pgs backfill_toofull
214 pgs backfill_wait
71 pgs backfilling
738 pgs degraded
52 pgs recovering
610 pgs recovery_wait
876 pgs stuck unclean
76 pgs undersized
recovery 774247/11332909 objects degraded (6.832%)
recovery 2100222/11332909 objects misplaced (18.532%)
2 near full osd(s)
monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
election epoch 18834, quorum 0,1,2,3 md002,md005,md008,md010
osdmap e105775: 30 osds: 30 up, 30 in; 292 remapped pgs
pgmap v13024840: 2880 pgs, 5 pools, 13490 GB data, 3380 kobjects
41515 GB used, 19918 GB / 61433 GB avail
774247/11332909 objects degraded (6.832%)
2100222/11332909 objects misplaced (18.532%)
1933 active+clean
603 active+recovery_wait+degraded
171 active+remapped+wait_backfill
52 active+recovering+degraded
45 active+undersized+degraded+remapped+backfilling
30 active+undersized+degraded+remapped+wait_backfill
26 active+remapped+backfilling
12 active+remapped+wait_backfill+backfill_toofull
7 active+recovery_wait+degraded+remapped
1 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
recovery io 351 MB/s, 88 objects/s
client io 1614 kB/s rd, 158 kB/s wr, 0 op/s rd, 0 op/s wr
During the recovery:
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
18 1.84999 1.00000 1861G 1311G 549G 70.47 1.04 257
19 1.84999 1.00000 1861G 1189G 671G 63.92 0.94 268
20 1.84999 1.00000 1861G 1360G 501G 73.06 1.08 307
21 1.84999 1.00000 1861G 1217G 644G 65.39 0.96 279
22 1.84999 1.00000 1861G 1254G 606G 67.41 0.99 263
23 1.84999 1.00000 1861G 1184G 676G 63.64 0.94 262
24 1.84999 1.00000 1861G 1245G 616G 66.90 0.99 264
25 1.84999 1.00000 1861G 1292G 568G 69.44 1.02 285
26 1.84999 1.00000 1861G 1239G 621G 66.61 0.98 271
27 1.84999 1.00000 1861G 1210G 651G 65.01 0.96 259
28 1.84999 1.00000 1861G 1204G 656G 64.73 0.95 259
0 1.84999 1.00000 1861G 1218G 643G 65.43 0.96 272
3 1.84999 1.00000 1861G 1177G 684G 63.25 0.93 262
5 1.84999 1.00000 1861G 1175G 685G 63.17 0.93 265
7 1.84999 1.00000 1861G 1270G 591G 68.24 1.01 307
10 1.84999 1.00000 1861G 1162G 699G 62.44 0.92 262
1 1.84999 1.00000 1861G 1200G 660G 64.51 0.95 274
6 1.84999 1.00000 1861G 1271G 590G 68.28 1.01 277
8 1.84999 1.00000 1861G 1024G 836G 55.04 0.81 273
9 1.84999 1.00000 1861G 1101G 760G 59.16 0.87 250
2 1.84999 1.00000 1861G 1291G 570G 69.38 1.02 276
4 1.84999 1.00000 1861G 1167G 694G 62.70 0.92 264
14 2.73000 0.79999 2792G 2363G 428G 84.64 1.25 429
15 2.73000 0.79999 2792G 2189G 603G 78.39 1.15 398
16 2.73000 0.79999 2792G 2169G 622G 77.70 1.14 399
12 2.73000 0.79999 2792G 2388G 404G 85.52 1.26 452
17 2.73000 0.79999 2792G 2072G 720G 74.21 1.09 385
13 2.73000 0.79999 2792G 2321G 471G 83.13 1.22 425
11 1.84999 1.00000 1861G 1071G 790G 57.54 0.85 348
29 1.84998 1.00000 1861G 354G 1506G 19.06 0.28 353
TOTAL 61433G 41703G 19730G 67.88
MIN/MAX VAR: 0.28/1.26 STDDEV: 11.50
And once it was done, the result was rather surprising:
cluster 6318a6a2-808b-45a1-9c89-31575c58de49
health HEALTH_WARN
13 pgs backfill_toofull
6 pgs backfilling
19 pgs stuck unclean
recovery 120/10824993 objects degraded (0.001%)
recovery 184018/10824993 objects misplaced (1.700%)
2 near full osd(s)
monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
osdmap e107178: 30 osds: 30 up, 30 in; 19 remapped pgs
pgmap v13145938: 2880 pgs, 5 pools, 13936 GB data, 3491 kobjects
42471 GB used, 18962 GB / 61433 GB avail
120/10824993 objects degraded (0.001%)
184018/10824993 objects misplaced (1.700%)
2857 active+clean
13 active+remapped+backfill_toofull
6 active+remapped+backfilling
4 active+clean+scrubbing
So apparently, despite the better balance, and despite the fact that we should now be able to store more, we actually have more pg's with problems than before.
When taking a closer look at the individual osd’s:
$ sudo ceph osd df tree
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 60.77974 - 61433G 42403G 19030G 69.02 1.00 0 root default
-2 20.34990 - 20477G 14093G 6383G 68.83 1.00 0 host md002
18 1.84999 1.00000 1861G 1250G 611G 67.16 0.97 247 osd.18
19 1.84999 1.00000 1861G 1248G 613G 67.04 0.97 259 osd.19
20 1.84999 1.00000 1861G 1446G 415G 77.68 1.13 301 osd.20
21 1.84999 1.00000 1861G 1246G 615G 66.96 0.97 271 osd.21
22 1.84999 1.00000 1861G 1197G 664G 64.32 0.93 253 osd.22
23 1.84999 1.00000 1861G 1221G 640G 65.61 0.95 256 osd.23
24 1.84999 1.00000 1861G 1312G 548G 70.51 1.02 257 osd.24
25 1.84999 1.00000 1861G 1434G 426G 77.08 1.12 279 osd.25
26 1.84999 1.00000 1861G 1233G 627G 66.27 0.96 261 osd.26
27 1.84999 1.00000 1861G 1215G 645G 65.31 0.95 248 osd.27
28 1.84999 1.00000 1861G 1287G 574G 69.15 1.00 255 osd.28
-3 20.34990 - 20477G 14215G 6261G 69.42 1.01 0 host md008
0 1.84999 1.00000 1861G 1305G 555G 70.13 1.02 265 osd.0
3 1.84999 1.00000 1861G 1324G 537G 71.13 1.03 257 osd.3
5 1.84999 1.00000 1861G 1210G 650G 65.05 0.94 255 osd.5
7 1.84999 1.00000 1861G 1402G 459G 75.34 1.09 295 osd.7
10 1.84999 1.00000 1861G 1365G 496G 73.33 1.06 254 osd.10
1 1.84999 1.00000 1861G 1260G 600G 67.73 0.98 267 osd.1
6 1.84999 1.00000 1861G 1502G 358G 80.73 1.17 272 osd.6
8 1.84999 1.00000 1861G 1144G 717G 61.45 0.89 269 osd.8
9 1.84999 1.00000 1861G 1176G 685G 63.18 0.92 240 osd.9
2 1.84999 1.00000 1861G 1328G 532G 71.38 1.03 266 osd.2
4 1.84999 1.00000 1861G 1194G 667G 64.17 0.93 251 osd.4
-5 20.07994 - 20478G 14093G 6385G 68.82 1.00 0 host md005
14 2.73000 0.84999 2792G 1861G 931G 66.64 0.97 392 osd.14
15 2.73000 0.79999 2792G 1769G 1022G 63.37 0.92 354 osd.15
16 2.73000 0.79999 2792G 1722G 1070G 61.68 0.89 353 osd.16
12 2.73000 0.79999 2792G 2016G 775G 72.22 1.05 405 osd.12
17 2.73000 0.79999 2792G 1647G 1145G 59.00 0.85 349 osd.17
13 2.73000 0.79999 2792G 1832G 960G 65.60 0.95 378 osd.13
11 1.84999 1.00000 1861G 1627G 233G 87.44 1.27 326 osd.11
29 1.84998 1.00000 1861G 1616G 245G 86.83 1.26 345 osd.29
TOTAL 61433G 42403G 19030G 69.02
MIN/MAX VAR: 0.85/1.27 STDDEV: 6.91
We can see that the disks that got moved to md005 (osd's 11 and 29) are now almost full, while the other disks in md005, which are much larger, are not used nearly as much. We can also see that the larger disks have a lower reweight value, so less data goes there; that is probably no longer needed. For the least used osd's in md005 I'll increase the reweight so more data will go there.
There are two ways to reweight:
ceph osd crush reweight
ceph osd reweight
Since our osd reweight values are not at the default of 1, I'll bring those closer to 1, also because that setting is apparently not persistent: when an osd goes 'out', the setting is lost. More information on that can be found here: http://ceph.com/planet/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight/
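For reference, the two variants differ in syntax and meaning; the osd and the values here are just placeholders:
$ sudo ceph osd crush reweight osd.12 2.73   # permanent crush weight, roughly the disk size in TiB
$ sudo ceph osd reweight 12 0.85             # temporary override between 0 and 1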
$ sudo ceph osd reweight 15 0.85
reweighted osd.15 to 0.85 (d999)
$ sudo ceph osd reweight 16 0.85
reweighted osd.16 to 0.85 (d999)
$ sudo ceph osd reweight 17 0.85
reweighted osd.17 to 0.85 (d999)
I only reweighted the 3 osd's that have the least data on them, and only increased the weight by 0.05. I did this because I wanted to see the effects before taking the next step, and an increase of 0.05 shouldn't take that many hours to re-place all the PG's for which the crush map now calculates a different location.
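To keep an eye on how the rebalancing progresses while it runs, streaming the cluster status is handy; it prints the pgmap updates and recovery rates as they happen:
$ sudo ceph -w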
cluster 6318a6a2-808b-45a1-9c89-31575c58de49
health HEALTH_WARN
clock skew detected on mon.md010
16 pgs backfill_toofull
15 pgs backfill_wait
22 pgs backfilling
51 pgs degraded
23 pgs recovering
25 pgs recovery_wait
56 pgs stuck unclean
recovery 48630/10962135 objects degraded (0.444%)
recovery 458174/10962135 objects misplaced (4.180%)
2 near full osd(s)
Monitor clock skew detected
monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
osdmap e107399: 30 osds: 30 up, 30 in; 51 remapped pgs
pgmap v13147453: 2880 pgs, 5 pools, 13937 GB data, 3491 kobjects
42445 GB used, 18988 GB / 61433 GB avail
48630/10962135 objects degraded (0.444%)
458174/10962135 objects misplaced (4.180%)
2778 active+clean
25 active+recovery_wait+degraded
23 active+recovering+degraded
22 active+remapped+backfilling
13 active+remapped+backfill_toofull
12 active+remapped+wait_backfill
3 active+degraded
3 active+remapped+wait_backfill+backfill_toofull
1 active+remapped
recovery io 361 MB/s, 90 objects/s
client io 3611 kB/s wr, 0 op/s rd, 3 op/s wr
See? Only a small number of PG's is actually involved. Hopefully this change will result in osd's 11 and 29 being utilised a bit less and osd's 15 through 17 a bit more.
<some time later>
Indeed it made the situation slightly less bad:
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 60.77974 - 61433G 42538G 18895G 69.24 1.00 0 root default
-2 20.34990 - 20477G 14123G 6353G 68.97 1.00 0 host md002
18 1.84999 1.00000 1861G 1237G 624G 66.47 0.96 246 osd.18
19 1.84999 1.00000 1861G 1243G 617G 66.81 0.96 261 osd.19
20 1.84999 1.00000 1861G 1472G 389G 79.09 1.14 303 osd.20
21 1.84999 1.00000 1861G 1241G 619G 66.70 0.96 268 osd.21
22 1.84999 1.00000 1861G 1220G 640G 65.58 0.95 254 osd.22
23 1.84999 1.00000 1861G 1207G 654G 64.85 0.94 252 osd.23
24 1.84999 1.00000 1861G 1327G 533G 71.32 1.03 258 osd.24
25 1.84999 1.00000 1861G 1434G 426G 77.08 1.11 279 osd.25
26 1.84999 1.00000 1861G 1221G 639G 65.64 0.95 262 osd.26
27 1.84999 1.00000 1861G 1228G 632G 66.01 0.95 249 osd.27
28 1.84999 1.00000 1861G 1287G 574G 69.14 1.00 255 osd.28
-3 20.34990 - 20477G 14247G 6230G 69.58 1.00 0 host md008
0 1.84999 1.00000 1861G 1293G 568G 69.48 1.00 262 osd.0
3 1.84999 1.00000 1861G 1326G 535G 71.25 1.03 258 osd.3
5 1.84999 1.00000 1861G 1234G 627G 66.29 0.96 256 osd.5
7 1.84999 1.00000 1861G 1406G 455G 75.53 1.09 296 osd.7
10 1.84999 1.00000 1861G 1353G 507G 72.73 1.05 254 osd.10
1 1.84999 1.00000 1861G 1274G 587G 68.46 0.99 267 osd.1
6 1.84999 1.00000 1861G 1485G 375G 79.80 1.15 269 osd.6
8 1.84999 1.00000 1861G 1144G 717G 61.46 0.89 269 osd.8
9 1.84999 1.00000 1861G 1204G 657G 64.69 0.93 242 osd.9
2 1.84999 1.00000 1861G 1331G 529G 71.54 1.03 266 osd.2
4 1.84999 1.00000 1861G 1193G 668G 64.09 0.93 250 osd.4
-5 20.07994 - 20478G 14167G 6311G 69.18 1.00 0 host md005
14 2.73000 0.84999 2792G 1961G 831G 70.23 1.01 398 osd.14
15 2.73000 0.84999 2792G 1850G 942G 66.25 0.96 372 osd.15
16 2.73000 0.84999 2792G 1780G 1012G 63.76 0.92 366 osd.16
12 2.73000 0.79999 2792G 1941G 851G 69.51 1.00 392 osd.12
17 2.73000 0.84999 2792G 1669G 1122G 59.79 0.86 354 osd.17
13 2.73000 0.79999 2792G 1769G 1023G 63.36 0.92 369 osd.13
11 1.84999 1.00000 1861G 1601G 260G 86.03 1.24 315 osd.11
29 1.84998 1.00000 1861G 1593G 268G 85.59 1.24 332 osd.29
TOTAL 61433G 42538G 18895G 69.24
MIN/MAX VAR: 0.86/1.24 STDDEV: 6.47
The usage on 11 and 29 decreased and on 15, 16 and 17 it increased. Only by tiny amounts, but the weighting change was tiny as well. Let's make a bigger change:
$ sudo ceph osd reweight 16 0.90
$ sudo ceph osd reweight 17 0.95
$ sudo ceph osd reweight 13 0.90
And the result:
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 60.77974 - 61433G 42434G 18999G 69.07 1.00 0 root default
-2 20.34990 - 20477G 14098G 6379G 68.85 1.00 0 host md002
18 1.84999 1.00000 1861G 1207G 654G 64.85 0.94 244 osd.18
19 1.84999 1.00000 1861G 1270G 591G 68.25 0.99 265 osd.19
20 1.84999 1.00000 1861G 1459G 401G 78.42 1.14 300 osd.20
21 1.84999 1.00000 1861G 1213G 648G 65.19 0.94 266 osd.21
22 1.84999 1.00000 1861G 1220G 640G 65.59 0.95 252 osd.22
23 1.84999 1.00000 1861G 1213G 647G 65.20 0.94 253 osd.23
24 1.84999 1.00000 1861G 1314G 546G 70.63 1.02 257 osd.24
25 1.84999 1.00000 1861G 1433G 428G 77.00 1.11 278 osd.25
26 1.84999 1.00000 1861G 1225G 636G 65.83 0.95 262 osd.26
27 1.84999 1.00000 1861G 1243G 618G 66.78 0.97 251 osd.27
28 1.84999 1.00000 1861G 1295G 566G 69.60 1.01 254 osd.28
-3 20.34990 - 20477G 14230G 6246G 69.50 1.01 0 host md008
0 1.84999 1.00000 1861G 1295G 565G 69.61 1.01 260 osd.0
3 1.84999 1.00000 1861G 1343G 518G 72.17 1.04 257 osd.3
5 1.84999 1.00000 1861G 1225G 636G 65.82 0.95 257 osd.5
7 1.84999 1.00000 1861G 1409G 452G 75.71 1.10 298 osd.7
10 1.84999 1.00000 1861G 1339G 522G 71.93 1.04 252 osd.10
1 1.84999 1.00000 1861G 1269G 592G 68.19 0.99 264 osd.1
6 1.84999 1.00000 1861G 1496G 364G 80.39 1.16 269 osd.6
8 1.84999 1.00000 1861G 1145G 715G 61.55 0.89 270 osd.8
9 1.84999 1.00000 1861G 1204G 657G 64.68 0.94 242 osd.9
2 1.84999 1.00000 1861G 1330G 531G 71.45 1.03 268 osd.2
4 1.84999 1.00000 1861G 1171G 690G 62.93 0.91 245 osd.4
-5 20.07994 - 20478G 14105G 6373G 68.88 1.00 0 host md005
14 2.73000 0.84999 2792G 1864G 928G 66.76 0.97 378 osd.14
15 2.73000 0.84999 2792G 1720G 1071G 61.61 0.89 354 osd.15
16 2.73000 0.89999 2792G 1712G 1079G 61.33 0.89 368 osd.16
12 2.73000 0.79999 2792G 1890G 902G 67.69 0.98 376 osd.12
17 2.73000 0.95000 2792G 1870G 921G 66.99 0.97 395 osd.17
13 2.73000 0.89999 2792G 1942G 849G 69.57 1.01 400 osd.13
11 1.84999 1.00000 1861G 1511G 350G 81.18 1.18 301 osd.11
29 1.84998 1.00000 1861G 1592G 269G 85.55 1.24 318 osd.29
TOTAL 61433G 42434G 18999G 69.07
MIN/MAX VAR: 0.89/1.24 STDDEV: 6.08
So, very slowly, we're getting there. Some more tweaking and…
cluster 6318a6a2-808b-45a1-9c89-31575c58de49
health HEALTH_WARN
1 near full osd(s)
monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
election epoch 18842, quorum 0,1,2,3 md002,md005,md008,md010
osdmap e107725: 30 osds: 30 up, 30 in
pgmap v13170442: 2880 pgs, 5 pools, 13944 GB data, 3493 kobjects
42348 GB used, 19085 GB / 61433 GB avail
2879 active+clean
1 active+clean+scrubbing+deep
client io 50325 kB/s rd, 13882 kB/s wr, 24 op/s rd, 13 op/s wr
$ sudo ceph osd df tree
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 60.77974 - 61433G 42348G 19085G 68.93 1.00 0 root default
-2 20.34990 - 20477G 14096G 6381G 68.84 1.00 0 host md002
18 1.84999 1.00000 1861G 1191G 670G 64.00 0.93 242 osd.18
19 1.84999 1.00000 1861G 1287G 573G 69.18 1.00 266 osd.19
20 1.84999 1.00000 1861G 1433G 428G 77.00 1.12 298 osd.20
21 1.84999 1.00000 1861G 1210G 650G 65.04 0.94 264 osd.21
22 1.84999 1.00000 1861G 1211G 649G 65.10 0.94 253 osd.22
23 1.84999 1.00000 1861G 1242G 619G 66.72 0.97 255 osd.23
24 1.84999 1.00000 1861G 1316G 544G 70.74 1.03 258 osd.24
25 1.84999 1.00000 1861G 1411G 450G 75.82 1.10 277 osd.25
26 1.84999 1.00000 1861G 1235G 626G 66.35 0.96 261 osd.26
27 1.84999 1.00000 1861G 1258G 602G 67.62 0.98 252 osd.27
28 1.84999 1.00000 1861G 1296G 565G 69.63 1.01 254 osd.28
-3 20.34990 - 20477G 14193G 6284G 69.31 1.01 0 host md008
0 1.84999 1.00000 1861G 1298G 563G 69.76 1.01 260 osd.0
3 1.84999 1.00000 1861G 1333G 528G 71.61 1.04 256 osd.3
5 1.84999 1.00000 1861G 1224G 637G 65.77 0.95 256 osd.5
7 1.84999 1.00000 1861G 1395G 466G 74.96 1.09 297 osd.7
10 1.84999 1.00000 1861G 1344G 517G 72.20 1.05 254 osd.10
1 1.84999 1.00000 1861G 1269G 591G 68.22 0.99 264 osd.1
6 1.84999 1.00000 1861G 1484G 376G 79.76 1.16 268 osd.6
8 1.84999 1.00000 1861G 1136G 725G 61.02 0.89 270 osd.8
9 1.84999 1.00000 1861G 1204G 656G 64.71 0.94 242 osd.9
2 1.84999 1.00000 1861G 1330G 530G 71.48 1.04 268 osd.2
4 1.84999 1.00000 1861G 1171G 689G 62.95 0.91 245 osd.4
-5 20.07994 - 20478G 14059G 6419G 68.65 1.00 0 host md005
14 2.73000 0.84999 2792G 1826G 966G 65.39 0.95 374 osd.14
15 2.73000 0.89999 2792G 1841G 951G 65.94 0.96 371 osd.15
16 2.73000 0.95000 2792G 1783G 1008G 63.87 0.93 380 osd.16
12 2.73000 0.79999 2792G 1833G 958G 65.66 0.95 368 osd.12
17 2.73000 0.95000 2792G 1769G 1023G 63.36 0.92 381 osd.17
13 2.73000 0.89999 2792G 1903G 889G 68.15 0.99 395 osd.13
11 1.84999 1.00000 1861G 1485G 375G 79.81 1.16 298 osd.11
29 1.84998 1.00000 1861G 1616G 245G 86.81 1.26 313 osd.29
TOTAL 61433G 42348G 19085G 68.93
MIN/MAX VAR: 0.89/1.26 STDDEV: 5.86
So the problem of the 2 PG's that were active+undersized+degraded is now fixed. Still, there is 1 osd that's 'near full'.
The manual tuning has been fun, but it can also be done automatically:
sudo ceph osd test-reweight-by-utilization
no change
moved 146 / 8640 (1.68981%)
avg 288
stddev 48.1311 -> 50.5021 (expected baseline 16.6853)
min osd.9 with 242 -> 245 pgs (0.840278 -> 0.850694 * mean)
max osd.13 with 395 -> 383 pgs (1.37153 -> 1.32986 * mean)
oload 120
max_change 0.05
max_change_osds 4
average 0.689340
overload 0.827209
osd.29 weight 1.000000 -> 0.950012
osd.17 weight 0.949997 -> 0.999985
osd.16 weight 0.949997 -> 0.999985
osd.14 weight 0.849991 -> 0.896011
That looks like a sensible change. Let's apply it.
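For completeness, applying it is the same command without the test- prefix; run without arguments it uses the oload/max_change/max_osds defaults shown in the test output above:
$ sudo ceph osd reweight-by-utilization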
And then the result is…
cluster 6318a6a2-808b-45a1-9c89-31575c58de49
health HEALTH_OK
monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
election epoch 18870, quorum 0,1,2,3 md002,md005,md008,md010
osdmap e108690: 30 osds: 30 up, 30 in
flags sortbitwise
pgmap v13455986: 2880 pgs, 5 pools, 14064 GB data, 3523 kobjects
42709 GB used, 18724 GB / 61433 GB avail
2879 active+clean
1 active+clean+scrubbing+deep
Looking good, all done!