Wednesday 28 September 2016

Setting the Dell 11g power management profile from the CLI

We want to lower our power usage, both for cost reasons and out of respect for the environment. The company I work for has enough machines that making this change manually is not a good idea. It would mean scheduling downtime for the machine in the monitoring, draining it of traffic in the load balancer, rebooting it, waiting for the POST screen to enter the BIOS, changing the setting, exiting, rebooting again, and giving the machine traffic again. Lots of steps, error-prone, and not fun work at all.
Luckily Dell has the iDRAC, which allows us to manage server settings such as this one. Unfortunately the iDRAC interface has only supported changing BIOS settings since iDRAC 7 (http://en.community.dell.com/techcenter/b/techcenter/archive/2013/01/04/idrac7-now-support-configuring-server-idrac-bios-perc-and-nic-using-xml-file-and-racadm).

We have a bunch of 12g and 13g servers, but most of them are 11g (iDRAC 6), so using racadm to change the BIOS would be only a partial solution.

But there is also the Lifecycle Controller. It can be used to achieve the same result, and it also works on 11th-generation servers. Here I show how to do it for an M610 (but all Dells should behave the same in this area):

git clone https://github.com/dell/recite.git
cd recite
python recite.py IP=root@<host.domain.tld>
Then enter your iDRAC password. Or, if you want to script this, you could use:
python recite.py IP=root:password@<host.domain.tld>

That will put you into recite's command mode. To get the current value of the power management setting you do this:

--> GetBIOSEnumeration InstanceID=BIOS.Setup.1-1:PowerMgmt
Wed Sep 28 11:59:36 2016: GetBIOSEnumeration InstanceID=BIOS.Setup.1-1:PowerMgmt
wsman get "http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSEnumeration?InstanceID=BIOS.Setup.1-1:PowerMgmt" -h xxxxx.xxx.xxx -P 443 -u root -p ****** -V -v -c dummy.cert -j utf-8 -y basic

DCIM_BIOSEnumeration

  AttributeName = PowerMgmt
  CurrentValue = MaxPerf
  DefaultValue
  FQDD = BIOS.Setup.1-1
  InstanceID = BIOS.Setup.1-1:PowerMgmt
  IsReadOnly = false
  PendingValue
  PossibleValues = OsCtrl
  PossibleValues = ActivePwrCtrl
  PossibleValues = Custom
  PossibleValues = MaxPerf

--> 
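
If you would rather call wsman directly than go through recite's prompt, the command line that recite logs above can be rebuilt as an argument list. This is only a sketch: the function names are made up for illustration, the flags are copied verbatim from the log output above, and the host and password are placeholders.

```python
import shlex

# Base resource URI, copied from the wsman command that recite prints.
DCIM = "http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim"


def common_args(host, user, password):
    # Connection flags exactly as they appear in the recite log output.
    return ["-h", host, "-P", "443", "-u", user, "-p", password,
            "-V", "-v", "-c", "dummy.cert", "-j", "utf-8", "-y", "basic"]


def get_bios_enumeration(host, user, password, attribute, fqdd="BIOS.Setup.1-1"):
    # Hypothetical helper: builds the 'wsman get' call for one BIOS attribute.
    uri = f"{DCIM}/DCIM_BIOSEnumeration?InstanceID={fqdd}:{attribute}"
    return ["wsman", "get", uri] + common_args(host, user, password)


cmd = get_bios_enumeration("host.domain.tld", "root", "secret", "PowerMgmt")
print(shlex.join(cmd))  # shell-quoted, ready to paste or to log
```

The list can be handed to `subprocess.run(cmd)` as-is, which avoids shell quoting problems with the `?` and `:` in the URI.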


Then to change the setting, do this:

--> SetBIOSAttribute Target=BIOS.Setup.1-1 AttributeName=PowerMgmt AttributeValue=OsCtrl
Wed Sep 28 12:37:04 2016: SetBIOSAttribute Target=BIOS.Setup.1-1 AttributeName=PowerMgmt AttributeValue=OsCtrl
wsman invoke -a SetAttribute "http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSService?SystemCreationClassName=DCIM_ComputerSystem,CreationClassName=DCIM_BIOSService,SystemName=DCIM:ComputerSystem,Name=DCIM:BIOSService" -k "AttributeName=PowerMgmt" -k "AttributeValue=OsCtrl" -k "Target=BIOS.Setup.1-1" -h xxxxx.xxx.xxx -P 443 -u root -p ****** -V -v -c dummy.cert -j utf-8 -y basic

SetAttribute_OUTPUT

  Message = The command was successful
  MessageID = BIOS001
  RebootRequired = Yes
  ReturnValue = 0
  SetResult = Set PendingValue

--> 


This sets the new value, but it is *not applied yet*: it is 'pending', and a reboot is needed to apply it. You can trigger that reboot through your orchestration tool (MCollective, pssh, etc.) or through IPMI, and the change should then be applied (see below). But it can also be done through the Lifecycle Controller. To do this, a reboot job needs to be created:
--> CreateBIOSConfigJob Target=BIOS.Setup.1-1 ScheduledStartTime=TIME_NOW RebootJobType=1
Wed Sep 28 14:54:53 2016: CreateBIOSConfigJob Target=BIOS.Setup.1-1 ScheduledStartTime=TIME_NOW RebootJobType=1
wsman invoke -a CreateTargetedConfigJob "http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSService?SystemCreationClassName=DCIM_ComputerSystem,CreationClassName=DCIM_BIOSService,SystemName=DCIM:ComputerSystem,Name=DCIM:BIOSService" -k "RebootJobType=1" -k "ScheduledStartTime=TIME_NOW" -k "Target=BIOS.Setup.1-1" -h xxxxx.xxx.xxx -P 443 -u root -p ****** -V -v -c dummy.cert -j utf-8 -y basic

CreateTargetedConfigJob_OUTPUT

  ReturnValue = 4096
  Job
    EndpointReference
      Address = https://127.0.0.1:443/wsman
      ReferenceParameters
        ResourceURI = http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_LifecycleJob
        SelectorSet
          __cimnamespace = root/dcim
          InstanceID = JID_001475067508

-->


If all is well, the machine will reboot within a few seconds. Sometimes all is not well (jobs stuck in the queue, bugs, etc.) and it may take a (long) while before the LC actually picks up and executes the job. In that case you are better off taking matters into your own (scripted) hands.
If you have the iDRAC console open you'll see the server boot into the USC (the Lifecycle Controller interface), update the setting (it says this can take up to 10 minutes!) and reboot again. Be patient.
After the reboot, you can check whether the value was changed correctly (not during POST; this value is only updated once the OS is booting or has fully booted):

--> GetBIOSEnumeration InstanceID=BIOS.Setup.1-1:PowerMgmt
Wed Sep 28 15:10:10 2016: GetBIOSEnumeration InstanceID=BIOS.Setup.1-1:PowerMgmt
wsman get "http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSEnumeration?InstanceID=BIOS.Setup.1-1:PowerMgmt" -h xxxxx.xxx.xxx -P 443 -u root -p ****** -V -v -c dummy.cert -j utf-8 -y basic

DCIM_BIOSEnumeration
  AttributeName = PowerMgmt
  Caption
  CurrentValue = OsCtrl
  DefaultValue
  Description
  ElementName
  FQDD = BIOS.Setup.1-1
  InstanceID = BIOS.Setup.1-1:PowerMgmt
  IsOrderedList
  IsReadOnly = FALSE
  PendingValue
  PossibleValues = MaxPerf
  PossibleValues = Custom
  PossibleValues = ActivePwrCtrl
  PossibleValues = OsCtrl

-->
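
When scripting this across many machines, the check can be automated by parsing the "Key = Value" lines of the output. A minimal sketch, assuming the output format shown above; `parse_wsman_output` is a made-up helper, and the sample text is a shortened copy of the real output.

```python
def parse_wsman_output(text):
    """Parse 'Key = Value' lines; repeated keys collect into a list."""
    result = {}
    for line in text.splitlines():
        if " = " not in line:
            continue  # skip class names, prompts and keys without a value
        key, value = (part.strip() for part in line.split(" = ", 1))
        result.setdefault(key, []).append(value)
    return result


sample = """\
DCIM_BIOSEnumeration
  AttributeName = PowerMgmt
  CurrentValue = OsCtrl
  PendingValue
  PossibleValues = MaxPerf
  PossibleValues = OsCtrl
"""

fields = parse_wsman_output(sample)
# Applied means: the current value is the one we asked for and nothing is pending.
applied = fields.get("CurrentValue") == ["OsCtrl"] and "PendingValue" not in fields
print("applied" if applied else "not applied yet")
```

A key without a value (such as an empty PendingValue) is simply absent from the result, which is what the "nothing pending" test relies on.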

Wednesday 13 July 2016

Using logstash as a proxy to rsyslog

Years ago I set up a central log server for our infrastructure at Compare Group. Back then I decided to use rsyslog for this. I've been using rsyslog for a long time, but I've never liked it much. The weird syntax and the poor documentation, with its many assumptions and opinions, have not made it easy to debug odd behaviour. I think I'm not alone in this. That is why I was very happy when logstash came into existence. It has its own flaws, but at least it's easier to debug and use. I don't want to spend a lot of time rewriting our (huge) rsyslog configuration to logstash in one go, and I do want all logs to arrive on the same box. Since not all apps support logging to a port other than 514, either rsyslog or logstash has to listen on that port and hand off messages to the other while we slowly migrate from one to the other.
Because I've decided that logstash is our new default, I've configured it to listen on port 514. Most of the logic to parse and write out messages is still in rsyslog, so many messages that arrive on our central log server need to be forwarded from logstash to rsyslog. It seems there aren't many people doing this. Searching for this topic (use www.duckduckgo.com) only turns up references to people sending from rsyslog to logstash, not the other way around. So I figured this couldn't be that hard.
Indeed it's not 'hard' as in difficult. It's just a lot of work puzzling the pieces together.

You'll need logstash 2.2 or newer with the syslog output plugin. Maybe it'll work with older versions, but 2.2 is the version I built and tested this setup with. The plugin is not included by default, so install it first.
Below you'll find the resulting logstash configuration that I've used. I'll explain the setup:
The input listens on UDP and TCP. We receive most messages from servers via TCP, but not all applications and devices support TCP, so we need UDP as well.
By default the syslog output plugin sets the program name, facility and priority to defaults that it gets from logstash. So all messages are then sent as if they came from logstash on the logstash server, all with the same facility and priority.

So the received message needs to be parsed by logstash and, based on the message contents, we can overwrite the output plugin's defaults and set them back to the values that logstash received from the remote device.
We also specifically want to keep the original date and time at which the message was generated at the source, not the date and time of logstash. Due to buffering (especially when there are problems with, or maintenance on, the central log server) there may be large differences between the source time and the logstash time.
Similarly, not all devices send an RFC-compliant date/time, so we try to match those into the logstash format.
Because not all messages have a process ID, one must be set, or the variable name will be printed literally in the output. We don't want that, so the default is "-".
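
To make the parsing steps concrete, here is a small sketch of the same idea outside logstash. The regular expression is a simplified stand-in for the grok pattern (it only handles the classic "MMM d HH:mm:ss" timestamp), and the host name and message are made up. The PRI decoding follows the syslog convention that PRI = facility * 8 + severity.

```python
import re

# Simplified stand-in for the grok pattern:
# <PRI>timestamp host program[pid]: message
LINE_RE = re.compile(
    r"<(?P<pri>\d+)>"
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<logsource>\S+) "
    r"(?:(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: )?"
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    match = LINE_RE.match(line)
    if match is None:
        return None  # logstash would tag such an event _grokparsefailure
    event = match.groupdict()
    pri = int(event["pri"])
    event["facility"] = pri // 8  # PRI = facility * 8 + severity
    event["severity"] = pri % 8
    event["pid"] = event["pid"] or "-"  # same default as the mutate filter
    return event


with_pid = parse_syslog("<78>Jul 13 09:15:01 web01 crond[1234]: (root) CMD (run-parts)")
without_pid = parse_syslog("<13>Jul  3 09:15:01 web01 myapp: hello")
print(with_pid["facility"], with_pid["severity"], with_pid["pid"])
print(without_pid["program"], without_pid["pid"])
```

The second sample line has no process ID, so its `pid` falls back to "-" instead of leaking an unset variable name into the forwarded message.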

Once all that parsing is done and the messages can be passed through, we can start to prepare for the migration. In this case we match on the program name and write the messages to folders based on the sending hostname and to files named after the application. Many applications can log to the same file, but based on access rights some log to other files. Not everyone needs to see everything.



# This configuration tags incoming messages for processing by filters in other
# logstash config files. That way we have 1 place where the 'type' is declared
# and the logic of the processing can happen somewhere else. That should keep
# our implementation of logstash cleaner and more manageable.

input {
  udp {
    type => 'generic-syslog'
    port => 514
    buffer_size => 65536
    workers => 10
  }

  tcp {
    type => 'generic-syslog'
    port => 514
  }
}


filter {
  # Detect what type of message it is, then tag it accordingly.


    # Manual grok because of not using syslog input
    grok {
      # This match is based on the logstash pattern SYSLOGLINE but modified to
      # better match our needs.
      match => { "message" => "<%{POSINT:syslog_pri}>(?:%{SYSLOGTIMESTAMP:timestamp}|%{TIMESTAMP_ISO8601:timestamp8601}) (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource}+(?: %{SYSLOGPROG}:|)(\s)%{GREEDYDATA:syslog_message}" }
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
    }
    # Decode the facility and priority of the message.
    syslog_pri { }
    date {
      # The grok above stores the classic syslog timestamp in "timestamp".
      match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }

    # if the PID was not found/set, make it a - for the syslog output to work properly towards rsyslog
    if !([pid] =~ /.+/) {
      mutate { add_field => [ "pid", "-" ] }
    }


# --------------------------- configurable part starts here -----------------------
# don't change anything above this line

# use http://stackoverflow.com/questions/29673424/logstash-replace-field-values-matching-pattern for this, it seems better

# firewall
#$template firewall, "/export/logs/%HOSTNAME%/security/%$YEAR%-%$MONTH%-%$DAY%/firewall.log"
#:syslogtag, contains, "firewall:" ?firewall
  if [program] == "firewall" {
    mutate { replace => ["type", "firewall"] }
  }

  # cron
#$template crond, "/export/logs/%HOSTNAME%/management/%$YEAR%-%$MONTH%-%$DAY%/cron.log"
#cron.*  ?crond
#& ~
  if [program] == "cron" or [program] == "crond" {
    mutate { replace => ["type", "cron"] }
  }

  # sudo
  # sshd
  #
}


output {
#  if [type] == 'generic-syslog' or [type] == 'firewall' or [type]=='cron' {
#    stdout { codec => rubydebug }
#  }

  if [type] == 'cron'  {
    file {
      path => "/export/logs/%{short_host}/management/%{+YYYY-MM-dd}/cron.by.logstash.log"
      codec => line { format => "%{message}" }
      flush_interval => "0"
    }
  }

  if [type] == 'firewall'  {
    file {
      path => "/export/logs/%{short_host}/security/%{+YYYY-MM-dd}/firewall.by.logstash.log"
      codec => line { format => "%{message}" }
      flush_interval => "0"
    }
  }

  # These are sent by logback/log4j through rsyslog by the application.
  if [type] == 'generic-syslog'  {
#    file {
#      path => "/export/logs/debugging/send-to-syslog.log"
#      codec => line { format => "%{message}" }
#      flush_interval => "0"
#    }
    syslog {
      appname  => "%{program}"
      procid   => "%{pid}"
      message  => "%{syslog_message}"
      facility => "%{syslog_facility}"
      severity => "%{syslog_severity}"
      sourcehost => "%{logsource}"
      host => "localhost"
      port => 5140
      workers => 10
    }
  }

  # grok adds its failure marker to the "tags" array, so test for membership.
  if [type] == 'generic-syslog' and "_grokparsefailure" in [tags] {
    file {
      path => "/export/logs/debugging/.grokfailure.log"
      codec => line { format => "%{syslog_message} AND %{message}" }
      flush_interval => "0"
    }
  }

}
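
The routing that the filter and output sections implement together can be summarised in a few lines. This is only an illustrative sketch, not part of the logstash setup: the table and function names are made up, and the folder and file names are the ones from the file outputs above.

```python
from datetime import date

# program name -> (folder, file name), mirroring the two rules migrated so far
ROUTES = {
    "firewall": ("security", "firewall.by.logstash.log"),
    "cron": ("management", "cron.by.logstash.log"),
    "crond": ("management", "cron.by.logstash.log"),
}


def log_path(short_host, program, day):
    route = ROUTES.get(program)
    if route is None:
        return None  # not migrated yet: forwarded to rsyslog on port 5140
    folder, filename = route
    return f"/export/logs/{short_host}/{folder}/{day:%Y-%m-%d}/{filename}"


print(log_path("web01", "crond", date(2016, 7, 13)))
print(log_path("web01", "sshd", date(2016, 7, 13)))
```

Migrating another application then comes down to adding one entry to the table, which is the same effect the per-program `if` blocks have in the logstash filter.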


Wednesday 25 May 2016

Fixing a ceph imbalance

At 50-60% full, our ceph cluster was already reporting that 1 (and sometimes 2) OSDs were “near full”. That’s odd for a cluster that is only 50-60% used. Worse, there are 2 PGs in a degraded state because the 3rd replica could not be written. Maybe those problems are related? This situation started after we created a new pool with two images in it, back when we were on Infernalis.

details:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            2 pgs degraded
            2 pgs stuck unclean
            2 pgs undersized
            recovery 696/10319262 objects degraded (0.007%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18824, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e105351: 30 osds: 30 up, 30 in
      pgmap v13008499: 2880 pgs, 5 pools, 13406 GB data, 3359 kobjects
            40727 GB used, 20706 GB / 61433 GB avail
            696/10319262 objects degraded (0.007%)
                2878 active+clean
                   2 active+undersized+degraded


ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.92989 root default
-2 22.20000     host md002
18  1.84999         osd.18      up  1.00000          1.00000
19  1.84999         osd.19      up  1.00000          1.00000
20  1.84999         osd.20      up  1.00000          1.00000
21  1.84999         osd.21      up  1.00000          1.00000
22  1.84999         osd.22      up  1.00000          1.00000
23  1.84999         osd.23      up  1.00000          1.00000
24  1.84999         osd.24      up  1.00000          1.00000
25  1.84999         osd.25      up  1.00000          1.00000
26  1.84999         osd.26      up  1.00000          1.00000
27  1.84999         osd.27      up  1.00000          1.00000
28  1.84999         osd.28      up  1.00000          1.00000
29  1.84999         osd.29      up  1.00000          1.00000
-3 22.20000     host md008
 0  1.84999         osd.0       up  1.00000          1.00000
 3  1.84999         osd.3       up  1.00000          1.00000
 5  1.84999         osd.5       up  1.00000          1.00000
 7  1.84999         osd.7       up  1.00000          1.00000
11  1.84999         osd.11      up  1.00000          1.00000
10  1.84999         osd.10      up  1.00000          1.00000
 1  1.84999         osd.1       up  1.00000          1.00000
 6  1.84999         osd.6       up  1.00000          1.00000
 8  1.84999         osd.8       up  1.00000          1.00000
 9  1.84999         osd.9       up  1.00000          1.00000
 2  1.84999         osd.2       up  1.00000          1.00000
 4  1.84999         osd.4       up  1.00000          1.00000
-5 16.37999     host md005
14  2.73000         osd.14      up  0.79999          1.00000
15  2.73000         osd.15      up  0.79999          1.00000
16  2.73000         osd.16      up  0.79999          1.00000
12  2.73000         osd.12      up  0.79999          1.00000
17  2.73000         osd.17      up  0.79999          1.00000
13  2.73000         osd.13      up  0.79999          1.00000

The osd tree shows that we have 3 machines, two of them with about 22 TB and one with about 16 TB. Of course that means that we can only store about 16 TB with a replication count of three if we want each replica of a PG to be placed on a separate machine. Currently we are using about 13.4 TB, which is 13.4/16 ≈ 84% of the available capacity. That’s getting close to a problem.

Because not all PGs are of equal size (it seems), some disks are also used a lot more than others:
[root@md010 ~]# sudo ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
18 1.84999  1.00000  1861G  1166G   695G 62.64 0.97 238
19 1.84999  1.00000  1861G  1075G   785G 57.78 0.89 223
20 1.84999  1.00000  1861G  1160G   701G 62.32 0.96 278
21 1.84999  1.00000  1861G   972G   888G 52.26 0.81 230
22 1.84999  1.00000  1861G  1073G   788G 57.65 0.89 235
23 1.84999  1.00000  1861G  1077G   784G 57.87 0.89 242
24 1.84999  1.00000  1861G  1135G   725G 61.00 0.94 226
25 1.84999  1.00000  1861G  1154G   707G 62.01 0.96 245
26 1.84999  1.00000  1861G  1096G   764G 58.91 0.91 243
27 1.84999  1.00000  1861G  1080G   781G 58.03 0.90 234
28 1.84999  1.00000  1861G  1036G   825G 55.66 0.86 237
29 1.84999  1.00000  1861G  1224G   637G 65.76 1.02 249
 0 1.84999  1.00000  1861G  1146G   714G 61.61 0.95 242
 3 1.84999  1.00000  1861G  1101G   760G 59.16 0.91 237
 5 1.84999  1.00000  1861G  1122G   739G 60.29 0.93 246
 7 1.84999  1.00000  1861G  1128G   732G 60.64 0.94 249
11 1.84999  1.00000  1861G  1115G   746G 59.92 0.93 226
10 1.84999  1.00000  1861G  1127G   733G 60.58 0.94 229
 1 1.84999  1.00000  1861G  1090G   771G 58.57 0.90 239
 6 1.84999  1.00000  1861G  1248G   612G 67.09 1.04 241
 8 1.84999  1.00000  1861G   902G   959G 48.47 0.75 243
 9 1.84999  1.00000  1861G   987G   874G 53.04 0.82 217
 2 1.84999  1.00000  1861G  1252G   608G 67.31 1.04 259
 4 1.84999  1.00000  1861G  1125G   735G 60.47 0.93 252
14 2.73000  0.79999  2792G  2321G   471G 83.13 1.28 489
15 2.73000  0.79999  2792G  2142G   649G 76.73 1.19 474
16 2.73000  0.79999  2792G  2081G   711G 74.52 1.15 460
12 2.73000  0.79999  2792G  2375G   416G 85.08 1.31 494
17 2.73000  0.79999  2792G  1947G   845G 69.72 1.08 462
13 2.73000  0.79999  2792G  2307G   484G 82.64 1.28 499
              TOTAL 61433G 39778G 21655G 64.75
MIN/MAX VAR: 0.75/1.31  STDDEV: 8.65

So indeed, some disks on md005 (which only has 16 TB) are already 85% full.
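
Picking the fullest disks out of that table by eye gets tedious, so here is a small sketch that does it from the plain-text output. The rows are a few lines copied from the table above; the function name and the 80% threshold are arbitrary choices for illustration.

```python
ROWS = """\
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 8 1.84999  1.00000  1861G   902G   959G 48.47 0.75 243
14 2.73000  0.79999  2792G  2321G   471G 83.13 1.28 489
15 2.73000  0.79999  2792G  2142G   649G 76.73 1.19 474
12 2.73000  0.79999  2792G  2375G   416G 85.08 1.31 494
              TOTAL 61433G 39778G 21655G 64.75
"""


def fullest_osds(df_output, threshold_pct):
    hits = []
    for line in df_output.splitlines():
        cols = line.split()
        if len(cols) != 9 or not cols[0].isdigit():
            continue  # header, TOTAL and MIN/MAX lines
        osd_id, pct_used = int(cols[0]), float(cols[6])
        if pct_used >= threshold_pct:
            hits.append((osd_id, pct_used))
    # Fullest first
    return sorted(hits, key=lambda item: item[1], reverse=True)


print(fullest_osds(ROWS, 80.0))  # [(12, 85.08), (14, 83.13)]
```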

Since the 3rd machine has some free disk slots, I decided to move one disk each from md002 and md008 to md005. That reduces the space on md002 and md008 and adds it to md005, so each node will then have about 20 TB. That should increase the maximum capacity we can store from 16 TB to 20 TB and decrease the used percentage from about 84% to 13.4/20 = 67%. Much better already.
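
The capacity reasoning can be written down as a tiny calculation. It is a sketch under the assumption used in this post: three hosts, three replicas, one replica per host, so every host holds a full copy and the smallest host is the limit. The capacities are the rounded TB figures from the text, and the function names are made up.

```python
def usable_capacity_tb(host_capacity_tb):
    # With as many hosts as replicas, each host stores one full copy of the
    # data, so the smallest host determines how much can be stored.
    return min(host_capacity_tb.values())


def fill_ratio(data_tb, host_capacity_tb):
    return data_tb / usable_capacity_tb(host_capacity_tb)


before = {"md002": 22, "md008": 22, "md005": 16}
after = {"md002": 20, "md008": 20, "md005": 20}
data_tb = 13.4

print(f"before the move: {fill_ratio(data_tb, before):.0%}")
print(f"after the move:  {fill_ratio(data_tb, after):.0%}")
```

The total raw capacity is the same in both cases; only the distribution over the hosts changes, and that alone moves the effective fill level from roughly 84% to 67%.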

Doing this was easy:
stop the osd
unmount it
identify the physical disk (map the osd nr to the device id to the virtual disk to the physical disk and set blinking on)
move it physically
import the foreign disk on md005
mount it
start the osd
and repeat for the other osd.

The osd tree then looks like this:
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 60.77974 root default
-2 20.34990     host md002
18  1.84999         osd.18      up  1.00000          1.00000
19  1.84999         osd.19      up  1.00000          1.00000
20  1.84999         osd.20      up  1.00000          1.00000
21  1.84999         osd.21      up  1.00000          1.00000
22  1.84999         osd.22      up  1.00000          1.00000
23  1.84999         osd.23      up  1.00000          1.00000
24  1.84999         osd.24      up  1.00000          1.00000
25  1.84999         osd.25      up  1.00000          1.00000
26  1.84999         osd.26      up  1.00000          1.00000
27  1.84999         osd.27      up  1.00000          1.00000
28  1.84999         osd.28      up  1.00000          1.00000
-3 20.34990     host md008
 0  1.84999         osd.0       up  1.00000          1.00000
 3  1.84999         osd.3       up  1.00000          1.00000
 5  1.84999         osd.5       up  1.00000          1.00000
 7  1.84999         osd.7       up  1.00000          1.00000
10  1.84999         osd.10      up  1.00000          1.00000
 1  1.84999         osd.1       up  1.00000          1.00000
 6  1.84999         osd.6       up  1.00000          1.00000
 8  1.84999         osd.8       up  1.00000          1.00000
 9  1.84999         osd.9       up  1.00000          1.00000
 2  1.84999         osd.2       up  1.00000          1.00000
 4  1.84999         osd.4       up  1.00000          1.00000
-5 20.07994     host md005
14  2.73000         osd.14      up  0.79999          1.00000
15  2.73000         osd.15      up  0.79999          1.00000
16  2.73000         osd.16      up  0.79999          1.00000
12  2.73000         osd.12      up  0.79999          1.00000
17  2.73000         osd.17      up  0.79999          1.00000
13  2.73000         osd.13      up  0.79999          1.00000
11  1.84999         osd.11      up  1.00000          1.00000
29  1.84998         osd.29      up  1.00000          1.00000

Since two disks have now changed hosts and some PGs may now be stored on md005 twice, lots of data will need to move and everything will rebalance. This is what that looks like:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            13 pgs backfill_toofull
            214 pgs backfill_wait
            71 pgs backfilling
            738 pgs degraded
            52 pgs recovering
            610 pgs recovery_wait
            876 pgs stuck unclean
            76 pgs undersized
            recovery 774247/11332909 objects degraded (6.832%)
            recovery 2100222/11332909 objects misplaced (18.532%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18834, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e105775: 30 osds: 30 up, 30 in; 292 remapped pgs
      pgmap v13024840: 2880 pgs, 5 pools, 13490 GB data, 3380 kobjects
            41515 GB used, 19918 GB / 61433 GB avail
            774247/11332909 objects degraded (6.832%)
            2100222/11332909 objects misplaced (18.532%)
                1933 active+clean
                 603 active+recovery_wait+degraded
                 171 active+remapped+wait_backfill
                  52 active+recovering+degraded
                  45 active+undersized+degraded+remapped+backfilling
                  30 active+undersized+degraded+remapped+wait_backfill
                  26 active+remapped+backfilling
                  12 active+remapped+wait_backfill+backfill_toofull
                   7 active+recovery_wait+degraded+remapped
                   1 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
recovery io 351 MB/s, 88 objects/s
  client io 1614 kB/s rd, 158 kB/s wr, 0 op/s rd, 0 op/s wr


During the recovery:
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
18 1.84999  1.00000  1861G  1311G   549G 70.47 1.04 257
19 1.84999  1.00000  1861G  1189G   671G 63.92 0.94 268
20 1.84999  1.00000  1861G  1360G   501G 73.06 1.08 307
21 1.84999  1.00000  1861G  1217G   644G 65.39 0.96 279
22 1.84999  1.00000  1861G  1254G   606G 67.41 0.99 263
23 1.84999  1.00000  1861G  1184G   676G 63.64 0.94 262
24 1.84999  1.00000  1861G  1245G   616G 66.90 0.99 264
25 1.84999  1.00000  1861G  1292G   568G 69.44 1.02 285
26 1.84999  1.00000  1861G  1239G   621G 66.61 0.98 271
27 1.84999  1.00000  1861G  1210G   651G 65.01 0.96 259
28 1.84999  1.00000  1861G  1204G   656G 64.73 0.95 259
 0 1.84999  1.00000  1861G  1218G   643G 65.43 0.96 272
 3 1.84999  1.00000  1861G  1177G   684G 63.25 0.93 262
 5 1.84999  1.00000  1861G  1175G   685G 63.17 0.93 265
 7 1.84999  1.00000  1861G  1270G   591G 68.24 1.01 307
10 1.84999  1.00000  1861G  1162G   699G 62.44 0.92 262
 1 1.84999  1.00000  1861G  1200G   660G 64.51 0.95 274
 6 1.84999  1.00000  1861G  1271G   590G 68.28 1.01 277
 8 1.84999  1.00000  1861G  1024G   836G 55.04 0.81 273
 9 1.84999  1.00000  1861G  1101G   760G 59.16 0.87 250
 2 1.84999  1.00000  1861G  1291G   570G 69.38 1.02 276
 4 1.84999  1.00000  1861G  1167G   694G 62.70 0.92 264
14 2.73000  0.79999  2792G  2363G   428G 84.64 1.25 429
15 2.73000  0.79999  2792G  2189G   603G 78.39 1.15 398
16 2.73000  0.79999  2792G  2169G   622G 77.70 1.14 399
12 2.73000  0.79999  2792G  2388G   404G 85.52 1.26 452
17 2.73000  0.79999  2792G  2072G   720G 74.21 1.09 385
13 2.73000  0.79999  2792G  2321G   471G 83.13 1.22 425
11 1.84999  1.00000  1861G  1071G   790G 57.54 0.85 348
29 1.84998  1.00000  1861G   354G  1506G 19.06 0.28 353
              TOTAL 61433G 41703G 19730G 67.88
MIN/MAX VAR: 0.28/1.26  STDDEV: 11.50


And once it was done, the result was very surprising:
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            13 pgs backfill_toofull
            6 pgs backfilling
            19 pgs stuck unclean
            recovery 120/10824993 objects degraded (0.001%)
            recovery 184018/10824993 objects misplaced (1.700%)
            2 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107178: 30 osds: 30 up, 30 in; 19 remapped pgs
      pgmap v13145938: 2880 pgs, 5 pools, 13936 GB data, 3491 kobjects
            42471 GB used, 18962 GB / 61433 GB avail
            120/10824993 objects degraded (0.001%)
            184018/10824993 objects misplaced (1.700%)
                2857 active+clean
                  13 active+remapped+backfill_toofull
                   6 active+remapped+backfilling
                   4 active+clean+scrubbing

So apparently, despite having a better balance across the hosts, and although we should be able to store more, we actually have more PGs with problems now.
When taking a closer look at the individual OSDs:

$ sudo ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42403G 19030G 69.02 1.00   0 root default
-2 20.34990        - 20477G 14093G  6383G 68.83 1.00   0     host md002
18  1.84999  1.00000  1861G  1250G   611G 67.16 0.97 247         osd.18
19  1.84999  1.00000  1861G  1248G   613G 67.04 0.97 259         osd.19
20  1.84999  1.00000  1861G  1446G   415G 77.68 1.13 301         osd.20
21  1.84999  1.00000  1861G  1246G   615G 66.96 0.97 271         osd.21
22  1.84999  1.00000  1861G  1197G   664G 64.32 0.93 253         osd.22
23  1.84999  1.00000  1861G  1221G   640G 65.61 0.95 256         osd.23
24  1.84999  1.00000  1861G  1312G   548G 70.51 1.02 257         osd.24
25  1.84999  1.00000  1861G  1434G   426G 77.08 1.12 279         osd.25
26  1.84999  1.00000  1861G  1233G   627G 66.27 0.96 261         osd.26
27  1.84999  1.00000  1861G  1215G   645G 65.31 0.95 248         osd.27
28  1.84999  1.00000  1861G  1287G   574G 69.15 1.00 255         osd.28
-3 20.34990        - 20477G 14215G  6261G 69.42 1.01   0     host md008
 0  1.84999  1.00000  1861G  1305G   555G 70.13 1.02 265         osd.0
 3  1.84999  1.00000  1861G  1324G   537G 71.13 1.03 257         osd.3
 5  1.84999  1.00000  1861G  1210G   650G 65.05 0.94 255         osd.5
 7  1.84999  1.00000  1861G  1402G   459G 75.34 1.09 295         osd.7
10  1.84999  1.00000  1861G  1365G   496G 73.33 1.06 254         osd.10
 1  1.84999  1.00000  1861G  1260G   600G 67.73 0.98 267         osd.1
 6  1.84999  1.00000  1861G  1502G   358G 80.73 1.17 272         osd.6
 8  1.84999  1.00000  1861G  1144G   717G 61.45 0.89 269         osd.8
 9  1.84999  1.00000  1861G  1176G   685G 63.18 0.92 240         osd.9
 2  1.84999  1.00000  1861G  1328G   532G 71.38 1.03 266         osd.2
 4  1.84999  1.00000  1861G  1194G   667G 64.17 0.93 251         osd.4
-5 20.07994        - 20478G 14093G  6385G 68.82 1.00   0     host md005
14  2.73000  0.84999  2792G  1861G   931G 66.64 0.97 392         osd.14
15  2.73000  0.79999  2792G  1769G  1022G 63.37 0.92 354         osd.15
16  2.73000  0.79999  2792G  1722G  1070G 61.68 0.89 353         osd.16
12  2.73000  0.79999  2792G  2016G   775G 72.22 1.05 405         osd.12
17  2.73000  0.79999  2792G  1647G  1145G 59.00 0.85 349         osd.17
13  2.73000  0.79999  2792G  1832G   960G 65.60 0.95 378         osd.13
11  1.84999  1.00000  1861G  1627G   233G 87.44 1.27 326         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.83 1.26 345         osd.29
               TOTAL 61433G 42403G 19030G 69.02
MIN/MAX VAR: 0.85/1.27  STDDEV: 6.91

We can see that the disks that got moved to md005 (OSDs 11 and 29) are now almost full, while the other disks in md005, which are much larger, are not used nearly as much. We can also see that the larger disks have a lower reweight, so less data is going there. That is probably no longer needed. For the least-used OSDs in md005 I’ll increase the reweight so more data will go there.

There are 2 ways to reweight:
ceph osd crush reweight
ceph osd reweight

Since our osd reweight is not the default ‘1’, I’ll bring those values closer to 1. Also because that setting is apparently not persistent: when an OSD goes ‘out’, the setting is lost. More information on that can be found here: http://ceph.com/planet/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight/

$ sudo ceph osd reweight 15 0.85
reweighted osd.15 to 0.85 (d999)
$ sudo ceph osd reweight 16 0.85
reweighted osd.16 to 0.85 (d999)
$ sudo ceph osd reweight 17 0.85
reweighted osd.17 to 0.85 (d999)
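
A side note on the "(d999)" in that output and on the 0.79999 and 0.84999 values in the tables. As far as I can tell, the reweight is kept as a fixed-point number where 0x10000 stands for 1.0, and the value in parentheses is that integer in hex. The sketch below is my own reconstruction of that arithmetic, not ceph code, and the function name is made up.

```python
def reweight_fixed_point(weight):
    """Convert a 0..1 reweight into a fixed-point integer (0x10000 == 1.0)."""
    return int(weight * 0x10000)


for requested in (0.80, 0.85, 1.00):
    raw = reweight_fixed_point(requested)
    # requested value -> stored integer in hex -> value as displayed by ceph
    print(f"{requested:.2f} -> {raw:x} -> {raw / 0x10000:.5f}")
```

This prints cccc and 0.79999 for 0.80, d999 and 0.84999 for 0.85, and 10000 and 1.00000 for 1.00, which matches both the command output and the REWEIGHT column.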

I only reweighted the 3 OSDs that have the least data on them, and only increased the weight by 0.05. I did this because I wanted to see the effects before taking the next step, and an increase of 0.05 shouldn’t take too many hours to re-align all the PGs for which the crush map now calculates a different place.

    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            clock skew detected on mon.md010
            16 pgs backfill_toofull
            15 pgs backfill_wait
            22 pgs backfilling
            51 pgs degraded
            23 pgs recovering
            25 pgs recovery_wait
            56 pgs stuck unclean
            recovery 48630/10962135 objects degraded (0.444%)
            recovery 458174/10962135 objects misplaced (4.180%)
            2 near full osd(s)
            Monitor clock skew detected
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18838, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107399: 30 osds: 30 up, 30 in; 51 remapped pgs
      pgmap v13147453: 2880 pgs, 5 pools, 13937 GB data, 3491 kobjects
            42445 GB used, 18988 GB / 61433 GB avail
            48630/10962135 objects degraded (0.444%)
            458174/10962135 objects misplaced (4.180%)
                2778 active+clean
                  25 active+recovery_wait+degraded
                  23 active+recovering+degraded
                  22 active+remapped+backfilling
                  13 active+remapped+backfill_toofull
                  12 active+remapped+wait_backfill
                   3 active+degraded
                   3 active+remapped+wait_backfill+backfill_toofull
                   1 active+remapped
recovery io 361 MB/s, 90 objects/s
  client io 3611 kB/s wr, 0 op/s rd, 3 op/s wr

See? Only a small number of PGs is actually involved. Hopefully this change will result in OSDs 11 and 29 being utilised a bit less and numbers 15 through 17 a bit more.

<some time later>

Indeed it made the situation slightly less bad:
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42538G 18895G 69.24 1.00   0 root default
-2 20.34990        - 20477G 14123G  6353G 68.97 1.00   0     host md002
18  1.84999  1.00000  1861G  1237G   624G 66.47 0.96 246         osd.18
19  1.84999  1.00000  1861G  1243G   617G 66.81 0.96 261         osd.19
20  1.84999  1.00000  1861G  1472G   389G 79.09 1.14 303         osd.20
21  1.84999  1.00000  1861G  1241G   619G 66.70 0.96 268         osd.21
22  1.84999  1.00000  1861G  1220G   640G 65.58 0.95 254         osd.22
23  1.84999  1.00000  1861G  1207G   654G 64.85 0.94 252         osd.23
24  1.84999  1.00000  1861G  1327G   533G 71.32 1.03 258         osd.24
25  1.84999  1.00000  1861G  1434G   426G 77.08 1.11 279         osd.25
26  1.84999  1.00000  1861G  1221G   639G 65.64 0.95 262         osd.26
27  1.84999  1.00000  1861G  1228G   632G 66.01 0.95 249         osd.27
28  1.84999  1.00000  1861G  1287G   574G 69.14 1.00 255         osd.28
-3 20.34990        - 20477G 14247G  6230G 69.58 1.00   0     host md008
 0  1.84999  1.00000  1861G  1293G   568G 69.48 1.00 262         osd.0
 3  1.84999  1.00000  1861G  1326G   535G 71.25 1.03 258         osd.3
 5  1.84999  1.00000  1861G  1234G   627G 66.29 0.96 256         osd.5
 7  1.84999  1.00000  1861G  1406G   455G 75.53 1.09 296         osd.7
10  1.84999  1.00000  1861G  1353G   507G 72.73 1.05 254         osd.10
 1  1.84999  1.00000  1861G  1274G   587G 68.46 0.99 267         osd.1
 6  1.84999  1.00000  1861G  1485G   375G 79.80 1.15 269         osd.6
 8  1.84999  1.00000  1861G  1144G   717G 61.46 0.89 269         osd.8
 9  1.84999  1.00000  1861G  1204G   657G 64.69 0.93 242         osd.9
 2  1.84999  1.00000  1861G  1331G   529G 71.54 1.03 266         osd.2
 4  1.84999  1.00000  1861G  1193G   668G 64.09 0.93 250         osd.4
-5 20.07994        - 20478G 14167G  6311G 69.18 1.00   0     host md005
14  2.73000  0.84999  2792G  1961G   831G 70.23 1.01 398         osd.14
15  2.73000  0.84999  2792G  1850G   942G 66.25 0.96 372         osd.15
16  2.73000  0.84999  2792G  1780G  1012G 63.76 0.92 366         osd.16
12  2.73000  0.79999  2792G  1941G   851G 69.51 1.00 392         osd.12
17  2.73000  0.84999  2792G  1669G  1122G 59.79 0.86 354         osd.17
13  2.73000  0.79999  2792G  1769G  1023G 63.36 0.92 369         osd.13
11  1.84999  1.00000  1861G  1601G   260G 86.03 1.24 315         osd.11
29  1.84998  1.00000  1861G  1593G   268G 85.59 1.24 332         osd.29
               TOTAL 61433G 42538G 18895G 69.24
MIN/MAX VAR: 0.86/1.24  STDDEV: 6.47

Usage on OSDs 11 and 29 decreased, and on 15, 16 and 17 it increased. Only by tiny amounts, but the weight change was tiny as well. Let’s make a bigger change:
$ sudo ceph osd reweight 16 0.90
$ sudo ceph osd reweight 17 0.95
$ sudo ceph osd reweight 13 0.90

And the result:

ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42434G 18999G 69.07 1.00   0 root default
-2 20.34990        - 20477G 14098G  6379G 68.85 1.00   0     host md002
18  1.84999  1.00000  1861G  1207G   654G 64.85 0.94 244         osd.18
19  1.84999  1.00000  1861G  1270G   591G 68.25 0.99 265         osd.19
20  1.84999  1.00000  1861G  1459G   401G 78.42 1.14 300         osd.20
21  1.84999  1.00000  1861G  1213G   648G 65.19 0.94 266         osd.21
22  1.84999  1.00000  1861G  1220G   640G 65.59 0.95 252         osd.22
23  1.84999  1.00000  1861G  1213G   647G 65.20 0.94 253         osd.23
24  1.84999  1.00000  1861G  1314G   546G 70.63 1.02 257         osd.24
25  1.84999  1.00000  1861G  1433G   428G 77.00 1.11 278         osd.25
26  1.84999  1.00000  1861G  1225G   636G 65.83 0.95 262         osd.26
27  1.84999  1.00000  1861G  1243G   618G 66.78 0.97 251         osd.27
28  1.84999  1.00000  1861G  1295G   566G 69.60 1.01 254         osd.28
-3 20.34990        - 20477G 14230G  6246G 69.50 1.01   0     host md008
 0  1.84999  1.00000  1861G  1295G   565G 69.61 1.01 260         osd.0
 3  1.84999  1.00000  1861G  1343G   518G 72.17 1.04 257         osd.3
 5  1.84999  1.00000  1861G  1225G   636G 65.82 0.95 257         osd.5
 7  1.84999  1.00000  1861G  1409G   452G 75.71 1.10 298         osd.7
10  1.84999  1.00000  1861G  1339G   522G 71.93 1.04 252         osd.10
 1  1.84999  1.00000  1861G  1269G   592G 68.19 0.99 264         osd.1
 6  1.84999  1.00000  1861G  1496G   364G 80.39 1.16 269         osd.6
 8  1.84999  1.00000  1861G  1145G   715G 61.55 0.89 270         osd.8
 9  1.84999  1.00000  1861G  1204G   657G 64.68 0.94 242         osd.9
 2  1.84999  1.00000  1861G  1330G   531G 71.45 1.03 268         osd.2
 4  1.84999  1.00000  1861G  1171G   690G 62.93 0.91 245         osd.4
-5 20.07994        - 20478G 14105G  6373G 68.88 1.00   0     host md005
14  2.73000  0.84999  2792G  1864G   928G 66.76 0.97 378         osd.14
15  2.73000  0.84999  2792G  1720G  1071G 61.61 0.89 354         osd.15
16  2.73000  0.89999  2792G  1712G  1079G 61.33 0.89 368         osd.16
12  2.73000  0.79999  2792G  1890G   902G 67.69 0.98 376         osd.12
17  2.73000  0.95000  2792G  1870G   921G 66.99 0.97 395         osd.17
13  2.73000  0.89999  2792G  1942G   849G 69.57 1.01 400         osd.13
11  1.84999  1.00000  1861G  1511G   350G 81.18 1.18 301         osd.11
29  1.84998  1.00000  1861G  1592G   269G 85.55 1.24 318         osd.29
               TOTAL 61433G 42434G 18999G 69.07
MIN/MAX VAR: 0.89/1.24  STDDEV: 6.08
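
A side note on the REWEIGHT column: after asking for 0.90 the listing shows 0.89999. That would be consistent with Ceph storing the reweight value as a whole number of 1/65536 steps and rounding down. The sketch below works under that assumption; the function name is made up and is not a Ceph API.

```python
SCALE = 0x10000  # assumption: a reweight of 1.0 is stored as the integer 65536


def stored_reweight(requested):
    """Return the value that would be stored for a requested reweight,
    assuming truncation to a multiple of 1/65536."""
    return int(requested * SCALE) / SCALE


for requested in (0.80, 0.85, 0.90, 0.95):
    print(f"{requested:.2f} -> {stored_reweight(requested):.6f}")
```

Under this assumption 0.90 becomes 0.899994 and 0.95 becomes 0.949997, which show up as 0.89999 and 0.95000 when printed with five decimals, like in the listing.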

So, very slowly, we’re getting there. Some more tweaking and…

    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_WARN
            1 near full osd(s)
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18842, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e107725: 30 osds: 30 up, 30 in
      pgmap v13170442: 2880 pgs, 5 pools, 13944 GB data, 3493 kobjects
            42348 GB used, 19085 GB / 61433 GB avail
                2879 active+clean
                   1 active+clean+scrubbing+deep
  client io 50325 kB/s rd, 13882 kB/s wr, 24 op/s rd, 13 op/s wr

$ sudo ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1 60.77974        - 61433G 42348G 19085G 68.93 1.00   0 root default
-2 20.34990        - 20477G 14096G  6381G 68.84 1.00   0     host md002
18  1.84999  1.00000  1861G  1191G   670G 64.00 0.93 242         osd.18
19  1.84999  1.00000  1861G  1287G   573G 69.18 1.00 266         osd.19
20  1.84999  1.00000  1861G  1433G   428G 77.00 1.12 298         osd.20
21  1.84999  1.00000  1861G  1210G   650G 65.04 0.94 264         osd.21
22  1.84999  1.00000  1861G  1211G   649G 65.10 0.94 253         osd.22
23  1.84999  1.00000  1861G  1242G   619G 66.72 0.97 255         osd.23
24  1.84999  1.00000  1861G  1316G   544G 70.74 1.03 258         osd.24
25  1.84999  1.00000  1861G  1411G   450G 75.82 1.10 277         osd.25
26  1.84999  1.00000  1861G  1235G   626G 66.35 0.96 261         osd.26
27  1.84999  1.00000  1861G  1258G   602G 67.62 0.98 252         osd.27
28  1.84999  1.00000  1861G  1296G   565G 69.63 1.01 254         osd.28
-3 20.34990        - 20477G 14193G  6284G 69.31 1.01   0     host md008
 0  1.84999  1.00000  1861G  1298G   563G 69.76 1.01 260         osd.0
 3  1.84999  1.00000  1861G  1333G   528G 71.61 1.04 256         osd.3
 5  1.84999  1.00000  1861G  1224G   637G 65.77 0.95 256         osd.5
 7  1.84999  1.00000  1861G  1395G   466G 74.96 1.09 297         osd.7
10  1.84999  1.00000  1861G  1344G   517G 72.20 1.05 254         osd.10
 1  1.84999  1.00000  1861G  1269G   591G 68.22 0.99 264         osd.1
 6  1.84999  1.00000  1861G  1484G   376G 79.76 1.16 268         osd.6
 8  1.84999  1.00000  1861G  1136G   725G 61.02 0.89 270         osd.8
 9  1.84999  1.00000  1861G  1204G   656G 64.71 0.94 242         osd.9
 2  1.84999  1.00000  1861G  1330G   530G 71.48 1.04 268         osd.2
 4  1.84999  1.00000  1861G  1171G   689G 62.95 0.91 245         osd.4
-5 20.07994        - 20478G 14059G  6419G 68.65 1.00   0     host md005
14  2.73000  0.84999  2792G  1826G   966G 65.39 0.95 374         osd.14
15  2.73000  0.89999  2792G  1841G   951G 65.94 0.96 371         osd.15
16  2.73000  0.95000  2792G  1783G  1008G 63.87 0.93 380         osd.16
12  2.73000  0.79999  2792G  1833G   958G 65.66 0.95 368         osd.12
17  2.73000  0.95000  2792G  1769G  1023G 63.36 0.92 381         osd.17
13  2.73000  0.89999  2792G  1903G   889G 68.15 0.99 395         osd.13
11  1.84999  1.00000  1861G  1485G   375G 79.81 1.16 298         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.81 1.26 313         osd.29
               TOTAL 61433G 42348G 19085G 68.93
MIN/MAX VAR: 0.89/1.26  STDDEV: 5.86

So the problem of the 2 PGs that were active+undersized+degraded is now fixed. There is still 1 OSD that’s ‘near full’.
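
Which OSD trips the warning can be read from the %USE column. Below is a small sketch that picks it out of the listing text. It assumes the warning threshold is 85%, which is Ceph's default near-full ratio; the actual setting of this cluster is not shown here. The sample holds only a few rows copied from the listing above, and the function name is made up.

```python
NEARFULL_RATIO = 0.85  # assumption: Ceph's default; this cluster's value is not shown

# A few rows copied from the "ceph osd df tree" listing above.
SAMPLE = """\
-5 20.07994        - 20478G 14059G  6419G 68.65 1.00   0     host md005
 6  1.84999  1.00000  1861G  1484G   376G 79.76 1.16 268         osd.6
11  1.84999  1.00000  1861G  1485G   375G 79.81 1.16 298         osd.11
29  1.84998  1.00000  1861G  1616G   245G 86.81 1.26 313         osd.29
"""


def near_full_osds(df_text, ratio=NEARFULL_RATIO):
    """Return (name, %USE) for every OSD row at or above the given ratio."""
    result = []
    for line in df_text.splitlines():
        fields = line.split()
        # OSD rows have exactly 10 columns and a name like "osd.29";
        # host, root and TOTAL rows are skipped.
        if len(fields) != 10 or not fields[-1].startswith("osd."):
            continue
        pct_used = float(fields[6])  # the %USE column
        if pct_used / 100.0 >= ratio:
            result.append((fields[-1], pct_used))
    return result


print(near_full_osds(SAMPLE))
```

With these rows only osd.29 is reported, which fits the single near full OSD in the health output.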
The manual tuning has been fun, but it can also be done automatically. First a dry run:
$ sudo ceph osd test-reweight-by-utilization
no change
moved 146 / 8640 (1.68981%)
avg 288
stddev 48.1311 -> 50.5021 (expected baseline 16.6853)
min osd.9 with 242 -> 245 pgs (0.840278 -> 0.850694 * mean)
max osd.13 with 395 -> 383 pgs (1.37153 -> 1.32986 * mean)

oload 120
max_change 0.05
max_change_osds 4
average 0.689340
overload 0.827209
osd.29 weight 1.000000 -> 0.950012
osd.17 weight 0.949997 -> 0.999985
osd.16 weight 0.949997 -> 0.999985
osd.14 weight 0.849991 -> 0.896011
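
The proposed weights can be approximated with a little arithmetic. The sketch below is a simplified model of what the command seems to do, not Ceph's actual code: an OSD above the overload threshold (average times oload/100) is scaled down by average/utilisation, an OSD below the average is scaled up the same way, and every change is limited to max_change and capped at 1.0. It ignores the max_change_osds limit and the order in which OSDs are picked. The function name is made up, and the utilisation figures are the %USE values from the listing above.

```python
def proposed_weight(weight, util, average, oload=120, max_change=0.05):
    """Simplified model of the reweight proposal (an assumption, not Ceph's code)."""
    overload = average * oload / 100
    target = weight * average / util
    if util > overload:
        # Overloaded: lower the weight, but by at most max_change.
        return max(target, weight - max_change)
    if util < average and weight < 1.0:
        # Underused: raise the weight, by at most max_change and never above 1.0.
        return min(target, weight + max_change, 1.0)
    return weight


average = 0.689340
overload = average * 120 / 100

# (current weight, utilisation from %USE, weight proposed in the output above)
osds = {
    "osd.29": (1.000000, 0.8681, 0.950012),
    "osd.17": (0.949997, 0.6336, 0.999985),
    "osd.16": (0.949997, 0.6387, 0.999985),
    "osd.14": (0.849991, 0.6539, 0.896011),
}

print(f"overload threshold: {overload:.6f}")
for name, (weight, util, reported) in osds.items():
    model = proposed_weight(weight, util, average)
    print(f"{name}: model {model:.6f}, reported {reported:.6f}")
```

The model lands within a fraction of a percent of the reported values; the small differences are expected because %USE is rounded to two decimals in the listing.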

That looks like a sensible change. Let’s apply it (ceph osd reweight-by-utilization is the same command without the dry-run test- prefix).
And then the result is…
    cluster 6318a6a2-808b-45a1-9c89-31575c58de49
     health HEALTH_OK
     monmap e7: 4 mons at {md002=172.19.20.2:6789/0,md005=172.19.20.5:6789/0,md008=172.19.20.8:6789/0,md010=172.19.20.10:6789/0}
            election epoch 18870, quorum 0,1,2,3 md002,md005,md008,md010
     osdmap e108690: 30 osds: 30 up, 30 in
            flags sortbitwise
      pgmap v13455986: 2880 pgs, 5 pools, 14064 GB data, 3523 kobjects
            42709 GB used, 18724 GB / 61433 GB avail
                2879 active+clean
                   1 active+clean+scrubbing+deep


Looking good, all done!