Friday, June 6, 2014

Elasticsearch multicast debugging

Elasticsearch is a scalable storage and search system for objects; it enables fast searches through vast amounts of data.
I was having trouble with a cluster that would work fine initially but would later lose one or more nodes, ending up in a split-brain situation. This blog article explains how I fixed that.

There are 2 ways to build an elasticsearch cluster:
- unicast
- multicast

With unicast you provide a list of IP addresses or hostnames of each of the nodes that is to be part of the cluster. Each node needs to know at least part of that list in order to find the master node.
With multicast you don't need to configure anything: each node subscribes to a multicast group (224.2.2.4 on port 54328 by default) and listens to what any other node announces to that same group. A node joins the cluster by advertising itself to that multicast address, which all the other nodes have joined. We are using elasticsearch with multicast enabled.
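For reference, the multicast defaults can be made explicit in elasticsearch.yml; roughly like this (setting names as used by the 1.x zen discovery module, the cluster name is just an example):

# /etc/elasticsearch/elasticsearch.yml -- multicast discovery with the defaults spelled out
cluster.name: my_cluster                       # nodes only join a cluster with the same name
discovery.zen.ping.multicast.enabled: true     # multicast discovery is the default
discovery.zen.ping.multicast.group: 224.2.2.4  # default multicast group
discovery.zen.ping.multicast.port: 54328       # default multicast port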

Searching for the terms 'elasticsearch' and 'split-brain' you'll find many people with similar problems. The first thing to note is that there is a setting called discovery.zen.minimum_master_nodes and it is not set by default. Setting it to N/2+1 is a good idea: after a cluster split, only the bigger half (which is what you want) can elect a new master, while the other half won't do anything and simply stops functioning.
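For a three-node cluster N/2+1 works out to 2, so in elasticsearch.yml that would be:

# /etc/elasticsearch/elasticsearch.yml -- require a quorum of 2 out of 3 master-eligible nodes
discovery.zen.minimum_master_nodes: 2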
So I set this setting, but it did not prevent our problem of getting splits in the first place. It only prevented a split from turning into two individual clusters, with the ghastly effect of diverging data and having to wipe the entire cluster and rebuild the indexes. A good thing to prevent, but not what I was looking for.

The next thing was to look at the logs. I knew that our splits mostly happened when we restarted a cluster node: it would not be able to rejoin the cluster after the restart, even though it worked just fine before. Looking at the logs I could see the node come up, not find the other nodes and then just sit there waiting. I turned up the logging by setting 'discovery: TRACE' in /etc/elasticsearch/logging.yml and saw the node sending 'pings' to the multicast address, but it was not receiving anything back. The logs of the other nodes did not show them receiving the pings.
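For completeness, this is roughly what that looks like in /etc/elasticsearch/logging.yml (the logger: section is part of the stock 1.x file):

# /etc/elasticsearch/logging.yml
logger:
  # log every discovery ping and response
  discovery: TRACE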
Using tcpdump I noticed the traffic was indeed being sent out of the network interface, but nothing was received by the other nodes. Ah-ha, it's a network problem! (But let me check for any firewall rules real quick first... nope, no firewall blocking this.)


[@td019 ~]$ sudo tcpdump -i eth0 port 54328 -X -n -vvv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:48:37.209282 IP (tos 0x0, ttl 3, id 0, offset 0, flags [DF], proto UDP (17), length 118)
    1.2.2.9.54328 > 224.2.2.4.54328: [bad udp cksum 57a!] UDP, length 90
        0x0000:  4500 0076 0000 4000 0311 d549 ac14 1413  E..v..@....I....
        0x0010:  e002 0204 d438 d438 0062 a2a1 0109 0804  .....8.8.b......
        0x0020:  a7a2 3e00 0000 060d 6367 5f65 735f 636c  ..>.....aa_bb_cl
        0x0030:  7573 7465 7206 426f 6f6d 6572 1677 6c50  uster.Boomer.wlP
        0x0040:  5372 4f72 6751 3571 5452 775a 7034 4371  SrOrgQ5qTRwZp4Cq
        0x0050:  7961 5105 7464 3031 390c 3137 322e 3230  yaQ.td019.172.20
        0x0060:  2e32 302e 3139 0001 0004 ac14 1413 0000  .20.19..........
        0x0070:  2454 00a7 a23e                           $T...>


Here I see the traffic leaving the host towards the multicast address, but nothing is received. Why is this happening? Maybe it's a bug in elasticsearch. I upgraded from 1.0.1 to 1.2.1, but it made no difference. Let's make sure it really is the network. I downloaded jgroups (www.jgroups.org) and, using some googled syntax (http://www.tomecode.com/2010/06/12/jboss-5-clustering-in-firewall-environment-or-how-to-test-the-multicast-address-and-port/), ran some tests by joining the same group address and port. Then I started sending test messages from node 1 to 2 and 3, from 2 to 1 and 3, and from 3 to 1 and 2. Guess what: with jgroups I also could not get any messages to or from one of the nodes.
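The jgroups test boils down to starting a receiver on two of the nodes and a sender on the third, all on the elasticsearch multicast group and port. From the jgroups test suite (class names and flags may differ slightly between versions):

# on the receiving nodes
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.2.2.4 -port 54328
# on the sending node; every line you type should show up on all receivers
java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.2.2.4 -port 54328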

These nodes are virtual machines, so I moved one of them to a different host and _boom_, it started working. More proof that it's a network issue. As long as the nodes are connected to the same switch, everything is fine. But when the nodes are on different switches, the multicast traffic is apparently not forwarded. That could have to do with some switches doing IGMP snooping to try and be smart about where to forward multicast traffic. Configuring a static IGMP group in the switches solved the problem.
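What that looks like depends entirely on the switch vendor; on a Cisco-style switch the two common workarounds are roughly the following (illustrative only, the VLAN and interface names are made up):

! option 1: disable IGMP snooping on the VLAN so multicast is flooded to all ports again
no ip igmp snooping vlan 10
! option 2: statically join the elasticsearch multicast group on the ports facing the nodes
ip igmp snooping vlan 10 static 224.2.2.4 interface Gi1/0/1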

We can now use elasticsearch without having to worry about getting split-brains when we restart a node. And when a split does occur, it won't hurt too badly because the smaller part of the cluster won't elect a new master.


Thursday, November 21, 2013

updating a 1.28TB fusion-io duo drive

This update was done on a CentOS 6.4 machine.
First, download all rpms and docs from support.fusionio.com (or if you're on dell hardware, from dell.fusionio.com)
# Read the release notes, note the order for updating.
# Have a look at the current status and version of your drives, make sure they are healthy and ok
fio-status
[root@pd025 fusion_update]# fio-status
Unable to get product information for /dev/fct0:0.
Unable to get format information for /dev/fct0.
Unable to get data volume information for /dev/fct0.
Unable to get system monitor information for /dev/fct0.
Unable to get product information for /dev/fct1:0.
Unable to get format information for /dev/fct1.
Unable to get data volume information for /dev/fct1.
Unable to get system monitor information for /dev/fct1.

Found 2 ioDrives in this system with 1 ioDrive Duo
Fusion-io driver version: 2.3.11 build 183

Adapter: ioDrive Duo
        ioDrive Duo HL SN:xxxxxx
        External Power: NOT connected
        Sufficient power available: Unknown
        Connected ioDimm modules:
          fct0: ioDIMM3 640GB MLC SN:xxxxxx
          fct1: ioDIMM3 640GB MLC SN:xxxxxx

fct0    Status unknown: Driver is in MINIMAL MODE:
                Firmware is out of date. Update firmware.
        ioDIMM3 640GB MLC SN:xxxxxx
        Located in slot 0 Upper of ioDrive Duo SN:xxxxxx
        PCI:07:00.0
        Firmware v5.0.6, rev 101583
        Geometry and capacity information not available.
        Sufficient power available: Unknown
        Internal temperature: 39.4 degC, max 39.4 degC

fct1    Status unknown: Driver is in MINIMAL MODE:
                Firmware is out of date. Update firmware.
        ioDIMM3 640GB MLC SN:xxxxxx
        Located in slot 1 Lower of ioDrive Duo SN:xxxxxx
        PCI:08:00.0
        Firmware v5.0.6, rev 101583
        Geometry and capacity information not available.
        Sufficient power available: Unknown
        Internal temperature: 35.9 degC, max 36.9 degC

[root@pd025 fusion_update]

# unmount the drive
umount /export

# detach the drives
fio-detach /dev/fct0
fio-detach /dev/fct1

# unload the driver
rmmod iomemory_vsl
# if you get a message here saying the module is still in use, make sure that all mountpoints, services, etc. using the drives are unmounted/stopped; the unmount above should normally have taken care of that
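# if it still complains, standard tools like fuser/lsof will show what is holding the mountpoint
fuser -vm /export
lsof /export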

# uninstall the software
rpm -e fio-* iomemory*

# install the new software you downloaded to a local folder
rpm -ivh ./*rpm

# update the firmware
[root@pd025 fusion_update]# fio-update-iodrive -p iodrive_107053.fff


***** Firmware will NOT be updated (detected --pretend) *****


***** Firmware will NOT be updated (detected --pretend) *****

Device ID 0 (/dev/fct0) Updating device firmware from 5.0.6.101583 to 5.0.7.107053
Device ID 1 (/dev/fct1) Updating device firmware from 5.0.6.101583 to 5.0.7.107053

WARNING: DO NOT TURN OFF POWER OR RUN ANY IODRIVE UTILITIES WHILE THE FIRMWARE UPDATE IS IN PROGRESS
  Please wait...this could take a while

Progress
-------------------------


Results
-------------------------
0: Firmware updated successfully
1: Firmware updated successfully


You MUST now reboot this machine before the new firmware will be activated!
[root@pd025 fusion_update]#
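# note: -p is the pretend/dry-run flag (hence the 'detected --pretend' lines above);
# the real flash is presumably the same command without it
fio-update-iodrive iodrive_107053.fff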

# reboot
reboot

# done

Tuesday, August 21, 2012

what spacewalk doesn't do

A month ago I installed spacewalk to replace mrepo, so that we can manage our rpm repositories much more easily. It works nicely: we now have a good overview of which machines are behind on their updates and what those updates are. Patching them is just a few clicks; it saves time.
But it also doesn't do a few things I thought it would:
- no ability to select specific packages, so the include and exclude options you would use in yum.repos.d are not supported
- no ability to promote packages from one repo to another. Our centos-updates repository is synced every night, so if we use it to update our DTAP street we run the risk of deploying newer packages to Production than we did to Development, because the repo has moved on in the meantime. So we have to sync by hand instead.

Friday, July 27, 2012

Dell lifecycle controller

Dell has this great feature in their modern servers, the Unified Server Configurator (what's in a name), USC. Or is it the Lifecycle Controller... I never know which is which. You can get to it through the BIOS and it allows you to update the system firmware. That's a neat trick, because it saves you from needing to install any OS-dependent packages. Especially on linux that can be a lifesaver, because Dell only supports RedHat and SuSE. They make life intentionally hard for CentOS users by putting 'redhat' checks in their code, previously only on /etc/redhat-release but recently also on /etc/issue. So those files need to be modified on CentOS systems for Dell's linux firmware updates to work. Not so with the lifecycle controller.
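(For reference, the old workaround for Dell's DUP packages on CentOS boiled down to masquerading as RHEL, something along these lines; the release string is just an example and should match your version:)

cp /etc/redhat-release /etc/redhat-release.orig
cp /etc/issue /etc/issue.orig
echo 'Red Hat Enterprise Linux Server release 6.4 (Santiago)' > /etc/redhat-release
echo 'Red Hat Enterprise Linux Server release 6.4 (Santiago)' > /etc/issue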
Just press F10 to enter 'system services' (another name for the USC?) in the BIOS screen, and select 'usc settings', 'network settings' to set up the network connection. Then through 'platform update', 'launch platform update', 'ftp server' you end up at a screen where you can configure the ftp server (ftp.dell.com or ftp.euro.dell.com [why don't they use geo-dns?]). And then you can 'test network connection' or just press next.

That's where the fun stopped for me. This server, an r510, has nic1 connected to the san and a pci intel 10gig nic going to the routable network and internet. The update apparantly is only possible over NIC1, even though you are able to configure the USC for a different nic. Such a bummer..

I tried this instead, but that only works for the BIOS. It put me on the right track though. I needed to update the H700 perc controller, so I ended up doing this:
./sasdupie -u -s /usr/share/firmware/dell/dup/pci_firmware_ven_0x1000_dev_0x0079_subven_0x1028_subdev_0x1f17_version_a10/ -o doit.txt
and in the doit.txt it said 'The operation was successful. '
`omreport storage controller` confirmed it:
Firmware Version                              : 12.10.2-0004

done.

Thursday, October 20, 2011

ocz vertex 3 linux performance: promise vs onboard ATI SB700/SB800

I finally bought myself an SSD. I've been using spinning rust forever, and recently we started using SSDs for the servers at work. I just had to go and buy one for myself. The windows boot time was starting to really annoy me.
At €99,- for 60GB it's not cheap, but it should be fast. The box says up to 535MB/sec read and 490MB/sec write. I doubt that'll ever be achieved in real-world situations. Of course I do want to try :)

So first I hooked it up to one of the ports of my promise sata 300tx4 controller. This was the hdparm result:
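(The figures below are hdparm's standard read-timing test; the invocation was presumably something like this:)

hdparm -tT /dev/sdl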

/dev/sdl:
 Timing cached reads:   2514 MB in  2.00 seconds = 1257.10 MB/sec
 Timing buffered disk reads: 274 MB in  3.00 seconds =  91.33 MB/sec

That's... not any faster than my hard disks... :(


I couldn't believe it. So I moved the SSD over to my onboard ati sata controller, and hdparm did something different:
/dev/sdm:
 Timing cached reads:   2830 MB in  2.00 seconds = 1414.96 MB/sec
 Timing buffered disk reads: 1074 MB in  3.00 seconds = 357.86 MB/sec

That's better! Not quite there yet, but at least now I get the idea I spent my money wisely.

Especially compared to my OS disk, and a newer 2TB disk I bought:

/dev/hda: (pata disk)
 Timing cached reads:   2516 MB in  2.00 seconds = 1257.80 MB/sec
 Timing buffered disk reads: 166 MB in  3.02 seconds =  54.90 MB/sec

/dev/disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D:
 Timing cached reads:   2222 MB in  2.00 seconds = 1111.31 MB/sec
 Timing buffered disk reads: 300 MB in  3.02 seconds =  99.50 MB/sec

I find this huge performance difference between my 2 sata controllers a bit disconcerting. Especially since the promise controller is a sata-II capable device (300MB/sec) but not even achieving sata-I speeds (150MB/sec). At the same time my onboard sata controller is sata-III capable (600MB/sec) and not achieving that either, although it does get a little bit above the sata-II spec. What could cause this?
One limiting factor is the PCI bus that the promise card is on, but that still allows 266MB/sec and I was only seeing 91MB/sec, so that's not the bottleneck. The sata cables, according to the sata wikipedia page, should all handle up to sata-III. Maybe the rest of the system is too slow to keep up? Let's do an easier test:

server ~ # mount /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC /mnt/ssd
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 41.3909 s, 247 MB/s
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 30.1224 s, 340 MB/s
server ~ #

While doing this test the io utilisation ('iostat -x 1 sdm') was at 40-60% when writing and 75% when reading. So the device could handle more; it was just waiting for the system. In fact, taking that 247MB/sec and correcting for the io utilisation, the device indeed seems to be able to handle about 500MB/sec. Wow.
So why isn't the device being 100% utilised? Looking at 'top' while doing these tests shows one of the 2 cores 100% busy when writing to the device, but only 50% busy when reading.
During reads: 40% waiting and 60% system time, 0% idle.
Increasing the blocksize to 1MB increases the read speed to 360MB/sec.
Switching to direct io and nonblocking io, the speed increases further:
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000 bs=1048576  iflag=direct,nonblock
9765+1 records in
9765+1 records out
10240000000 bytes (10 GB) copied, 24.5059 s, 418 MB/s

and writing:
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=5000 bs=1048576 oflag=direct,nonblock
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 14.8777 s, 352 MB/s

75% io wait, 23% system time. 75% utilisation

So using a larger blocksize with nonblocking, direct reads and writes, the throughput increases slightly. But we're still not where we should be.

Let's cut ext4 out of the loop:
server ~ # dd if=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC of=/dev/null count=20000000 bs=1048576 iflag=direct,nonblock
57241+1 records in
57241+1 records out
60022480896 bytes (60 GB) copied, 136.918 s, 438 MB/s
server ~ #
This results in 90% utilisation and 87% CPU waiting on io, with 12% CPU system time.

write:
server ~ # dd if=/dev/zero of=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC count=20000 bs=1048576 oflag=direct,nonblock skip=2
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB) copied, 58.2396 s, 360 MB/s
server ~ #
80% util, 75% wait, 25% system

Based on all of this I think the performance of this SSD is CPU bound.
The difference between the promise and the onboard sata controller may be related to AHCI: the stock kernel driver for promise doesn't support AHCI.
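(A quick way to see which kernel driver each controller is using:)

lspci -k | grep -i -A 3 sata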

Now that I'm done, I'm cleaning up the SSD before handing it over to my windows pc:
i=0; while [ $i -lt 117231408 ]; do echo $i:64000; i=$(((i+64000))); done  | hdparm --trim-sector-ranges-stdin --please-destroy-my-drive /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC
This makes sure that all data is 'trimmed', i.e. removed without writing to the device. This keeps the SSD fast, because a write to a block that already contains data is much slower than a write to a block that is empty, aka trimmed.
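(Whether a drive supports TRIM at all can be checked with hdparm's identify output:)

hdparm -I /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC | grep -i trim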

This is what bonnie has to say:
[bonnie++ 1.96 results, 46576M test size, three runs per device (hda-pata, zfs-2, ssd-promise, ssd-ati), covering Sequential Output (per-char, block, rewrite), Sequential Input (per-char, block), Random Seeks and Sequential/Random Create, with throughput in K/sec, %CPU and latency figures.]

As you can see, the SSD is much, much faster than any disk. ZFS is reaaaalllly slow, but that's because I'm running it through fuse so all IO needs to go through userspace. But that's ok, I don't need speed on those disks. Only safety. And once zfsonlinux gets around to finishing their code it should speed up tremendously.

If I get to buy a faster PC I'll update this doc with new measurements.