Friday, June 6, 2014

Elasticsearch multicast debugging

Elasticsearch enables fast searches through vast amounts of data. It's a scalable storage system for objects.
I was having trouble with a cluster that would work fine initially, but would later lose one or more nodes and end up in a split-brain situation. This blog article explains how I tracked down and fixed the problem.

There are 2 ways to build an elasticsearch cluster:
- unicast
- multicast

With unicast you provide a list of IP addresses or hostnames of the nodes that are to be part of the cluster. Each node needs to know at least part of that list in order to find the master node.
With multicast you don't need to configure anything: each node subscribes to a multicast group and listens to what any other node announces to that same group. Each node joins the cluster by advertising itself to a multicast address (224.2.2.4 on port 54328 by default) that all the other nodes are listening on. We are using elasticsearch with multicast enabled.
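
For reference, this is roughly what the two variants look like in elasticsearch.yml (a minimal sketch using the 1.x discovery settings; the hostnames are made up):

# /etc/elasticsearch/elasticsearch.yml

# unicast: disable multicast and list (some of) the other nodes explicitly
# discovery.zen.ping.multicast.enabled: false
# discovery.zen.ping.unicast.hosts: ["es-node1", "es-node2", "es-node3"]

# multicast (the default, and what we use): nodes announce themselves to the group
discovery.zen.ping.multicast.enabled: true
discovery.zen.ping.multicast.group: 224.2.2.4
discovery.zen.ping.multicast.port: 54328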

Searching for the terms 'elasticsearch' and 'split-brain' you'll find many people with similar problems. The first thing to note is that there is a setting called discovery.zen.minimum_master_nodes, and it is not set by default. Setting it to N/2+1 is a good idea, because after a split only the half that still holds a majority of the nodes can elect a master (which is what you want), while the other half won't do anything (it stops functioning).
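
With the 3 nodes we have, that works out to the following (a sketch of the relevant line in elasticsearch.yml; adjust for your own cluster size):

# /etc/elasticsearch/elasticsearch.yml
# 3 master-eligible nodes: 3/2+1 = 2
discovery.zen.minimum_master_nodes: 2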
So I set this setting, but it did not prevent our problem of getting splits in the first place. It only prevented a split from turning into 2 individual clusters, with the ghastly effect of the data contents diverging and having to wipe the entire cluster and rebuild the indexes. A good thing to prevent, but it was not what I was looking for.

The next thing was to look at the logs. I know that our splits mostly happened when we restarted a cluster node: it would not be able to rejoin the cluster after the restart, even though it worked just fine before. Looking in the logs I could see the node coming up, not finding the other nodes and then just waiting there. I turned up the logging by setting 'discovery: TRACE' in /etc/elasticsearch/logging.yml and saw the node sending 'pings' to the multicast address, but it was not receiving anything back. The logs of the other nodes did not show them receiving the pings.
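
For completeness, this is where that setting goes (just the relevant part of logging.yml; the rest of the file stays as shipped):

# /etc/elasticsearch/logging.yml
logger:
  discovery: TRACE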
Using tcpdump I noticed the traffic was indeed being sent out of the network interface, but nothing was received by the other nodes. Ah-ha, it's a network problem! (But let me check for any firewall settings real fast first... nope, no firewall blocking this.)
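
The firewall check was simply something like this on each node (assuming iptables on these hosts; your setup may differ):

[@td019 ~]$ sudo iptables -L -n -v    # look for rules dropping UDP to 224.2.2.4 / port 54328
[@td019 ~]$ sudo service iptables status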


[@td019 ~]$ sudo tcpdump -i eth0 port 54328 -X -n -vvv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:48:37.209282 IP (tos 0x0, ttl 3, id 0, offset 0, flags [DF], proto UDP (17), length 118)
    1.2.2.9.54328 > 224.2.2.4.54328: [bad udp cksum 57a!] UDP, length 90
        0x0000:  4500 0076 0000 4000 0311 d549 ac14 1413  E..v..@....I....
        0x0010:  e002 0204 d438 d438 0062 a2a1 0109 0804  .....8.8.b......
        0x0020:  a7a2 3e00 0000 060d 6367 5f65 735f 636c  ..>.....aa_bb_cl
        0x0030:  7573 7465 7206 426f 6f6d 6572 1677 6c50  uster.Boomer.wlP
        0x0040:  5372 4f72 6751 3571 5452 775a 7034 4371  SrOrgQ5qTRwZp4Cq
        0x0050:  7961 5105 7464 3031 390c 3137 322e 3230  yaQ.td019.172.20
        0x0060:  2e32 302e 3139 0001 0004 ac14 1413 0000  .20.19..........
        0x0070:  2454 00a7 a23e                           $T...>


Here I see the traffic leaving the host towards the multicast address, but nothing is received. Why is this happening? Maybe it's a bug in elasticsearch. I upgraded from 1.0.1 to 1.2.1, but it made no difference. Let's make sure that it's indeed the network. I downloaded jgroups (www.jgroups.org) and, using some googled syntax (http://www.tomecode.com/2010/06/12/jboss-5-clustering-in-firewall-environment-or-how-to-test-the-multicast-address-and-port/), I ran some tests by joining the same group address and port on all the nodes. Then I started sending test messages from node 1 to 2 and 3, from 2 to 1 and 3, and from 3 to 1 and 2. Guess what... with jgroups I could also not get any messages from or to one of the nodes.
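
From memory the test looked roughly like this, using the multicast receiver/sender test classes from that article (the jar name and exact class names may differ in your jgroups version, so treat this as a sketch):

# on every node: join the group and print whatever arrives
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.2.2.4 -port 54328

# then on each node in turn: type a line, it should show up on all the receivers
java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.2.2.4 -port 54328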

These nodes are virtual machines, so I moved one of them to a different host and _boom_, it started working. More proof that it's a network issue. As long as the nodes are connected to the same switch, everything is fine. But when the nodes are on different switches, the multicast traffic is apparently not forwarded. That probably has to do with some switches doing IGMP snooping to try and be smart about which ports multicast traffic needs to go to. Configuring a static IGMP group in the switches solved the problem.
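
On Cisco-style switches a static group entry looks something like this (purely illustrative: the VLAN number and interface are made up, and the exact syntax depends on the vendor and model):

! statically join the switch port of each elasticsearch node to the discovery group
ip igmp snooping vlan 10 static 224.2.2.4 interface GigabitEthernet0/1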

We can now use elasticsearch without having to worry about getting split-brains when we restart a node. And when a split does occur, it won't hurt too badly because the smaller part of the cluster won't elect a new master.