Wednesday, October 1, 2014

measurements while building the O2 objective headphone amp

While building the (amazing) O2 objective amp and DAC I did some measurements before powering up the unit. The guides strongly advise this because it lets you detect problems and mistakes before applying power, hopefully preventing costly burnouts or worse, explosions.
Unfortunately most of the measurements I made were way off from the values listed in the guides. Here is what I measured:
R1 & R2: 476 (guide says 100-220)
R5: 191k (guide says 100k)
R15/18/10/11: 240 (guide says 1)
R16/22: 1.55k (guide says 1.3k)
R7/14/3/20: 500 (guide says 100-300)
(all values in ohms)

The measurement device I used is a somewhat expensive Voltcraft digital multimeter with a high sampling rate. My suspicion is that this particular multimeter injects its test current while measuring ohms and takes the next sample before the circuit has had time to settle; combined with the capacitance in the circuit, that could result in higher resistance readings. Unfortunately I did not have a different (digital) multimeter available to cross-check with. After checking and triple-checking the placement of the components and the solder joints I decided to try the unit anyway.
And it works beautifully!

I compared it to all the amplifiers and DACs I have, and for its low price it was only matched by equipment costing four times as much or more. I find that to be very good value for money.

I hope this post helps if you ever build your own O2.

Friday, July 4, 2014

migrating backuppc from ext4 to glusterfs with xfs

The servers at the company I currently work for are backed up using BackupPC. It does its job nicely, the interface is reasonable and we hardly ever have problems with it. But as time went by the number of servers grew, and so did the number of files. We now have over 500,000,000 files in the BackupPC pool and the backup totalled over 13TB, although after some cleanup we are now at around 11TB. When we hit 13TB our monitoring system went off, because we had created the filesystem at around 14TB. So we set out to expand it, only to discover that we couldn't do much. The maximum size of an ext4 filesystem with 4k blocks is 16TiB, which is 16 × 2^40 bytes, or about 17.6TB. The backups are stored on a 10-disk RAID-6 volume of 19.8TB (18TiB). That means we are wasting about 2TB of space by using ext4, and we couldn't expand our backup volume beyond that at all. That was problem 1.
We had 2 backup servers, one on-site and one off-site. In case something happens in the primary datacenter, such as a fire, we would still have a reasonably recent backup available to restore to new servers, the cloud or whatever. All data from the primary backup server is synced to the remote machine using rsync over ssh, across our office internet connection. That connection used to be a cable modem capable of doing around 60Mbit. Yet we found out that we had a nightly change set of about 2TB worth of data. Syncing that across a 60Mbit connection takes the better part of 4 days (2TB is 16 terabits, which is roughly 3 days of transfer at a constant 60Mbit, and we could never sustain that). The connection is also shared with our employees, who need to do things like Skype calls, browse the web and send email. That's problem 2.
We noticed that even though there was 60Mbit to go around, we were not achieving that throughput even when we wanted to. Analysis in our monitoring system showed that the data source was using 100% of the available disk IOs. The reason is that we were using rsync to transfer the data between the source and the destination. Rsync is a great tool to sync files and even full directory trees, because it calculates deltas and only transfers the changes. But doing this over ssh (the default these days) for a directory tree with over 500 million files is crazy. That wouldn't be too bad by itself, but since BackupPC uses hardlinks to save space, rsync's memory usage would explode. Eventually the system would run out of memory and kill rsync just to stay alive, after rsync had consumed every resource the machine had. That's problem 3. We added memory to try and fix this.
The extra memory did not improve the speed though, because calculating the MD5 hash for many small files is expensive. And many small files cause many small IOs, which can't all be grouped together by the OS or the RAID controller. The write penalty of RAID-6 hurt a lot as well. So there we have problem 4: far too many IOPS were needed.
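For reference, the nightly offsite transfer was essentially a single rsync invocation along these lines (the hostname and paths are illustrative, not our exact layout); the -H flag, which preserves BackupPC's hardlinks, is what forces rsync to keep an entry in memory for every linked inode it encounters:

  # sync the entire BackupPC pool to the offsite machine, preserving hardlinks
  rsync -aH --delete --numeric-ids -e ssh \
      /var/lib/backuppc/ backup-offsite:/var/lib/backuppc/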

We first fixed problem 2, by upgrading our connection to the datacenter to a 500Mbit fiber connection. That increased the available bandwidth enough to send the full 2TB change set within a day (at a constant 500Mbit it takes roughly 9 hours). We could even sync the full server contents within a day.

Problems 1 and 4 were harder to fix. To fix problem 4 we need to switch to block-based backups/transfers, which allow the IO to be sequential and to be done in much larger blocks, drastically improving throughput. But problem 1 is more urgent: we must have more space to store the backups.

We ordered a server with 48 disk slots and created a ZFS filesystem on it. We only filled half of the slots with disks, 4TB each, and created a 50TB ZFS volume. This we host in the offsite location. ZFS gives us somewhat slower reads and writes, but with this many disks the throughput will still be much higher than what we have now, and much higher than what we can feed the system across the fiber link, even if we were to upgrade the connection to more bandwidth. This allows us to store a number of block-based backups from the datacenter, so that if an accident in the datacenter backups gets synced to the offsite location, we can still revert to an older version of the backup.
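As a rough sketch of what building such a pool looks like (the device names and the raidz2 layout are assumptions, not our exact configuration), the 24 disks are grouped into redundant vdevs and topped with a compressed dataset:

  # 24 x 4TB disks grouped into three 8-disk raidz2 vdevs
  zpool create backup \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh \
    raidz2 sdi sdj sdk sdl sdm sdn sdo sdp \
    raidz2 sdq sdr sds sdt sdu sdv sdw sdx
  # a compressed dataset to hold the block-based backups from the datacenter
  zfs create -o compression=lz4 backup/offsite

When the remaining 24 slots get filled, the pool simply grows by adding more vdevs with 'zpool add'.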
The new ZFS machine also allows us to take the existing off-site backup server out of production and move it to the datacenter, where we can set it up as a GlusterFS brick. Since GlusterFS uses XFS as its brick filesystem, it allows up to 16 exbibytes of data. I hope we reach that much data one day, but it will probably be many years before we do ;) Using GlusterFS also allows us to scale our storage needs in the datacenter much more easily, both performance-wise and size-wise. We can just add extra 'bricks' when we need to, as sketched below.
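As a sketch of what that looks like (hostnames, volume name and brick paths are made up), creating the volume and growing it later are both one-liners:

  # create a distributed volume out of two bricks
  gluster volume create backuppool backup01:/bricks/brick1 backup02:/bricks/brick1
  gluster volume start backuppool
  # later: scale out by adding another brick to the same volume
  gluster volume add-brick backuppool backup03:/bricks/brick1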
Unfortunately GlusterFS needs a minimum of 2 bricks before it can start, and only 1 brick is available at this time. We need to migrate the data from our existing datacenter backup server to our ZFS machine and verify that it works correctly. Remember the problem (4) with rsync and this many files? The standard way of transferring a BackupPC pool (with BackupPC_tarPCCopy) did not work for us due to errors. Instead we were able to transfer the BackupPC pool using a tool created by a member of the BackupPC community: BackupPC_CopyPcPool.pl

We are now at the point where we will soon mount the copied data that lives on the remote ZFS server on our datacenter backup server, to test whether the pool transferred correctly and is still usable. If that works we can turn the datacenter backup node into a GlusterFS brick, install BackupPC on one of the GlusterFS nodes and mount the cluster volume. We then copy the data from the ZFS machine back to the DC and turn the backups back on. That solves problem 1.
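Mounting the cluster volume on the BackupPC node then comes down to something like this (again with made-up names), using the GlusterFS FUSE client:

  # mount the GlusterFS volume where BackupPC expects its pool
  mount -t glusterfs backup01:/backuppool /var/lib/backuppc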

For problem 4 we are looking into replacing rsync with a block-based backup solution. We tested AppAssure, but it would crash due to the number of files and hardlinks. Dell is working on improving their software, but they say the crashes are mostly caused by our use of ext4. We'll retry AppAssure when we've completed the switch to GlusterFS/XFS.
For the longer term the question arises whether BackupPC is really enterprise-ready. We've found there are no tools to guarantee the consistency of the backup pool and very few tools to fix problems. Work is being done on BackupPC 4, which may or may not improve the situation. For creating the primary backup of our servers we'll also be looking into other open-source solutions such as Bacula and Amanda. Perhaps they are better at this job.

[update 2016-04-24: I've written a followup here: http://shootingbits.blogspot.com/2016/04/why-i-moved-away-from-backuppc-for.html]

Thursday, July 3, 2014

puppet-mongodb and the story of the missing provider

So I found myself setting up a MongoDB cluster with Puppet. It turns out puppetlabs has its own mongodb module, so I started using that. The documentation told me to set up the class from the node definition like so:

  class { '::mongodb::globals':
    manage_package_repo => true,
  } ->
  class { '::mongodb::server':
    auth    => false,
    ensure  => present,
    rest    => true,
    dbpath  => "/e/data/mongodb",
    logpath => "/e/logs/mongodb/mongodb.log",
    bind_ip => ["127.0.0.1", "x.x.x.x"],
  } ->
  mongodb_database { 'feed':
    ensure  => present,
    tries   => 10,
    require => Class['mongodb::server'],
  }

And that installed mongodb-org-server and started it. Great. But every Puppet run would give me this message:
Error: Could not find a suitable provider for mongodb_database


And no matter what I tried, I could not get this fixed. Googling and reading the Puppet docs did not seem to help, and there were no issues or forks of the puppetlabs mongodb module that seemed to address this. In the end I asked the Puppet IRC channel for help, and it quickly turned out that in order to create a database, the module needs the mongo command. That command is provided (by installing the mongodb-org-tools rpm) through the mongodb::client class, which I did not have defined, because the documentation did not say I should. Adding that, the node definition became:

  class { '::mongodb::globals':
    manage_package_repo => true,
  } ->
  class { '::mongodb::server':
    auth    => false,
    ensure  => present,
    rest    => true,
    dbpath  => "/e/data/mongodb",
    logpath => "/e/logs/mongodb/mongodb.log",
    bind_ip => ["127.0.0.1", "x.x.x.x"],
  } ->
  class { 'mongodb::client': } ->
  mongodb_database { 'feed':
    ensure  => present,
    tries   => 10,
    require => Class['mongodb::server'],
  }

And then things started working.

Friday, June 6, 2014

Elasticsearch multicast debugging

Elasticsearch is a scalable storage system for documents that enables fast searches through vast amounts of data.
I was having trouble with a cluster that would work initially, but would later lose one or more nodes and end up in a split-brain situation. This blog post explains how I fixed that.

There are 2 ways to build an elasticsearch cluster:
- unicast
- multicast

With unicast you provide a list of IP addresses or hostnames of the nodes that are to be part of the cluster. Each node needs to know at least part of that list in order to find the master node.
With multicast you don't need to configure anything: each node advertises itself to a multicast group (224.2.2.4 on port 54328 by default) and listens to what the other nodes announce to that same group. We are using elasticsearch with multicast enabled.
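The relevant settings live in /etc/elasticsearch/elasticsearch.yml. As a sketch of the two modes on a 1.x cluster (the extra hostnames in the unicast variant are made up):

  # multicast discovery: the 1.x defaults, shown explicitly
  discovery.zen.ping.multicast.enabled: true
  discovery.zen.ping.multicast.group: 224.2.2.4
  discovery.zen.ping.multicast.port: 54328

  # unicast discovery instead: disable multicast and list (some of) the nodes
  #discovery.zen.ping.multicast.enabled: false
  #discovery.zen.ping.unicast.hosts: ["td019", "td020", "td021"]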

Searching for the terms 'elasticsearch' and 'split-brain' you'll find many people with similar problems. The first thing to note is that there is a setting called discovery.zen.minimum_master_nodes and it is not set by default. Setting it to N/2+1 is a good idea, because it prevents a cluster split from electing a new master unless it holds the bigger half of the cluster (which is what you want); the other half of the cluster won't do anything (it stops functioning).
So I set this setting, but it did not prevent our problem of getting splits in the first place. It only prevented a cluster split from turning into 2 individual clusters, with the ghastly effect of diverging data and having to wipe the entire cluster and rebuild the indexes. A good thing to prevent, but not what I was looking for.
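For our three-node cluster that works out to the following line in elasticsearch.yml:

  # 3 nodes, so N/2+1 = 2: don't elect a master without seeing at least 2 master-eligible nodes
  discovery.zen.minimum_master_nodes: 2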

The next thing was to look at the logs. I knew that our splits mostly happened when we restarted a cluster node: it would not be able to rejoin the cluster after the restart, even though it worked just fine before. Looking in the logs I could see the node coming up, not finding the other nodes and then just waiting there. I turned up the logging by setting 'discovery: TRACE' in /etc/elasticsearch/logging.yml and saw the node sending pings to the multicast address, but it was not receiving anything back. The logs of the other nodes did not show them receiving the pings.
Using tcpdump I noticed the traffic was indeed being sent out of the network interface, but nothing was received by the other nodes. Ah-ha, it's a network problem! (But let me check for any firewall settings real fast first. Nope, no firewall blocking this.)


[@td019 ~]$ sudo tcpdump -i eth0 port 54328 -X -n -vvv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:48:37.209282 IP (tos 0x0, ttl 3, id 0, offset 0, flags [DF], proto UDP (17), length 118)
    1.2.2.9.54328 > 224.2.2.4.54328: [bad udp cksum 57a!] UDP, length 90
        0x0000:  4500 0076 0000 4000 0311 d549 ac14 1413  E..v..@....I....
        0x0010:  e002 0204 d438 d438 0062 a2a1 0109 0804  .....8.8.b......
        0x0020:  a7a2 3e00 0000 060d 6367 5f65 735f 636c  ..>.....aa_bb_cl
        0x0030:  7573 7465 7206 426f 6f6d 6572 1677 6c50  uster.Boomer.wlP
        0x0040:  5372 4f72 6751 3571 5452 775a 7034 4371  SrOrgQ5qTRwZp4Cq
        0x0050:  7961 5105 7464 3031 390c 3137 322e 3230  yaQ.td019.172.20
        0x0060:  2e32 302e 3139 0001 0004 ac14 1413 0000  .20.19..........
        0x0070:  2454 00a7 a23e                           $T...>


Here I see the traffic leaving the host towards the multicast address, but nothing is received. Why is this happening? Maybe it's a bug in elasticsearch. I upgraded from 1.0.1 to 1.2.1, but it made no difference. Let's make sure that it is indeed the network. I downloaded jgroups (www.jgroups.org) and, using some googled syntax (http://www.tomecode.com/2010/06/12/jboss-5-clustering-in-firewall-environment-or-how-to-test-the-multicast-address-and-port/), ran some tests by joining the same group address and port on every node. Then I started sending test messages from node 1 to 2 and 3, from 2 to 1 and 3, and from 3 to 1 and 2. Guess what... with jgroups I could also not get any messages to or from one of the nodes.
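The test itself boils down to running a multicast receiver on every node and a sender on one of them, on the same group and port that elasticsearch uses (the exact jar name depends on the JGroups version you download):

  # on every node: listen on the elasticsearch multicast group
  java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.2.2.4 -port 54328
  # on one node at a time: type messages, they should show up on all receivers
  java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.2.2.4 -port 54328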

These nodes are virtual machines. So I moved the affected virtual machine to a different host and _boom_, it started working. More proof that it's a network issue. As long as the nodes are connected to the same switch, everything is fine. But when the nodes are on different switches, the multicast traffic is apparently not forwarded. That could have to do with some switches doing IGMP snooping to try and be smart about where to send multicast traffic. Configuring a static IGMP group in the switches solved the problem.

We can now use elasticsearch without having to worry about getting split-brains when we restart a node. And when a split does occur, it won't hurt too badly because the smaller part of the cluster won't elect a new master.