Thursday, October 20, 2011

ocz vertex 3 linux performance: promise vs onboard ATI SB700/SB800

I finally bought myself an SSD. I've been using spinning rust forever, and since we recently started using SSDs for the servers at work, I just had to go and buy one for myself. The Windows boot time was starting to really annoy me.
At €99,- for 60GB it's not cheap, but it should be fast. The box says up to 535MB/sec read and 490MB/sec write. I doubt that'll ever be achieved in real-world situations. Of course I do want to try :)

So first I hooked it up to one of the ports of my Promise SATA300 TX4 controller. This was the hdparm result:

/dev/sdl:
 Timing cached reads:   2514 MB in  2.00 seconds = 1257.10 MB/sec
 Timing buffered disk reads: 274 MB in  3.00 seconds =  91.33 MB/sec
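(For the record, these timings come from hdparm's built-in read benchmark; the invocation is simply the line below, where -T measures cached reads and -t buffered reads from the disk itself.)

server ~ # hdparm -tT /dev/sdl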

That's... not any faster than my harddisks... :(


I couldn't believe it. So I moved the SSD over to the onboard ATI SATA controller, and hdparm showed something quite different:
/dev/sdm:
 Timing cached reads:   2830 MB in  2.00 seconds = 1414.96 MB/sec
 Timing buffered disk reads: 1074 MB in  3.00 seconds = 357.86 MB/sec

That's better! Not quite there yet, but at least now I get the idea I spent my money wisely.

Especially compared to my OS disk, and a newer 2TB disk I bought:

/dev/hda: (pata disk)
 Timing cached reads:   2516 MB in  2.00 seconds = 1257.80 MB/sec
 Timing buffered disk reads: 166 MB in  3.02 seconds =  54.90 MB/sec

/dev/disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D:
 Timing cached reads:   2222 MB in  2.00 seconds = 1111.31 MB/sec
 Timing buffered disk reads: 300 MB in  3.02 seconds =  99.50 MB/sec

I find the huge performance difference between my two SATA controllers a bit disconcerting, especially since the Promise controller is a SATA-II capable device (300MB/sec) yet doesn't even reach SATA-I speeds (150MB/sec). At the same time the onboard controller is SATA-III capable (600MB/sec) and doesn't reach that either, although it does get a little above the SATA-II spec. What could cause this?
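Before blaming the controllers themselves, one quick sanity check is the negotiated link speed: libata logs a line per port at boot, something like 'SATA link up 3.0 Gbps', so (assuming the boot messages are still in the dmesg buffer) this shows what each port actually negotiated:

server ~ # dmesg | grep -i 'SATA link up'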
As for possible limiting factors: the Promise card sits on the PCI bus, but that tops out at 266MB/sec, well above what I'm measuring, so that's not the bottleneck. The SATA cables, according to the SATA Wikipedia page, should all be good up to SATA-III. Maybe the rest of the system is too slow to keep up? Let's do an easier test:

server ~ # mount /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC /mnt/ssd
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 41.3909 s, 247 MB/s
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 30.1224 s, 340 MB/s
server ~ #

While doing this test the IO utilisation ('iostat -x 1 sdm') was at 40-60% when writing and 75% when reading. So the device could handle more; it was just waiting for the system. In fact, taking that 247MB/sec write speed and correcting for the roughly 50% IO utilisation (247 / 0.5 ≈ 494), the device indeed seems to be able to handle about 500MB/sec. Wow.
So why isn't the device being utilised 100%? Looking at 'top' while running these tests shows one of the 2 cores 100% busy when writing to the device, but only 50% busy when reading.
During the read the CPU breakdown is 40% waiting on IO and 60% system time, with 0% idle.
Increasing the blocksize to 1MB increases the read speed to 360MB/sec, and switching to direct and non-blocking IO increases it further:
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000 bs=1048576  iflag=direct,nonblock
9765+1 records in
9765+1 records out
10240000000 bytes (10 GB) copied, 24.5059 s, 418 MB/s

and writing:
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=5000 bs=1048576 oflag=direct,nonblock
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 14.8777 s, 352 MB/s

That's 75% IO wait and 23% system time, with the device at 75% utilisation.

So with a larger blocksize and direct, non-blocking IO the throughput increases a bit further, but we're still not where we should be.

Let's cut ext4 out of the loop:
server ~ # dd if=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC of=/dev/null count=20000000 bs=1048576 iflag=direct,nonblock
57241+1 records in
57241+1 records out
60022480896 bytes (60 GB) copied, 136.918 s, 438 MB/s
server ~ #
This results in 90% utilisation, with the CPU 87% waiting on IO and 12% in system time.

write:
server ~ # dd if=/dev/zero of=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC count=20000 bs=1048576 oflag=direct,nonblock skip=2
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB) copied, 58.2396 s, 360 MB/s
server ~ #
That's 80% utilisation, 75% IO wait and 25% system time.

Based on all of this I think the performance of this SSD is CPU bound on my machine.
The difference between the Promise and the onboard SATA controller may be related to AHCI: the stock kernel driver for the Promise card doesn't support AHCI.
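A quick way to check that theory is to see which kernel driver each controller is bound to (a rough check; the grep pattern may need tweaking on other hardware):

server ~ # lspci -k | grep -i -A 3 sata

The onboard SB700/SB800 ports should show 'Kernel driver in use: ahci' when the BIOS has them in AHCI mode, while the Promise card is handled by sata_promise.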

Now that I'm done, I'm cleaning up the SSD before handing it over to my Windows PC:
i=0; while [ $i -lt 117231408 ]; do echo $i:64000; i=$((i+64000)); done | hdparm --trim-sector-ranges-stdin --please-destroy-my-drive /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC
This makes sure that all data is 'trimmed', i.e. discarded without writing to the device. That keeps the SSD fast, because a write to a block that already contains data is much slower than a write to a block that is empty, i.e. trimmed.
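To spot-check that the trim actually took: many drives return zeroes for trimmed sectors, so if this one does, reading back the start of the device should show nothing but zero bytes. Something along these lines:

server ~ # dd if=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC bs=1M count=16 2>/dev/null | hexdump -C | head

hexdump collapses runs of identical lines into a single '*', so a fully trimmed range shows up as just a couple of lines of output.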

This is what bonnie++ has to say:
[bonnie++ 1.96 results, three runs each for hda-pata, zfs-2, ssd-promise and ssd-ati: sequential output (per-char, block, rewrite), sequential input (per-char, block), random seeks, and sequential/random file create, read and delete rates, each with %CPU and latency figures.]
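For anyone wanting to reproduce numbers like these: it's one bonnie++ run per target directory, roughly the line below (the -m label becomes the row name, -u is required when running as root, and the path is just an example); the HTML table then comes from feeding the CSV line at the end of the output through the bundled bon_csv2html script.

server ~ # bonnie++ -d /mnt/ssd -m ssd-ati -u nobody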

As you can see, the SSD is much, much faster than any disk. ZFS is reaaaalllly slow, but that's because I'm running it through fuse so all IO needs to go through userspace. But that's ok, I don't need speed on those disks. Only safety. And once zfsonlinux gets around to finishing their code it should speed up tremendously.

If I get to buy a faster PC I'll update this doc with new measurements.

Tuesday, October 18, 2011

scrubbing bits under linux; making sure your data is safe

Let's say you've got a couple of photos you'd like to keep. Actually, as the years have gone by, it's more like a few hundred, perhaps a few thousand photos you've collected and would really hate to lose. And let's assume you've lost data before, due to a harddisk crash or otherwise. You swore never to be troubled by that problem again, so you set up a mirror, i.e. raid-1, on your computer.
And let's assume you use Linux to manage that. Windows could do it too, but in a minute I'll show you why Windows is no good for a mirror.

It all has to do with bit rot. You see, data on your disks isn't safe. It's not just that disks can die and you won't be able to access them; the data on your disks slowly rots, and 1's can turn into 0's, and 0's into 1's. That makes your data corrupt. This happens because these days the bits on harddisks are so tiny that small influences can change them. Possibly cosmic rays. Or a magnet slightly too close to the disk. Perhaps the magnetic field of one of the computer fans. Or, more likely, a small patch of magnetic substrate that was just a bit off at production and isn't able to hold its contents for more than a month. Whatever the reason, the fact is that data slowly rots.

Now you have a mirror. A raid set. Raid-1 even. You're not safe, but you are a bit safer. What you could do is have your computer run a check on the 2 disks, to make sure that the contents of the disks are still exactly the same. You could have it run every night. This is called a 'data scrub'.

And then one night, maybe a year from now, your computer will beep and state that it has found a difference. Perhaps it's only one bit. But what now? Which of the disks is right? Disk A says it's a 0, disk B says it's a 1. If you're lucky you have backups and you can restore the file. If not... it's anyone's guess. Oh, did I mention that backups can suffer from the same problem? This is not limited to harddisks: CDs rot too, and so do tapes, floppies, ZIP disks, Jaz disks, etc.

Now, please, humour me. Do a Google search for 'windows data scrub' and one for 'linux data scrub'. Notice any difference? There are no data scrubbing tools for Windows, but there are for Linux. The reason is that when Windows finds a difference, it will use the 'master' disk as the source to write to the 'slave' disk, even when the master is wrong.

So under Windows, having a mirror is not much help. You won't even notice that there was a problem. Under Linux you can have the scrub tell you there was a problem, and you can decide to restore the file from backup.

If you are using raid-5 instead of raid-1 you have the added benefit of being able to repair the bad data. Read here on how to set up scrubbing for md-raid software raid setups under Linux.
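For md-raid the scrub itself is just a 'check' request to the array; a minimal sketch, assuming the mirror is /dev/md0 (put the first line in a nightly cron job to automate it):

server ~ # echo check > /sys/block/md0/md/sync_action
server ~ # cat /proc/mdstat
server ~ # cat /sys/block/md0/md/mismatch_cnt

/proc/mdstat shows the check progressing, and a non-zero mismatch_cnt afterwards means the halves of the mirror disagreed somewhere.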

And if you're really smart you're using a checksumming filesystem such as ZFS or BTRFS. These calculate a checksum of the data, which lets the OS figure out which of the two disks has the correct data and overwrite the bad copy with the good one. Automatically. Problem fixed.
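Kicking off a scrub there is a one-liner; a sketch, assuming a ZFS pool named 'tank' and a BTRFS filesystem mounted on /data:

server ~ # zpool scrub tank
server ~ # zpool status tank
server ~ # btrfs scrub start /data
server ~ # btrfs scrub status /data

The status commands report progress and any checksum errors that were found and repaired along the way.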

If you are not scrubbing yet, you don't value your photos enough.