Thursday, October 20, 2011

ocz vertex 3 linux performance: promise vs onboard ATI SB700/SB800

I finally bought myself an SSD. I've been using spinning rust forever, and since we recently started using SSDs for the servers at work, I just had to go and buy one for myself. The Windows boot time was starting to really annoy me.
At €99,- for 60GB it's not cheap, but it should be fast. The box says up to 535MB/sec read and 490MB/sec write. I doubt that'll ever be achieved in real-world situations. Of course I do want to try :)

So first I hooked it up to one of the ports of my Promise SATA300 TX4 controller. This was the hdparm result:

/dev/sdl:
 Timing cached reads:   2514 MB in  2.00 seconds = 1257.10 MB/sec
 Timing buffered disk reads: 274 MB in  3.00 seconds =  91.33 MB/sec
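(For the record, these timings come from hdparm's built-in read benchmark; the invocation is simply the line below, where -T measures cached reads and -t buffered reads from the disk itself.)

server ~ # hdparm -tT /dev/sdl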

That's... not any faster than my harddisks... :(


I couldn't believe it. So I moved the SSD over to the onboard ATI SATA controller, and hdparm showed something quite different:
/dev/sdm:
 Timing cached reads:   2830 MB in  2.00 seconds = 1414.96 MB/sec
 Timing buffered disk reads: 1074 MB in  3.00 seconds = 357.86 MB/sec

That's better! Not quite there yet, but at least now I get the idea I spent my money wisely.

Especially compared to my OS disk, and a newer 2TB disk I bought:

/dev/hda: (pata disk)
 Timing cached reads:   2516 MB in  2.00 seconds = 1257.80 MB/sec
 Timing buffered disk reads: 166 MB in  3.02 seconds =  54.90 MB/sec

/dev/disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D:
 Timing cached reads:   2222 MB in  2.00 seconds = 1111.31 MB/sec
 Timing buffered disk reads: 300 MB in  3.02 seconds =  99.50 MB/sec

I find the huge performance difference between my two SATA controllers a bit disconcerting, especially since the Promise controller is a SATA-II capable device (300MB/sec) yet doesn't even reach SATA-I speeds (150MB/sec). At the same time the onboard controller is SATA-III capable (600MB/sec) and doesn't reach that either, although it does get a little above the SATA-II spec. What could cause this?
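Before blaming the controllers themselves, one quick sanity check is the negotiated link speed: libata logs a line per port at boot, something like 'SATA link up 3.0 Gbps', so (assuming the boot messages are still in the dmesg buffer) this shows what each port actually negotiated:

server ~ # dmesg | grep -i 'SATA link up'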
As for possible limiting factors: the Promise card sits on the PCI bus, but that tops out at 266MB/sec, well above what I'm measuring, so that's not the bottleneck. The SATA cables, according to the SATA Wikipedia page, should all be good up to SATA-III. Maybe the rest of the system is too slow to keep up? Let's do an easier test:

server ~ # mount /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC /mnt/ssd
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 41.3909 s, 247 MB/s
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 30.1224 s, 340 MB/s
server ~ #

While doing this test the IO utilisation ('iostat -x 1 sdm') was at 40-60% when writing and 75% when reading. So the device could handle more; it was just waiting for the system. In fact, taking that 247MB/sec write speed and correcting for the roughly 50% IO utilisation (247 / 0.5 ≈ 494), the device indeed seems to be able to handle about 500MB/sec. Wow.
So why isn't the device being utilised 100%? Looking at 'top' while running these tests shows one of the 2 cores 100% busy when writing to the device, but only 50% busy when reading.
During the read the CPU breakdown is 40% waiting on IO and 60% system time, with 0% idle.
Increasing the blocksize to 1MB increases the read speed to 360MB/sec, and switching to direct and non-blocking IO increases it further:
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000 bs=1048576  iflag=direct,nonblock
9765+1 records in
9765+1 records out
10240000000 bytes (10 GB) copied, 24.5059 s, 418 MB/s

and writing:
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=5000 bs=1048576 oflag=direct,nonblock
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 14.8777 s, 352 MB/s

That's 75% IO wait and 23% system time, with the device at 75% utilisation.

So with a larger blocksize and direct, non-blocking IO the throughput increases a bit further, but we're still not where we should be.

Let's cut ext4 out of the loop:
server ~ # dd if=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC of=/dev/null count=20000000 bs=1048576 iflag=direct,nonblock
57241+1 records in
57241+1 records out
60022480896 bytes (60 GB) copied, 136.918 s, 438 MB/s
server ~ #
This results in 90% utilisation, with the CPU 87% waiting on IO and 12% in system time.

write:
server ~ # dd if=/dev/zero of=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC count=20000 bs=1048576 oflag=direct,nonblock skip=2
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB) copied, 58.2396 s, 360 MB/s
server ~ #
That's 80% utilisation, 75% IO wait and 25% system time.

Based on all of this I think the performance of this SSD is CPU bound on my machine.
The difference between the Promise and the onboard SATA controller may be related to AHCI: the stock kernel driver for the Promise card doesn't support AHCI.
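A quick way to check that theory is to see which kernel driver each controller is bound to (a rough check; the grep pattern may need tweaking on other hardware):

server ~ # lspci -k | grep -i -A 3 sata

The onboard SB700/SB800 ports should show 'Kernel driver in use: ahci' when the BIOS has them in AHCI mode, while the Promise card is handled by sata_promise.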

Now that I'm done, I'm cleaning up the SSD before handing it over to my Windows PC:
i=0; while [ $i -lt 117231408 ]; do echo $i:64000; i=$((i+64000)); done | hdparm --trim-sector-ranges-stdin --please-destroy-my-drive /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC
This makes sure that all data is 'trimmed', i.e. discarded without writing to the device. That keeps the SSD fast, because a write to a block that already contains data is much slower than a write to a block that is empty, i.e. trimmed.
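To spot-check that the trim actually took: many drives return zeroes for trimmed sectors, so if this one does, reading back the start of the device should show nothing but zero bytes. Something along these lines:

server ~ # dd if=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC bs=1M count=16 2>/dev/null | hexdump -C | head

hexdump collapses runs of identical lines into a single '*', so a fully trimmed range shows up as just a couple of lines of output.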

This is what bonnie++ has to say:
[bonnie++ 1.96 results, three runs each for hda-pata, zfs-2, ssd-promise and ssd-ati: sequential output (per-char, block, rewrite), sequential input (per-char, block), random seeks, and sequential/random file create, read and delete rates, each with %CPU and latency figures.]
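For anyone wanting to reproduce numbers like these: it's one bonnie++ run per target directory, roughly the line below (the -m label becomes the row name, -u is required when running as root, and the path is just an example); the HTML table then comes from feeding the CSV line at the end of the output through the bundled bon_csv2html script.

server ~ # bonnie++ -d /mnt/ssd -m ssd-ati -u nobody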

As you can see, the SSD is much, much faster than any disk. ZFS is reaaaalllly slow, but that's because I'm running it through fuse so all IO needs to go through userspace. But that's ok, I don't need speed on those disks. Only safety. And once zfsonlinux gets around to finishing their code it should speed up tremendously.

If I get to buy a faster PC I'll update this doc with new measurements.

Tuesday, October 18, 2011

scrubbing bits under linux; making sure your data is safe

Let's say you've got a couple of photos you'd like to keep. Actually, as the years have gone by, it's more like a few hundred, perhaps a few thousand photos you've collected and would really hate to lose. And let's assume you've lost data before, due to a harddisk crash or otherwise. You swore never to be troubled by that problem again, so you set up a mirror, i.e. raid-1, on your computer.
And let's assume you use Linux to manage that. Windows could do it too, but in a minute I'll show you why Windows is no good for a mirror.

It all has to do with bit rot. You see, data on your disks isn't safe. It's not just that disks can die and you won't be able to access them; the data on your disks slowly rots, and 1's can turn into 0's, and 0's into 1's. That makes your data corrupt. This happens because these days the bits on harddisks are so tiny that small influences can change them. Possibly cosmic rays. Or a magnet slightly too close to the disk. Perhaps the magnetic field of one of the computer fans. Or, more likely, a small patch of magnetic substrate that was just a bit off at production and isn't able to hold its contents for more than a month. Whatever the reason, the fact is that data slowly rots.

Now you have a mirror. A raid set. Raid-1 even. You're not safe, but you are a bit safer. What you could do is have your computer run a check on the 2 disks, to make sure that the contents of the disks are still exactly the same. You could have it run every night. This is called a 'data scrub'.

And then one night, maybe a year from now, your computer will beep and state that it has found a difference. Perhaps it's only one bit. But what now? Which of the disks is right? Disk A says it's a 0, disk B says it's a 1. If you're lucky you have backups and you can restore the file. If not... it's anyone's guess. Oh, did I mention that backups can suffer from the same problem? This is not limited to harddisks: CDs rot too, and so do tapes, floppies, ZIP disks, Jaz disks, etc.

Now, please, humour me. Do a Google search for 'windows data scrub' and one for 'linux data scrub'. Notice any difference? There are no data scrubbing tools for Windows, but there are for Linux. The reason is that when Windows finds a difference, it will use the 'master' disk as the source to write to the 'slave' disk, even when the master is wrong.

So under Windows, having a mirror is not much help. You won't even notice that there was a problem. Under Linux you can have the scrub tell you there was a problem, and you can decide to restore the file from backup.

If you are using raid-5 instead of raid-1 you have the added benefit of being able to repair the bad data. Read here on how to set up scrubbing for md-raid software raid setups under Linux.
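For md-raid the scrub itself is just a 'check' request to the array; a minimal sketch, assuming the mirror is /dev/md0 (put the first line in a nightly cron job to automate it):

server ~ # echo check > /sys/block/md0/md/sync_action
server ~ # cat /proc/mdstat
server ~ # cat /sys/block/md0/md/mismatch_cnt

/proc/mdstat shows the check progressing, and a non-zero mismatch_cnt afterwards means the halves of the mirror disagreed somewhere.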

And if you're really smart you're using a checksumming filesystem such as ZFS or BTRFS. These calculate a checksum of the data, which lets the OS figure out which of the two disks has the correct data and overwrite the bad copy with the good one. Automatically. Problem fixed.
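Kicking off a scrub there is a one-liner; a sketch, assuming a ZFS pool named 'tank' and a BTRFS filesystem mounted on /data:

server ~ # zpool scrub tank
server ~ # zpool status tank
server ~ # btrfs scrub start /data
server ~ # btrfs scrub status /data

The status commands report progress and any checksum errors that were found and repaired along the way.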

If you are not scrubbing yet, you don't value your photos enough.