Thursday, October 20, 2011

ocz vertex 3 linux performance: promise vs onboard ATI SB700/SB800

I finally bought myself an SSD. I've been using spinning rust forever, and recently we started using SSDs for the servers at work. I just had to go and buy one for myself. The Windows boot time was starting to really annoy me.
At €99,- for 60GB it's not cheap, but it should be fast. The box says up to 535MB/sec read and 490MB/sec write. I doubt that'll ever be achieved in real-world situations. Of course I do want to try :)

So first I hooked it up to one of the ports of my promise sata 300tx4 controller. This was the hdparm result:

/dev/sdl:
 Timing cached reads:   2514 MB in  2.00 seconds = 1257.10 MB/sec
 Timing buffered disk reads: 274 MB in  3.00 seconds =  91.33 MB/sec

That's... not any faster than my harddisks... :(


I couldn't believe it. So I moved the SSD over to my onboard ati sata controller, and hdparm did something different:
/dev/sdm:
 Timing cached reads:   2830 MB in  2.00 seconds = 1414.96 MB/sec
 Timing buffered disk reads: 1074 MB in  3.00 seconds = 357.86 MB/sec

That's better! Not quite there yet, but at least now I get the idea I spent my money wisely.

Especially compared to my OS disk, and a newer 2TB disk I bought:

/dev/hda: (pata disk)
 Timing cached reads:   2516 MB in  2.00 seconds = 1257.80 MB/sec
 Timing buffered disk reads: 166 MB in  3.02 seconds =  54.90 MB/sec

/dev/disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D:
 Timing cached reads:   2222 MB in  2.00 seconds = 1111.31 MB/sec
 Timing buffered disk reads: 300 MB in  3.02 seconds =  99.50 MB/sec

I find this huge performance difference between my 2 SATA controllers a bit disconcerting, especially since the Promise controller is a SATA-II capable device (300MB/sec) but isn't even achieving SATA-I speeds (150MB/sec). At the same time my onboard SATA controller is SATA-III capable (600MB/sec) and doesn't achieve that either, although it does get a little above the SATA-II spec. What could cause this?
One suspect is the PCI bus that the Promise card sits on, but that still allows 266MB/sec, so it's not the bottleneck here. The SATA cables, according to the SATA Wikipedia page, should all handle up to SATA-III. Maybe the rest of the system is too slow to keep up? Let's do an easier test:

server ~ # mount /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC /mnt/ssd
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 41.3909 s, 247 MB/s
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000
20000000+0 records in
20000000+0 records out
10240000000 bytes (10 GB) copied, 30.1224 s, 340 MB/s
server ~ #

While doing this test the I/O utilisation ('iostat -x 1 sdm') was at 40-60% when writing and 75% when reading. So the device could handle more; it was just waiting for the system. In fact, taking that 247MB/sec and correcting for the I/O utilisation, the device indeed seems to be able to handle about 500MB/sec. Wow.
So why isn't the device being utilised 100%? Looking at 'top' while doing these tests shows one of the 2 cores 100% busy when writing to the device, but only 50% busy when reading. During the read that core sits at roughly 40% iowait and 60% system time, with 0% idle.
Increasing the blocksize to 1MB increases the read speed to 360MB/sec.
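The exact command for that run isn't preserved here; it was essentially the same read with a 1MB blocksize, along these lines:
dd if=/mnt/ssd/zero-write of=/dev/null bs=1048576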
Switching to direct I/O and non-blocking I/O, the speed increases further:
server ~ # dd if=/mnt/ssd/zero-write of=/dev/null count=20000000 bs=1048576  iflag=direct,nonblock
9765+1 records in
9765+1 records out
10240000000 bytes (10 GB) copied, 24.5059 s, 418 MB/s

and writing:
server ~ # dd if=/dev/zero of=/mnt/ssd/zero-write count=5000 bs=1048576 oflag=direct,nonblock
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 14.8777 s, 352 MB/s

That run showed 75% iowait and 23% system time, at 75% utilisation.

So using a larger blocksize with non-blocking, direct reads and writes increases the throughput a bit further. But we're still not where we should be.

Let's cut ext4 out of the loop:
server ~ # dd if=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC of=/dev/null count=20000000 bs=1048576 iflag=direct,nonblock
57241+1 records in
57241+1 records out
60022480896 bytes (60 GB) copied, 136.918 s, 438 MB/s
server ~ #
This results in 90% utilisation and 87% CPU waiting on I/O, with 12% CPU system time.

write:
server ~ # dd if=/dev/zero of=/dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC count=20000 bs=1048576 oflag=direct,nonblock skip=2
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB) copied, 58.2396 s, 360 MB/s
server ~ #
80% util, 75% wait, 25% system

Based on all of this I think the performance of this SSD is CPU bound.
The difference between the promise and the onboard sata controller may be related to AHCI: the stock kernel driver for promise doesn't support AHCI.
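
A quick way to see which driver each controller ended up with is to grep the kernel messages, something like this (assuming the modules are named ahci and sata_promise, as in stock kernels):
dmesg | grep -iE 'ahci|sata_promise'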

Now that I'm done, I'm cleaning up the SSD before handing it over to my windows pc:
i=0; while [ $i -lt 117231408 ]; do echo $i:64000; i=$(((i+64000))); done  | hdparm --trim-sector-ranges-stdin --please-destroy-my-drive /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC
This makes sure that all data is 'trimmed', aka removed, without writing to the device. That keeps the SSD fast, because a write to a block that already holds data is much slower than a write to a block that is empty, aka trimmed.
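
To double-check that a drive actually advertises TRIM before firing commands like that at it, hdparm -I should mention it (the exact wording differs per drive and hdparm version):
hdparm -I /dev/disk/by-id/ata-OCZ-VERTEX3_OCZ-797F5Y0407KD83KC | grep -i trim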

This is what bonnie has to say:
[bonnie++ 1.96 results: three runs each of hda-pata, zfs-2, ssd-promise and ssd-ati at a 6576MB test size, covering sequential output (per-char, block, rewrite), sequential input (per-char, block), random seeks, and sequential/random file create/read/delete rates, plus latencies. The original HTML table did not survive the conversion to plain text, so the individual numbers are omitted here.]

As you can see, the SSD is much, much faster than any disk. ZFS is reaaaalllly slow, but that's because I'm running it through fuse so all IO needs to go through userspace. But that's ok, I don't need speed on those disks. Only safety. And once zfsonlinux gets around to finishing their code it should speed up tremendously.

If I get to buy a faster PC I'll update this doc with new measurements.

Tuesday, October 18, 2011

scrubbing bits under linux; making sure your data is safe

Let's say you've got a couple of photos you'd like to keep. Actually, as the years have gone by, it's more like a few hundred, perhaps a few thousand photos you've collected and would really hate to lose. And let's assume you've lost data before, due to a harddisk crash or otherwise. You swore never to be troubled by that problem again, so you set up a mirror (raid-1) on your computer.
And let's assume you use Linux to manage that. Windows could too, but in a minute I'll show you why windows is no good for a mirror.

It all has to do with bit rot. You see, data on your disks isn't safe. It's not just that disks can die and you won't be able to access them. It's that data on your disks slowly rots: 1's can turn into 0's, and 0's can turn into 1's. That makes your data corrupt. This happens because these days the bits on harddisks are so tiny that small influences can change them. Possibly cosmic rays. Or a magnet slightly too close to the disk. Perhaps the magnetic field of one of the computer fans. Or, more likely, a small patch of magnetic substrate that was just a bit off at production and isn't able to hold its contents for more than a month. Whatever the reason, the fact is that data slowly rots.

Now you have a mirror. A raid set. Raid-1 even. You're not safe. But you're a bit safer. What you could do is have your computer run a check on the 2 disks, to make sure that the contents of the disks are still exactly the same. You could have it run every night. This is called a 'data scrub'.

And then one night, maybe a year from now, your computer will beep and state it has found a difference. Perhaps it's only one bit. But what now? Which of the disks is right? Disk A says it's a 0. Disk B says it's a 1. If you're lucky you have backups and you can restore the file. If not... it's anyone's guess. Oh, did I mention that backups can suffer from the same problem? This is not just limited to harddisks. CDs rot too. So do tapes, floppies, Zip and Jaz disks, etc.

Now, please, humour me. Do a google for 'windows data scrub' and one for 'linux data scrub'. Notice any difference? There are no data scrubbing tools for Windows. But there are for Linux. The reason is that when Windows finds a difference, it will use the 'master' disk as the source to write to the 'slave' disk. Even when the master is wrong.

So under windows having a mirror is not much help. You won't even notice that there was a problem. Under linux you can have the scrub tell you there was a problem, and you can decide to do a file restore from backup.

If you are using raid-5 instead of raid-1 you have the added benefit of being able to repair the bad data. Read here on how to set up scrubbing for md-raid software raid setups under Linux.
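
For md-raid the short version is that you write 'check' to the array's sync_action file and look at the mismatch counter afterwards; roughly like this, with md0 as an example array:
echo check > /sys/block/md0/md/sync_action     # start a scrub of md0
cat /proc/mdstat                               # watch the check progress
cat /sys/block/md0/md/mismatch_cnt             # mismatches found, once it finishes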

And if you're really smart you're using a checksumming filesystem, such as ZFS or BTRFS. These calculate a checksum of the data, which lets the OS figure out which of the two disks has the correct data and overwrite the bad data with the good data. Automatically. Problem fixed.
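
With ZFS, scrubbing is a single command per pool, and the status output afterwards tells you whether anything was repaired (the pool name is just an example):
zpool scrub tank
zpool status -v tank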

If you are not scrubbing yet, you don't value your photos enough.

Wednesday, September 28, 2011

ZFS data corruption and CRC checksum errors

It's been a while since I took some time to analyze the following problem:

server ~ # zpool status -v
  pool: mir-2tb
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 9h26m with 1 errors on Mon Sep 26 09:01:29 2011
config:

    NAME                                                       STATE     READ WRITE CKSUM
    mir-2tb                                                    ONLINE       0     0     1
      mirror-0                                                 ONLINE       0     0     2
        disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD  ONLINE       0     0     2
        disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894          ONLINE       0     0     2
        disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /mnt/origin/mir-2tb/diskfailure/sdb1.image.dd



So there is a problem here. What this is saying is that the same block of data is not readable from 3 different disks. And the file sdb1.image.dd experienced data loss because of this.

Every week on Monday the checksum error counters go up. That's because I'm running a scrub every Monday morning. And once the scrub is done, the counter is 2 checksum errors higher for each disk, and one checksum error higher for the pool. So it's not as if more and more sections of data are becoming unreadable.
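
For reference, that weekly scrub boils down to a cron entry along these lines (the exact time and zpool path will differ per system):
# m h dom mon dow  command
0 0 * * 1  /usr/sbin/zpool scrub mir-2tb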


How did this happen? How small is the chance that 3 drives are unreadable for the same specific block of data? It's infinitesimally small. And that's not what's going on here. There are 2 possible explanations:
 - The ZFS pool was created initially with one source disk, which apparently had 2 bad sectors or bad blocks. Data was copied to that pool before there was a 2nd disk to mirror the data. Once the data was copied and the source disk was empty, the source disk was added to the pool as a mirror. But because of the bad blocks/sectors, 2 blocks could not be copied to the other disk. Those 2 sectors on the mirror therefore also contain invalid data. Later a 3rd disk was added, but it has the same problem as the 2nd disk. Because the bad sectors cannot be read, this will not resolve itself. The data is lost.
- There is a very specific type of data that triggers a bug in ZFS where the data cannot be stored or the checksum cannot be calculated correctly. The data is there, but the checksum fails because of this bug. There is no actual data lost.


I can't remember whether I started this pool with the scenario of having 1 disk and adding more later. I no longer have the source file, so I cannot overwrite the data with it. The only option I have is to try and find the block that is broken and force a write to it. That will make the disk overwrite the data, making it readable again, or, if the write fails, force the disk to re-allocate the sector.

However, the disks are reporting no problems at all regarding pending sector re-allocations. The only thing I notice is:


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   055   055   000    Old_age   Always       -       19396
  3 Spin_Up_Time            0x0023   068   066   025    Pre-fail  Always       -       9994
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5924
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       70
181 Program_Fail_Cnt_Total  0x0022   098   098   000    Old_age   Always       -       48738283
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   052   000    Old_age   Always       -       31 (Min/Max 14/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       86
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       70



And that's not very helpful either. There is one counter (Program_Fail_Cnt_Total) that's huge, but some googling turns up that many users of this disk report that counter and it's not a problem for any of them. That doesn't worry me.

I wish that ZFS had some more tools to make this analysis easier. Why does the checksum error occur? Which block exactly is having the problem?

So I'm at a loss here... there is a checksum error, but the disks don't report a problem with the data on them. Is the problem therefore with ZFS, or with one of the disks, and has it (via ZFS) propagated to the other disks? I would be very interested to hear anyone's feedback on this.

choosing disks and controllers for ZFS


I looked at my ZFS setup and realised that I have 2 raid-1 setups, both with the same type of disk:

server ~ # zpool status
  pool: mir-2tb
 state: ONLINE
 scrub: none requested
config:

        NAME                                                       STATE     READ WRITE CKSUM
        mir-2tb                                                    ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD  ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD  ONLINE       0     0     0

errors: No known data errors

  pool: mir-2tb2
 state: ONLINE
 scrub: none requested
config:

        NAME                                                       STATE     READ WRITE CKSUM
        mir-2tb2                                                   ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894          ONLINE       0     0     0
            disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05884          ONLINE       0     0     0

errors: No known data errors
server ~ #

So that's one ZFS mirror with two Hitachi disks and one ZFS mirror with two Samsung disks. That's bad, because if there is a problem with a certain type of disk (such as a firmware problem with Samsung HD204UI's, see here) there is a high chance the whole mirror will fail. ZFS can then detect the problem, but possibly not recover. I quickly bought another Hitachi disk and added it to the Samsung mirror. That allowed me to swap a Samsung disk into the Hitachi mirror pool. So now I have:

server ~ # zpool status
  pool: mir-2tb
 state: ONLINE
 scrub: none requested
config:


        NAME                                                       STATE     READ WRITE CKSUM
        mir-2tb                                                    ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30SHWWD  ONLINE       0     0     0
            disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05894          ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30U4RKD  ONLINE       0     0     0
errors: No known data errors

  pool: mir-2tb2
 state: ONLINE
 scrub: none requested
config:

        NAME                                                       STATE     READ WRITE CKSUM
        mir-2tb2                                                   ONLINE       0     0     0
          mirror-0                                                 ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS5C3020ALA632_ML0220F30TG22D  ONLINE       0     0     0
            disk/by-id/ata-SAMSUNG_HD204UI_S2H7J9BZC05884          ONLINE       0     0     0

errors: No known data errors
server ~ #


So now one pool has a raid-1 of 3 disks and the other a raid-1 of 2 disks. Availability thus even improved a bit.

Great, problem fixed. Not quite! Just making sure the disks in a pool are of mixed species is not enough.
It's not visible in the zpool status but the disks of the same pool were all on the same sata controller. I have an onboard sata controller and a promise sata controller. Here is a selective lspci:

00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] (rev 40)
05:05.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)

So if a controller were to die, corrupt data or have intermittent problems, that could again result in the pool being lost, because ZFS wouldn't be able to fix the problem using a source that still has correct data. Easy fix: just make sure that the disks in a ZFS pool are spread out over different controllers. That way the chance of problems diminishes again.
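
Checking which controller a given disk hangs off is easy via sysfs; the resolved path contains the controller's PCI address, which you can match against the lspci output above (sdb is just an example device name):
readlink -f /sys/block/sdb    # path contains 0000:00:11.0 (onboard) or 0000:05:05.0 (promise)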

Too paranoid? Not really. Before I replaced my server a couple of months ago I was using an onboard sata controller that was corrupting the data transferred to the disks. I never knew until I switched to ZFS.

Don't blow your bits. Spread them around.

Sunday, September 25, 2011

real world usb3 disk performance

So a while ago I bought a new motherboard. It took me a while to make a choice because I wanted it to be cheap while still having at least 6 sata connections, pata too, and support for ECC memory. ZFS is great at keeping the data on your disks correct, but if the data gets corrupted in memory you're still lost, hence the ECC.
I finally chose the asus M4A88TDM-EVO because it had all the specs I wanted AND included usb3. Great, I thought, just what I need... this is future proof. Too bad it turned out to be just a separate pci-e 1x card. But it's there, and after compiling a kernel module for it (usb3 requires the xhci module, or xhci_hcd on gentoo) it worked.

Kinda.

It's very spammy in dmesg. Lots of debug info, it seems. Or at least info I don't need. This is kernel "Linux server 2.6.36-gentoo-r8 #2 SMP Tue Apr 5 08:18:52 CEST 2011 i686 AMD Athlon(tm) II X2 250 Processor AuthenticAMD GNU/Linux" mind you, so newer kernels may not have this.

Time to test this puppy out. Since the theoretical limit of usb2 is 60MByte/sec there is a chance that my disks are sometimes hitting the limit of the bus instead of the speed limit of the platters. Since usb3 is 10x as fast, the bus should never be the bottleneck there.
So would this really matter, in real life?

I used a Hitachi Deskstar 7K1000 HDT721075SLA380 to test with. It's supposed to be able to read at speeds between 86MB/sec (outer disk sections) and 46MB/sec (inner sections). With the 60MB/sec max speed of usb2 there is something to gain here. The disk is connected using a LaCie usb-sata external enclosure.
The usb3 controller is this one:
02:00.0 USB Controller: NEC Corporation Device 0194 (rev 03)

I performed some tests, first on usb2. Please don't mind the changing device names; every time I unplug and re-plug the device it may get a new name. It remains the same drive.

server ~ # hdparm -tT /dev/sdl

/dev/sdl:
 Timing cached reads:   2644 MB in  2.00 seconds = 1322.00 MB/sec
 Timing buffered disk reads:  92 MB in  3.06 seconds =  30.03 MB/sec
server ~ #

The cached-reads number mostly measures how fast the system can shuffle data through its RAM; the interesting number is the buffered disk read, which drops to 30MB/sec when actually reading from the platters. It's an old disk ;-) Hdparm is a bit of a synthetic test, so what does a file transfer do for us:

write:
server maxx # dd if=bigfile of=/mnt/usb-backup-disk-750g-20081001/bigfile
9170564+1 records in
9170564+1 records out
4695328969 bytes (4.7 GB) copied, 211.23 s, 22.2 MB/s

server linux # dd count=9000000 if=/dev/zero of=/mnt/usb-backup-disk-750g-20081001/bigfile
9000000+0 records in
9000000+0 records out
4608000000 bytes (4.6 GB) copied, 133.498 s, 34.5 MB/s
server linux #


read :
server linux # dd if=/mnt/usb-backup-disk-750g-20081001/bigfile of=/dev/null
9170564+1 records in
9170564+1 records out
4695328969 bytes (4.7 GB) copied, 138.333 s, 33.9 MB/s

read while taking the filesystem out of the loop:
server linux # dd count=10000000 if=/dev/sdk of=/dev/null
10000000+0 records in
10000000+0 records out
5120000000 bytes (5.1 GB) copied, 155.335 s, 33.0 MB/s
[odd, that's actually slower than when reading from the fs... ]


then on usb3:

server ~ # hdparm -tT /dev/sdn

/dev/sdn:
 Timing cached reads:   2434 MB in  2.00 seconds = 1216.78 MB/sec
 Timing buffered disk reads: 104 MB in  3.00 seconds =  34.63 MB/sec
server ~ #

That's a small improvement when reading from the platters. Interestingly, the cached read is slightly slower.

write:
server maxx # dd if=bigfile of=/mnt/usb-backup-disk-750g-20081001/bigfile
9170564+1 records in
9170564+1 records out
4695328969 bytes (4.7 GB) copied, 123.346 s, 38.1 MB/s

write zero's
server maxx # dd count=9000000 if=/dev/zero of=/mnt/usb-backup-disk-750g-20081001/bigfile
9000000+0 records in
9000000+0 records out
4608000000 bytes (4.6 GB) copied, 122.208 s, 37.7 MB/s


read:
server maxx # dd if=/mnt/usb-backup-disk-750g-20081001/bigfile of=/dev/null
9170564+1 records in
9170564+1 records out
4695328969 bytes (4.7 GB) copied, 119.702 s, 39.2 MB/s

read from disk without fs:
server maxx # dd count=9000000 if=/dev/sdk of=/dev/null
9000000+0 records in
9000000+0 records out
4608000000 bytes (4.6 GB) copied, 122.362 s, 37.7 MB/s


Next up was bonnie. When running the usb3 test my machine caught a kernel panic! Odd... I reset the machine and ran bonnie again. No crash this time. It did prompt me to run each test multiple times (-x 3):
`bonnie++ -d /mnt/usb-backup-disk-750g-20081001/tmp -s 6576 -c 4 -m usb220110925 -x 3 -u 1000:100`

the result:


[bonnie++ 1.96 results for the same disk over usb2 and usb3 (several runs each, 6576MB test size): sequential output (per-char, block, rewrite), sequential input (per-char, block), random seeks, and sequential/random file create/read/delete rates, plus latencies. The original HTML table did not survive the conversion to plain text, so the individual numbers are omitted here.]

The above results are rather clear. The usb3 controller is resulting in better performance according to bonnie too. Just to be sure I re-ran the test and the results were slightly less pronounced.

In conclusion:
Using a usb3 controller resulted in a 10%-20% performance increase on the same disk compared to using a usb2 connection. At times the difference is smaller; rarely is it bigger.

Too bad this usb controller only has 2 connections. I wonder how much performance remains when I start daisy-chaining multiple devices on those connectors...

Friday, September 23, 2011

SSD temperature problems

Some (most) SSDs don't have a temperature sensor. The result is that temperatures can be reported that are just wrong, like 127 degrees, or -1. For servers this can be a problem, because the server can't tell whether the sensor is defective or whether there is a real problem. So it will spin the fans at maximum speed in order to try and cool the disk. That increases the power usage slightly and increases the noise level drastically. Within a DC that may not be a problem (if you have proper hearing protection).

I've tried 3 SSDs and 2 happened to work fine:
60GB OCZ Vertex 2 (OCZSSD2-2VTXE60G) - broken (firmware 1.33)
60GB Corsair Force (CSSD-F60GB2-BRKT) - fine
60GB Corsair Nova (CSSD-V60GB2) - fine

Some disks don't have a sensor but report a fixed temperature. That also works. Intel's 710 SSD range seems to have an actual temperature sensor.
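
Checking what a given drive reports is a one-liner with smartmontools; if nothing comes back, there is no temperature attribute at all (the device name is just an example):
smartctl -A /dev/sda | grep -i temp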

So if you plan on using such a consumer grade SSD for a solution that bases fan speed on disk temperatures (like servers) this can help you figure out just slightly better what to buy, and what not to buy.

2011-10-18 update:
The OCZ Vertex 3 also does not have a (working) temperature sensor according to SMART.

Monday, September 19, 2011

4k sectors and aligning for LVM


I've been running an LVM volume spread out over a slowly growing number of disks for a couple of years now. Recently a disk broke and I RMA'd it. It was a western digital wd10eads. I got back a wd10ears. Note the tiny difference. Same size disk, but now using 4k sectors instead of 512-byte sectors.
Interesting!

You know how new harddrives these days work with 4k sectors, right? And how many disks 'lie' about this and say that they (still) have a sector size of 512 bytes?
I read about the 4k sector size a while ago and became aware of a number of possible problems that can come up when using those kinds of drives that 'lie'.

In general, what's the biggest problem?
If you are using ext3, then data is stored in blocks of 4096 bytes (4k) by default. But because the disk claims it stores data in 512-byte sectors, the OS will split a 4k write into 8 writes of 512 bytes. That's sub-optimal, but the real problem happens when you use partitions. By default, the 1st partition starts at sector 63. That means that sectors 0-62 are 512-byte blocks used for the MBR and other things, and then sector 63 holds the 1st data of your 1st partition. And because the ext3 filesystem stores data in 4k blocks [AFAIK!], the 1st ext3 block fills sectors 63-70 (8x 512-byte sectors). However... sector 63 is not the start of a physical 4096-byte sector on the disk itself. The physical 4k sectors start at logical sectors 0, 8, 16, 24, 32, 40, 48, 56, 64, and so on.
This means that when the OS wants to read the 1st block of the 1st partition, it sends 8 requests for 512-byte blocks of data. The disk then needs to read 2 physical sectors, because those 8 blocks are split over 2 actual 4k sectors.
If however the 1st partition had started at sector 64, the 4k filesystem block would be exactly aligned with the physical sector on the disk. Then the OS would request 8 blocks and the disk would only need to read one 4k sector. Making that 1st partition start at the right place is called 'alignment'.

This isn't too bad when you're just requesting 1 small block of data. But when doing big file operations or the server is very heavy on IO, you're in trouble.
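
Checking whether an existing partition is aligned comes down to looking at its starting sector: a start that is divisible by 8 lines up with the physical 4k sectors (sdb is just an example device):
fdisk -lu /dev/sdb    # start sector 63 = misaligned, 64 or 2048 = aligned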


For writes the problem is much worse. When writing a misaligned 4k block the disk first needs to read both physical sectors before it can write, because only part of each sector will be updated. The 1st physical 4k sector gets 512 bytes overwritten, leaving 3584 bytes untouched. The next sector gets 3584 bytes overwritten, with 512 bytes untouched. Those untouched bytes first need to be read by the disk, because it can't do a partial write to a sector; disks can only handle entire sectors at once. So for one write the disk first does 2 reads, then 2 writes.
In the aligned situation it only needs to do one write, because no bytes are left untouched in any affected sector.

That is the situation when you use partitions. Luckily my disk is used in an LVM volume group and I just gave the entire disk to LVM. I did not create a partition. There is no need to when you use LVM.
What does this mean for alignment and performance on this disk? It means I'm probably OK.
Probably? I couldn't find any information on this on the net. Which means I want to go and take a look at the actual data on the disk, and at the LVM source code, to figure out how it stores the data on disk.
A quick performance test on the disk also suggested that there is most likely no alignment problem; otherwise the write performance would have been much lower.
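
One thing that is easy to check without reading source code is where LVM starts laying out data on the physical volume. If pe_start is a multiple of 8 sectors (4KiB), the extents line up with the physical sectors. Something like this, with the units forced to sectors:
pvs --units s -o pv_name,pe_start    # typically 384s (192KiB) or 2048s (1MiB), both multiples of 8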

More will follow....

Sunday, September 18, 2011

"windows dynamic disks - mirroring" useless

I run Windows 7 on my desktop. I'm a big Linux fan, but the Linux desktop experience is IMHO still not there yet. Although I must admit the last time I tried was 3 years ago and things have improved. Guess I'm going to give ubuntu a shot soon :)

Because of my aversion to data loss and downtime I run Windows on a pair of disks in a mirrored setup. One of the disks is pata and the other is sata. It's what I had lying around. Why spend money when you have parts on the shelf? My on-board ata controller doesn't support setting up a mirror over pata and sata; they must be on the same type of connection. So instead I configured windows to do the mirroring for me. Most on-board hardware 'raid' controllers are just a bios call that's handled on the cpu anyway.
Therefore there shouldn't be that much of a performance difference between doing a mirror action in the OS versus doing it via the bios on the cpu. That's the way it is under Linux in my experience.
In Windows 7 (and older versions of windows) disks need to be converted to a special partition format to be able to partake in a raid set. Be it mirror or otherwise. This can be done through the  'computer management' application, in the 'storage' section. So I set it up and all was fine. It spent some time copying the data from the source to the mirror disk and all was done. Everything is fine, right?
Not so. Every reboot the sync starts over. All of it. Not just the bits that are out of sync, everything.
That doesn't happen under Linux. There it keeps track so that when you reboot the raid is still in sync.
Not the same for Windows 7.
So what? Well it means when I boot windows my disk performance goes down the drain for a couple of hours until it has re-synced everything. And that sucks.
There are advantages too... if I was using a 'fake' raid controller and the controller died, I'd run the risk of losing my data if the configuration lives in the raid controller and it uses some proprietary format on the disks. A Windows mirror will maintain integrity. You could even move a windows disk to a new machine and it will run. No configuration needed. Nice, yet not what I'm looking for.

Time to kick Windows mirroring out of the door and find some other way to do this.

But... you can't. Not while it's syncing. Crazy right? So first I have to wait for an entire sync to be complete, only to break the mirror and remove the sync configuration.
Too bad there are no alternative filesystems for windows. It's just fat or ntfs. Linux has many filesystems. Another reason to just choose open source. Or apple ;-)


Saturday, September 17, 2011

harddisk firmware warnings

A while ago I bought 2 new harddisks. They were Samsung EcoGreen HD204UI's. Great value in my opinion. Since I don't trust harddisks I put ZFS on them. That gives me a bit more faith that I'll keep my data. Faith... no guarantees. Software bugs could still kill all data on there. But I digress.
How big was my surprise when, a couple of weeks later, I just happened to read a forum post about these drives having problems when using SMART functionality (check here for details: http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks). I have these drives running under Linux, and smart checks my disks periodically. If data happened to get written to the disk just at the moment a SMART command was issued, I'd end up with a bad sector. WTF!
I was able to find the firmware patch that Samsung put on their website by googling for it. There was no link to it on the Samsung product page for these drives. There was no announcement. I patched the drives and they have been humming happily ever since.

But what about my other drives? Are they OK? Do they have problems that warrant a firmware update? How do I find out, when drive makers don't willingly publish such information for fear of it hurting sales? I have no idea. The most logical way to find out is to run into trouble. I won't spend the time googling the internet every month to try and find this out. I should be notified. Isn't that what 'registering your product' is for? Registrations have never done anything for me.

Hard disk manufacturers should step up and provide at least some way of getting notified. If they don't, someone may do it for them.

Friday, September 16, 2011

boring batteries made livelier

Back when I was in college there was a guy living in our dorm who was working on state-of-the-art batteries. He was very enthusiastic about them, but I didn't really understand why. Sure, it would be neat to have a bit more juice in my batteries. It would save me from recharging my phone every day. But there is a law on batteries and power that is similar to Moore's 'law' on the number of transistors for CPUs: the more capacity something has, the more it will be used. The increase in usage offsets the increase in capacity.
Now, years later, I find myself still using AA batteries. And some AAA too, for things like remote controls. Environmentally conscious as I am, I use rechargeable batteries. I've never really given it much thought. I bought a charger a while ago with some batteries and used them. Then I bought some more rechargeable batteries because I ran short on AA's. And somehow, after a couple of months, they didn't seem to work that well anymore. After charging, the batteries would only work for a short period of time. I assumed it was the charger, so I bought a better one with better batteries. This went on for a while until I had 3 chargers, 20 batteries and a lot of frustration. Which battery did I charge again? Which one is the new one? Is this battery fully charged? Bah, let me charge it again. And then waiting over 12 hours for a battery to charge that has text on the side saying it charges in 7 hours.
So I started to do some research. It turns out I had 2 different types of chargers: one for NiCd batteries, and two for NiMH and NiMH/NiCd. Trying to charge the NiMH batteries in the NiCd charger didn't work that well, apparently. There also was no information on the chargers about their charge rate, or about why one charger would flag certain batteries as bad while the other chargers would not.

I needed something better. I needed something that would test my batteries for proper functioning and would charge them properly.

I have found such a device. It's called the voltcraft ipc-1. The ipc-1 does it all:
It handles NiCd and NiMH.
Bad battery detection is automatic
It can charge faster than any charger I've ever had (at up to 1800mA!), although you probably don't want to (see below).
It can re-condition, aka refresh, batteries.

Wait... it can 'refresh' batteries? Well, apparently batteries lose their mojo over time or with repeated use. By discharging and charging them a number of times the battery gets back some of its mojo and is able to continue its work for you a while longer before you need to replace it. That's great! I never knew that. It was never mentioned on any of the batteries I bought. Doing this manually is a boring task: you need to keep count, and it takes forever. The ipc does it by itself. Just give it 1-4 batteries, tell it to refresh and it will tell you when it's done. Hooray for ease of use :)

Reading more about batteries and charging, I found out that the faster you charge a battery the quicker it wears down. This is very much dependent on the battery quality; better brands allow for faster charging with less (or perhaps no) wear. So even though the ipc can charge at 1800mA, you probably shouldn't.

I hope this was somewhat informational.

As a parting thought: why are non-rechargeable batteries still being sold? The EU ruled that incandescent light bulbs can no longer be sold because there are many good alternatives these days. Why not do the same for batteries? This is a huge strain on the environment. Many people throw them in the trash instead of recycling them. It's a waste of your money too.....

Wednesday, September 14, 2011

harddisk data safety

When I use a harddisk I know I'm trusting my data, which I care about, to a device that is guaranteed to give errors.

The rate at which it gives errors is listed in the harddisk's specifications. For a 2TB disk the non-recoverable read error rate can be 1 sector in 10^15 bits read. That's a good buy. But there are also disks out there with a higher error rate, at 1 in 10^14. What does that mean? It means the disk has more errors than one in 10^15, but most can be corrected (by re-reading, error correction calculations, etc). Once in every 10^15 bits it can't recover. With 8 bits per byte that translates to one unreadable sector per roughly 125TB read. That sounds like veeeeery little. But is it? If I read the 2TB disk about 60 times, I can expect a read error. One sector is still usually 512 bytes (but this will very soon change to 4k). So with 60 full disk reads you risk losing 512 bytes of data. That's nothing, right?
Think again. If this happens to be in the middle of the file you keep your administration in, you had better have a backup. You have backups, right? And not on the same disk, right?
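
The back-of-the-envelope arithmetic behind those numbers, in shell for anyone who wants to check it:
echo $(( 10**15 / 8 / 10**12 ))   # bytes per expected unrecoverable sector, in decimal TB: 125
echo $(( 125 / 2 ))               # full reads of a 2TB disk before that happens, on average: 62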

The situation gets much worse as disks get larger. When you buy a 3TB disk you only need to do about 40 full reads. With 4TB disks you're down to 30. See what I'm getting at? If you have a 10^14 error rate disk of 2TB you get a read error every 6 full reads. That's data loss, people. Or is it? The read error was 'unrecoverable', but does that mean that all retries will fail too? I'd love some feedback on that.

Raid is one way to solve this. But beware: with raid-5 and even raid-6 you're making the problem worse! If one disk in your raid-5 setup fails, you replace it and a rebuild takes place. That's a full read of your remaining disks! That means if any read error occurs then *poof*, your data is gone. Cheap raid controllers then stop the rebuild and you have just lost everything.
Using raid-1 or raid-10 is a much better solution if you value your data. But more expensive too.

However, most of the time harddisks are not fully used. New disks especially are mostly empty initially. So what if we keep this error rate but duplicate the data on the disk? If all your data is on the disk twice, the chance of losing it is much lower. And lower still when even more copies are made. Doing this yourself (saving 6 copies of the same file on your filesystem) is a lot of extra work.

There are filesystems that do this, somewhat. For instance, on ZFS you can configure the number of copies that should be made (see below). But I'm looking for something more intelligent. I'm calling out to harddisk makers to make as many copies of the existing data as the free space will allow. Then when an error occurs the disk can just go look at one of the copies and repair the problem area. Problem solved. Simple idea really. Can it be done? I hope so :)
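
For reference, the ZFS knob mentioned above is the 'copies' property, set per filesystem (the pool/filesystem name is just an example):
zfs set copies=2 tank/photos    # store every block twice, even on a single-disk pool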

Monday, September 12, 2011

MS word and open-office

I don't like word. The microsoft product that is. For a program that's supposed to make my text look good, it's way too complicated. I think I use maybe 10% of its capabilities and yet I find myself always looking for buttons to try and get done what I need to get done. That's such bad user interaction design.
So I use open office. I don't like open office.
I like open. Just not the product. It doesn't read my MS word documents very well. Sure, the text is copied and the markup is close, but it's not the same. And it looks awful. This program also has me chasing buttons to figure out how to do things.
Both products suffer from me not knowing how the tool really works. It's like a mix between Windows and Linux. In this case the products are easy to get started with (like Windows) but then you hit a sudden and very steep learning curve (like Linux).

All I want is to do things that are obvious. Like moving tables visually. Or changing borders and spacing. I know I'm not alone; nobody in the office I work in knows how to do those somewhat-beyond-the-basics things either. How do you change the inter-line spacing, and why does it suddenly change sometimes?

How come there is no 'easy' tool that just does the basics? I'm thinking there is real market potential here. I'll just use vi in the meantime ;)

Sunday, September 11, 2011

SSL security - diginotar

In case you didn't know, everyone in between your computer and the site you're looking at can see everything you send to that site, and everything you receive from it. Including usernames and passwords. That is, if you're not using a secure connection to that website. When is it secure? When the address in your URL bar starts with HTTPS instead of HTTP.

The S in httpS stands for secure. It's made secure using encryption. Anyone can make an encryption key and certificate, but your browser won't trust them. It only trusts a set list of companies that make money by giving out certificates. Those companies are called certificate authorities (CA's).

Recently a Certificate Authority was hacked. The company is called diginotar and they hand out SSL certificates. Unfortunately the hackers got access to the system in such a way that they were able to issue themselves certificates for any domain they wished. This means they can pretend to be amazon.com and handle payments. Or pretend to be gmail and get your username and password.
Of course most browser makers (apple, microsoft, google, mozilla) have quickly removed the diginotar company from the list of trusted CA's.

But who knows which other CA's have also been hacked? Most hacks go unannounced. Is your 'secure' connection really secure? Goodbye safe feeling when doing online shopping! Goodbye feeling of privacy when reading your mail. Don't even think of being a political activist.

Luckily there is a solution: http://www.networknotary.org/firefox.html
This firefox plugin checks the validity of a received certificate not just from your own viewpoint, but from several viewpoints all over the world. That way a local 'Man in the Middle' attack becomes much more difficult.

Take note: the diginotar hackers also had a certificate for the firefox addons domain. They could be sending you a different version of the plugin than the one you're really looking for. But if you're not in the middle east or China, I suspect the chances of that happening are somewhat slim.

The perspectives plugin only helps if you're on a website that's talking HTTPS. Most websites default to HTTP unless you specifically ask for HTTPS. That's somewhat annoying, having to type https all the time, forgetting, etc. There is also a plugin to help with this: http://www.eff.org/https-everywhere
It's not perfect: it only works for a specific list of websites. But you can add your own. It's a start :)

Feel safe, use perspectives! Be safe, use https-everywhere!