Monday, September 19, 2011

4k sectors and aligning for LVM


I've been running an LVM volume spread out over a slowly growing number of disks for a couple of years now. Recently a disk broke and I RMA'd it. It was a Western Digital WD10EADS; I got back a WD10EARS. Note the tiny difference. Same size disk, but now using 4k sectors instead of 512-byte sectors.
Interesting!

You know how new hard drives these days work with 4k sectors, right? And how many disks 'lie' about this and claim that they (still) have a sector size of 512 bytes?
I read about the 4k sector size a while ago and became aware of a number of problems that can come up with drives that 'lie' like this.

In general, what's the biggest problem?
If you are using ext3, data is stored in chunks of 4096 bytes (4k) by default. But because the disk claims it stores data in 512-byte sectors, the OS will split each 4k write into 8 writes of 512 bytes. That's suboptimal, but the real problem happens when you use partitions. By default, the 1st partition starts at sector 63. That means sectors 0-62 are 512-byte blocks used for the MBR and other things, and sector 63 holds the 1st data of your 1st partition. And because the ext3 filesystem stores data in 4k blocks [AFAIK!], the 1st ext3 block fills sectors 63-70 (8x 512-byte sectors). However... sector 63 is not the start of a 4096-byte sector on the disk itself. The physical 4k sectors start at logical sectors 0, 8, 16, 24, 32, 40, 48, 56, 64, and so on.
This means that when the OS wants to read the 1st block of the 1st partition, it sends 8 requests for 512-byte blocks of data. The disk then needs to read 2 physical sectors, because those 8 blocks are split over 2 actual 4k sectors.
If, however, the 1st partition had started at sector 64, the 4k block would be exactly aligned with an actual sector on the disk. The OS would request 8 blocks and the disk would only need to read a single 4k sector. Making that 1st partition start at the right place is called 'alignment'.
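The arithmetic above is easy to sketch in code. This is just an illustration (the helper name and the whole snippet are mine, not from any real tool): map a 4k request at a given 512-byte logical sector onto the physical 4k sectors it touches.

```python
LOGICAL = 512      # sector size the drive reports to the OS
PHYSICAL = 4096    # sector size the drive actually uses

def physical_sectors(start_lba, length=4096):
    """Return the physical 4k sectors covered by a request of
    `length` bytes starting at 512-byte logical sector `start_lba`."""
    start = start_lba * LOGICAL
    first = start // PHYSICAL
    last = (start + length - 1) // PHYSICAL
    return list(range(first, last + 1))

print(physical_sectors(63))  # [7, 8]: the misaligned case, 2 physical reads
print(physical_sectors(64))  # [8]: the aligned case, 1 physical read
```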

This isn't too bad when you're just requesting 1 small block of data. But when you're doing big file operations, or the server is very heavy on IO, you're in trouble.


For writes the problem is much worse. When writing a 4k block, the disk first needs to read both physical sectors before it can write, because each will only be partially updated. The 1st physical 4k sector gets 512 bytes overwritten, leaving 3584 bytes untouched; the next sector gets 3584 bytes overwritten, with 512 bytes untouched. Those untouched bytes first need to be read by the disk, because it can't do a partial write to a sector: disks can only handle entire sectors at once. So for one write the disk first does 2 reads, then 2 writes.
In the aligned situation it only needs to do a single write, because no bytes are left untouched in any sector.
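The read-modify-write penalty can be counted the same way. Another made-up helper, assuming the same 512-byte logical / 4k physical split: a physical sector that is only partially covered by the write must be read first.

```python
def rmw_reads(start_lba, length=4096, logical=512, physical=4096):
    """Count physical sectors the disk must read before it can
    perform a write, i.e. sectors only partially covered by it."""
    start = start_lba * logical
    end = start + length
    reads = 0
    for sec in range(start // physical, (end - 1) // physical + 1):
        sec_start = sec * physical
        sec_end = sec_start + physical
        # partially covered -> read-modify-write needed
        if start > sec_start or end < sec_end:
            reads += 1
    return reads

print(rmw_reads(63))  # 2: both physical sectors are partially overwritten
print(rmw_reads(64))  # 0: the write covers exactly one physical sector
```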

That is the situation when you use partitions. Luckily, my disk is used in an LVM volume group and I gave the entire disk to LVM; I did not create a partition. There is no need to when you use LVM.
What does this mean for alignment and performance for this disk? It means I'm probably ok.
Probably? I couldn't find any information about this on the net, which means I want to take a look at the actual data on the disk, and at the LVM source code, to figure out how it stores data on disk.
A quick performance test on the disk also suggested that there is most likely no alignment problem; otherwise the write performance would have been much lower.
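One thing you can check without reading the source: where LVM starts placing data on the device, via `pvs -o +pe_start --units b`. If that offset is a multiple of 4096, the physical extents start on physical sector boundaries. A trivial check, assuming you feed it a pe_start value read from that output (the 192 KiB value is LVM2's long-time default, as far as I know):

```python
def is_4k_aligned(pe_start_bytes, physical=4096):
    """True if LVM's first physical extent starts on a
    physical 4k sector boundary."""
    return pe_start_bytes % physical == 0

print(is_4k_aligned(192 * 1024))  # True: 192 KiB = 48 physical sectors
print(is_4k_aligned(63 * 512))    # False: the classic partition offset
```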

More will follow....
