Subject: Re: DEV_B_SIZE
To: Julian Elischer <julian@elischer.org>
From: Steve Byan <stephen_byan@maxtor.com>
List: tech-kern
Date: 01/31/2003 13:56:09
On Friday, January 31, 2003, at 01:16  PM, Julian Elischer wrote:

>
>
> On Fri, 31 Jan 2003, Steve Byan wrote:
>
>> There's a notion afoot in IDEMA to enlarge the underlying physical
>> block size of disks to 4096 bytes while keeping a 512-byte logical
>> block size for the interface. Unaligned accesses would involve either
>> a read-modify-write or some proprietary mechanism that provides
>> persistence without the latency cost of a read-modify-write.
>>
>> Performance issues aside, it occurs to me that hiding the underlying
>> physical block size may break many careful-write and
>> transaction-logging mechanisms, which may depend on no more than one
>> block being corrupted during a failure. In IDEMA's proposal, a power
>> failure during a write of a single 512-byte logical block could result
>> in the corruption of the full 4K block, i.e. reads of any of the
>> 512-byte logical blocks in that 4K physical block would return an
>> uncorrectable ECC error.
>>
>> I'd appreciate hearing examples where hiding the underlying physical
>> block size would break a file system, database, transaction processing
>> monitor, or whatever.  Please let me know if I may forward your reply
>> to the committee. Thanks.
>
> I presume that if such a drive were made, there would be some way to
> identify it?

Yes, but my concern is that advocates claim existing software could 
work (albeit slowly) with such a drive. It's hard to retroactively 
modify binaries installed in the field to adapt to a larger block size 
:-)
>
> It would be very easy to configure a filesystem to have a minimum
> writable unit size of 4k, and I assume that doing so would be
> slightly advantageous (no read-modify-write). It would, however,
> be good if we could easily identify when doing so was a good idea.

Yes, I've built and run OSF/1 on a system with 4K sector size; this was 
essentially BSD4.3. Modifying DEV_B_SIZE and recompiling the world was 
sufficient (well, actually the boot loader had to know the block size, 
and I needed a way to format the disks to 4K, and ...).
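
To make the read-modify-write condition concrete, here is a minimal
sketch. It assumes physical block boundaries fall on logical block
numbers that are multiples of 8 (no offset), which the proposal does
not guarantee; the function name is made up for illustration.

```python
LOGICAL_BLOCK = 512      # bytes, as exposed by the interface
PHYSICAL_BLOCK = 4096    # bytes, the proposed underlying size
RATIO = PHYSICAL_BLOCK // LOGICAL_BLOCK  # 8 logical blocks per physical block


def needs_read_modify_write(start_lba, nblocks):
    """Return True if the write covers only part of some physical block.

    Assumption: physical blocks start at LBA 0, 8, 16, ... (no offset).
    """
    if nblocks <= 0:
        raise ValueError("nblocks must be positive")
    end_lba = start_lba + nblocks  # first LBA past the end of the write
    # A write avoids read-modify-write only when both ends sit on a
    # physical block boundary.
    return start_lba % RATIO != 0 or end_lba % RATIO != 0
```

A file system whose minimum writable unit is 4K, and whose units start
on a physical boundary, would only ever issue writes for which this
returns False.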
>
> Another idea would be to have some way that you could specify a block
> number and have the drive tell you the first in the same group. That
> would allow a filesystem to work out the alignment. It may not be able
> to access absolute block numbers, if it's going through some layers of
> translation, and some way of saying "am I aligned?" might be useful.
>
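
A sketch of what such a query might compute, purely as an illustration.
No such drive command is described in this thread; the function names,
the ratio of 8, and the offset parameter (standing in for a shift
introduced by a translation layer such as a partition table) are all
assumptions.

```python
def first_in_group(lba, ratio=8, offset=0):
    """Return the first logical block sharing a physical block with lba.

    ratio:  logical blocks per physical block (assumed 4096 / 512 = 8).
    offset: hypothetical LBA at which a physical block boundary falls,
            as seen through whatever translation layers sit above the drive.
    """
    first = lba - ((lba - offset) % ratio)
    # The lowest group may be partial when offset is not a multiple of ratio.
    return max(first, 0)


def is_aligned(lba, ratio=8, offset=0):
    """Answer the "am I aligned?" question for a single block number."""
    return (lba - offset) % ratio == 0
```

For example, with a layer that starts at sector 63, is_aligned(63,
offset=63) is true while is_aligned(64, offset=63) is not, which is why
the file system cannot work this out from its own block numbers alone.
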
> One thing that does come to mind is that as you say, on power fail we
> would now be liable to lose a group of 8 sectors (4k) instead of one
> 512-byte sector.
>
> Recovery algorithms might have to deal with this (should we actually
> decide to write one.. :-).
>
> Particularly if the block being written was the 1st, but the other 7
> blocks contain data that the OS has no way of knowing is in
> jeopardy. In other words, I might know that block 1 is in danger and
> put it in a write log (in a logging filesystem), but I have no way of
> knowing that the other 7 are in danger, so they may not be in the write
> log (assuming that the write log only holds the last N transactions).
> I'd say that this means that the drive should hold the active 4k block
> in nvram or something..
>
> You seem to have considered this but I'm in agreement that it could
> prove "nasty" in exactly the cases that are most important..
> People use write logging etc. in cases where they care about the data
> and recovery time. These are exactly the people who are going to be the
> most pissed off to lose their data...

Thanks, may I forward your response on to the committee?
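
To put the exposure you describe in concrete terms, here is a small
sketch of which logical blocks are in jeopardy for a given set of
writes, and which of those a log that only records the blocks actually
written would miss. It assumes 8 logical blocks per physical block with
boundaries at multiples of 8; the function names are hypothetical.

```python
def blocks_at_risk(written_lbas, ratio=8):
    """Return every logical block sharing a physical block with a write.

    Assumption: physical blocks cover LBAs [0..7], [8..15], and so on.
    """
    at_risk = set()
    for lba in written_lbas:
        first = lba - lba % ratio  # first logical block in the group
        at_risk.update(range(first, first + ratio))
    return at_risk


def unlogged_at_risk(written_lbas, ratio=8):
    """Blocks in jeopardy that a log of the written blocks does not cover."""
    return blocks_at_risk(written_lbas, ratio) - set(written_lbas)
```

Writing only logical block 8 leaves blocks 9 through 15 in the second
set: they could come back with an uncorrectable ECC error after a power
failure, yet nothing in the log would say so.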
>
> If we can easily tell the system to use 4k frags or 4k blocknumbers
> (i.e. we can elect to expose the real blocksize) then we are probably
> in better shape.

I agree.

Regards,
-Steve
--------
Steve Byan <stephen_byan@maxtor.com>
Design Engineer
Maxtor Corp.
MS 1-3/E23
333 South Street
Shrewsbury, MA 01545
(508) 770-3414