Subject: Re: DEV_B_SIZE
To: Steve Byan <stephen_byan@maxtor.com>
From: Julian Elischer <julian@elischer.org>
List: tech-kern
Date: 01/31/2003 10:16:41
On Fri, 31 Jan 2003, Steve Byan wrote:

> There's a notion afoot in IDEMA to enlarge the underlying physical 
> block size of disks to 4096 bytes while keeping a 512-byte logical 
> block size for the interface. Unaligned accesses would involve either a 
> read-modify-write or some proprietary mechanism that provides 
> persistence without the latency cost of a read-modify-write.
> 
> Performance issues aside, it occurs to me that hiding the underlying 
> physical block size may break many careful-write and 
> transaction-logging mechanisms, which may depend on no more than one 
> block being corrupted during a failure. In IDEMA's proposal, a power 
> failure during a write of a single 512-byte logical block could result 
> in the corruption of the full 4K block, i.e. reads of any of the 
> 512-byte logical blocks in that 4K physical block  would return an 
> uncorrectable ECC error.
> 
> I'd appreciate hearing examples where hiding the underlying physical 
> block size would break a file system, database, transaction processing 
> monitor, or whatever.  Please let me know if I may forward your reply 
> to the committee. Thanks.

I presume that if such a drive were made, there would be some way to
identify it?

It would be very easy to configure a filesystem to have a minimum
writable unit size of 4k, and I assume that doing so would be
slightly advantageous (no read-modify-write). It would, however,
be good if we could easily identify when doing so was a good idea.
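
As a rough sketch of why a 4k minimum unit avoids the read-modify-write,
assuming a 4096-byte physical block (the function name is made up for
illustration, it is not any real driver interface):

```python
PHYSICAL_BLOCK = 4096  # assumption: physical block size in bytes

def needs_read_modify_write(offset, length, physical=PHYSICAL_BLOCK):
    """True if a write of `length` bytes at byte `offset` only partly
    covers some physical block, so the drive would have to merge it
    with the old contents of that block."""
    return offset % physical != 0 or length % physical != 0

print(needs_read_modify_write(0, 512))     # True: one 512-byte sector
print(needs_read_modify_write(0, 4096))    # False: one whole 4k block
print(needs_read_modify_write(512, 4096))  # True: 4k write, but misaligned
```

The last case is the one that needs the identification: a 4k write only
helps if it also starts on a physical block boundary.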

Another idea would be to have some way that you could specify a block
number and have the drive tell you the first block in the same group.
That would allow a filesystem to work out the alignment. It may not be
able to access absolute block numbers if it's going through some layers
of translation, so some way of asking "am I aligned?" might be useful.
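
The arithmetic behind such a query might look like the sketch below. It
assumes 8 logical sectors per physical block and a partition start
offset known only to the translation layer; both function names are
hypothetical, not a real drive command:

```python
SECTORS_PER_GROUP = 8  # assumption: 4096-byte physical / 512-byte logical

def first_in_group(absolute_lba, group=SECTORS_PER_GROUP):
    """Return the first logical block that shares a physical block
    with absolute_lba."""
    return absolute_lba - (absolute_lba % group)

def is_aligned(relative_lba, partition_start, group=SECTORS_PER_GROUP):
    """True if a filesystem-relative block begins a physical block.

    partition_start is the absolute LBA of the filesystem's block 0,
    which the filesystem itself may not be able to see."""
    return (partition_start + relative_lba) % group == 0

# A partition starting at absolute sector 63 is off by 7 sectors:
print(first_in_group(63))   # 56
print(is_aligned(0, 63))    # False
print(is_aligned(1, 63))    # True
```

The filesystem only knows relative_lba, which is why the answer has to
come from the drive or from whichever layer knows partition_start.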

One thing that does come to mind is that as you say, on power fail we
would now be liable to lose a group of 8 sectors (4k) instead of 1 x 512
byte sector.

Recovery algorithms might have to deal with this (should we actually
decide to write one.. :-).

This is particularly so if the block being written was the 1st, and the
other 7 blocks contain data that the OS has no way of knowing is in
jeopardy. In other words, I might know that block 1 is in danger and put
it in a write log (in a logging filesystem), but I have no way of
knowing that the other 7 are in danger, so they may not be in the write
log (assuming that the write log only holds the last N transactions).
I'd say that this means that the drive should hold the active 4k block
in nvram or something..
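
To make the exposure concrete, here is a small sketch that lists the
neighbouring blocks put at risk by a single-sector write and which of
them a write log does not cover. It assumes groups of 8 sectors and
models the log as a plain set of block numbers, which is a
simplification of any real logging filesystem:

```python
def blocks_in_jeopardy(written_lba, group=8):
    """Logical blocks sharing a physical block with written_lba,
    excluding written_lba itself."""
    start = written_lba - (written_lba % group)
    return [b for b in range(start, start + group) if b != written_lba]

def unprotected_blocks(written_lba, write_log, group=8):
    """Neighbours of written_lba that are not present in the write log.

    write_log is a set of block numbers currently held in the log
    (a hypothetical stand-in for the last N transactions)."""
    return [b for b in blocks_in_jeopardy(written_lba, group)
            if b not in write_log]

# Writing block 8 with only block 8 logged leaves 9..15 exposed.
print(unprotected_blocks(8, {8}))  # [9, 10, 11, 12, 13, 14, 15]
```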

You seem to have considered this, but I agree that it could
prove "nasty" in exactly the cases that are most important.
People use write logging etc. in cases where they care about the data
and recovery time. These are exactly the people who are going to be the
most pissed off to lose their data.

If we can easily tell the system to use 4k frags or 4k block numbers
(i.e. we can elect to expose the real blocksize) then we are probably
in better shape.