Subject: Re: Overlapping bread(9) behaviour
To: Bill Stouder-Studenmund <wrstuden@netbsd.org>
From: Stephen M. Rumble <stephen.rumble@utoronto.ca>
List: tech-kern
Date: 07/03/2007 14:12:18
Quoting Bill Stouder-Studenmund <wrstuden@netbsd.org>:

> Yes. One requirement of the buffer cache is that any block on disk is
> cached in exactly one place at any one time. Where that is can change
> (say as a file gets deleted and the underlying disk re-used), but there's
> only one place at any one time.

Okay, but the case of two bread() invocations with the same offset  
yet different size parameters is still poorly handled. Should an  
assertion be made? I don't understand why bread() should happily fetch  
an in-core buffer of length N - epsilon, allocate a new buffer of  
length N, copy the smaller buffer's contents, and return it claiming  
it's truly of length N. Is this actually useful and used, or does  
everybody simply not do this? If the latter, I'd like to make sure  
that it can't be done.
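To make the hazard concrete, here is a minimal user-space sketch of the behaviour I'm describing. None of these names (mini_bread, mbuf_entry) are the real bread(9) internals; it's just a toy cache keyed by block number only, so a second lookup with a larger size grows the buffer, copies the smaller contents, and claims the full length even though the tail was never read:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy cache keyed by block number only -- size is not part of the key,
 * which is exactly what makes the overlapping-size case ambiguous. */
struct mbuf_entry {
	long	blkno;	/* cache key */
	size_t	size;	/* claimed length of the buffer */
	size_t	valid;	/* bytes actually "read from disk" */
	char	*data;
};

#define NCACHE 8
static struct mbuf_entry cache[NCACHE];
static int ncached;

/* Simulated bread(): on a hit with a smaller cached buffer, grow it,
 * copy the old contents, and return it claiming the new size --
 * but nothing refills the tail beyond 'valid'. */
static struct mbuf_entry *
mini_bread(long blkno, size_t size)
{
	for (int i = 0; i < ncached; i++) {
		struct mbuf_entry *e = &cache[i];
		if (e->blkno != blkno)
			continue;
		if (e->size < size) {
			/* The questionable path: claimed length grows... */
			e->data = realloc(e->data, size);
			e->size = size;
			/* ...but e->valid stays at the old, smaller value. */
		}
		return e;
	}
	/* Miss: "read from disk" -- here just fill with a pattern. */
	struct mbuf_entry *e = &cache[ncached++];
	e->blkno = blkno;
	e->size = e->valid = size;
	e->data = malloc(size);
	memset(e->data, 0xAB, size);
	return e;
}
```

A caller who does mini_bread(7, 1024) after someone else did mini_bread(7, 512) gets a buffer whose claimed size is 1024 but whose last 512 bytes are garbage, which is the silent-corruption case I'd like an assertion to catch.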

>> A few related questions: If the buffer cache expects fixed-sized
>> buffers, does that mean for some filesystems there could be a 124-byte
>> struct buf for each block of cached data? Also, do we not have any
>> filesystems with extents where this sort of thing would have cropped
>> up before?
>
> You could have 124-byte blocks if you wanted. The problem is that you
> can't do atomic i/o on them, and we implicitly assume you can atomically
> write a buffer block.

I was referring to the size of struct buf, rather than some queer block size.

Where do the atomic i/o assumptions stem from? Is this a guarantee  
provided by the disk? And if so, how strong is this guarantee? I.e.,  
are devices specifically designed to either write a complete block or  
nothing at all?

> Extents don't matter here. Extents are still ranges of fixed-size blocks,
> they just are described differently in the file metadata.
>
> Also, we have a limit to the maximum physical transfer size. Right now
> it's 64k. We want to raise that, but I don't think you want to be doing
> much over 256k in general. So you won't want a 1:1 mapping between extents
> and buffer entries, you'll want 1:many.

I think I could do a 1:1 mapping for indirect extents, as they were  
artificially limited in length by SGI. The on-disk structures are  
capable of exceeding that 64k limit, however.
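For the on-disk extents that can exceed the transfer limit, the 1:many mapping Bill describes amounts to chopping one extent into transfer-sized pieces. A rough sketch, assuming a 64k cap (the helper name and interface are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define XFER_MAX (64 * 1024)	/* assumed max physical transfer, per the thread */

/* Hypothetical helper: split an extent's byte length into chunk sizes
 * no larger than XFER_MAX, so one on-disk extent maps to many
 * buffer-cache entries instead of one oversized buffer. */
static size_t
extent_chunks(size_t ext_len, size_t *chunks, size_t max_chunks)
{
	size_t n = 0;

	while (ext_len > 0 && n < max_chunks) {
		size_t c = ext_len < XFER_MAX ? ext_len : XFER_MAX;
		chunks[n++] = c;
		ext_len -= c;
	}
	return n;	/* number of transfers needed */
}
```

So a 200k extent would become three 64k transfers plus an 8k tail, each with its own buffer entry.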

> ext2fs uses extents and is in tree.

I'll take a peek.

Thanks,
Steve