Subject: Re: Supporting sector size != DEV_BSIZE -- patches
To: None <tech-kern@netbsd.org>
From: Trevin Beattie <trevin@xmission.com>
List: tech-kern
Date: 06/03/2002 21:14:35
>Off-list? Ok. :-)

Oops!  Sorry, Bill; I just forgot to send the reply to tech-kern.  {:-}

>On Mon, 3 Jun 2002, Trevin Beattie wrote:
>
>> At 12:49 PM 6/3/2002 -0700, you wrote:
>> >On Sat, 1 Jun 2002, Trevin Beattie wrote:
>> >
>> >I don't think that's the best direction to go. I think it would be better
>> >to do everything in terms of underlying disk blocks. Mainly since physical
>> >block size is a property of the drive, so if you take a drive to another
>> >machine, it will work. If we depend on DEV_BSIZE (and I'm understanding
>> >you right), then different kernels will look in different places.
>>
>> Perhaps so; OTOH, if you depend on physical sector size, and try to dump a
>> file system from one drive to another having a different sector size (as
>> somebody mentioned in this thread last October), that will also fail.  In
>> any event, two arch's with differing DEV_BSIZE will probably also have
>> different default sector sizes, so you couldn't transfer a drive between
>> them anyway.  (I could be wrong; has anybody tried this already?)
>
>But... But... The whole point is to stop assuming that all sectors are the
>same size which happens to be DEV_BSIZE. :-) So that in the future you CAN
>take a drive from whatever you want to whatever you want.
>
>Yes, you won't be able to dump a file system from one disk to one of a
>different sector size. But what you CAN (well, will/should) be able to do
>is do the dump and then build a vnd on top of it, and get the right sector
>size.
>
>> I'm thinking the 'right' thing to do would be to try separating the file
>> system completely from the physical media.  The I/O system would be divided
>> into cleanly separated layers like this:
>>
>>             ffs (or other fs): N * fs_fsize-byte blocks
>>                         ----------------
>>             Buffer cache: V * DEV_BSIZE-byte buffers
>>                         ----------------
>>             block device: X * d_secsize-byte sectors
>>
>> Any time an offset or block count is passed from one layer to another, the
>> code should know how to convert the number from the units on one side to
>> the units on the other.  But no layer should assume, store, or otherwise
>> depend on the units used by another layer.
>
>To the extent we can, I like that separation. However, a number of file
>systems are already defined to NOT respect that separation. So there's not
>much we can do..
>
>> Currently one of the problems with this idea is that part of the ffs --
>> cylinder groups -- is dependent on the physical disk geometry (which, BTW,
>> is fake on many modern drives).  This is something I admit I haven't gotten
>> around in my patches.  The existing implementation computes cylinders in
>> terms of sector size, yet the ffs doesn't know (store) what the sector size
>> is!  That value is used by mkfs when creating the file system, then thrown
>> away.
>
>I agree about geometry being faked up. And that newfs sometimes gets quite
>fussy about things that don't matter. But are you sure about ffs throwing
>away the sector size? :-) fs_fsbtodb includes it, and the patches I came
>up with (on the devbsize branch) only had to make sure it was correct to
>get the fs working on 2k disks.

Okay, technical point here.  fs_fsbtodb (before I patched it) is computed
in terms of sector size, and I suppose you could re-compute the sector size
as [1 << (fs_fshift - fs_fsbtodb)].  But the plain sector size itself (or
log2(sectorsize)) is not stored in the fs structure.

The reason I changed fs_fsbtodb to use DEV_BSIZE units is because most of
the instances of fsbtodb that I've found pass the result to bread() or
set/compare the value to b_blkno, which require DEV_BSIZE units.

This is one of the things I _had_ to change to get a working ffs created on
a 2KB/sector disk.

>
>> >
>> >The main locus of the question is what have people been doing so far when
>> >they use non-512-byte sectors? Yes, we have hacks for CDs. But what have
>> >the MO folks been doing? Any MO users who can speak up?
>> >
>> >Also, what does FreeBSD do? We've worked hard to keep our on-disk format
>> >the same. Will this mean changing it?
>> >
>> >I think whatever we do should be backwards compatable with them, if
>> >possible.
>>
>> I completely agree that we should be compatible with other OS's and their
>> file systems, *on condition* that those systems work with large sector
>> sizes.  (There's no point in letting NetBSD's implementation break just
>> because another OS is broken, is there? :)  That's why I left out changes
>> to other file systems.
>
>But is our implementation breaking?

With the exception of NeXTSTEP 3.0, I haven't found an implementation that
*isn't* broken WRT large sectors.  This includes NetBSD, although I must
say our implementation has greatly improved since I first looked at the
problem, at least three years ago.

To sum up: the i386 port gets the sector boundary check wrong by a factor
of sector-size / DEV_BSIZE.  (I haven't looked at the other ports to see
whether they have similar bugs.)  ffs_mountfs(9) is broken WRT the location
of the super-block; it passes a sector number to bread, which is expecting
a DEV_BSIZE block number.  There were also several smaller problems with
mkfs computing fs fields in the superblock in terms of sector size; I don't
recall what they were offhand, but they'll be easy to recreate if you're
interested.

>
>There are two things in play.
>
>1) Right now you can define DEV_BSIZE to 2048, build a kernel (and
>userland), and run on 2k disks. You can make file systems, and they work
>fine.
>
>You can not however (simply) use any disks that have sector sizes other
>than DEV_BSIZE (*). Makes using floppies on DEV_BSIZE 2k systems not work.
>:-|
>
>(*) some drivers, like cd and the hpib/gpib disk driver for hp300 fake
>disks of one sector size (2k and 256 byte respectively) to look like
>another.
>
>2) We want to be able to use non-DEV_BSIZE disks with NetBSD. So that we
>can use 2k disks, and we can move disks from system to system. :-)
>
>I think that when you add a non-DEV_BSIZE disk to a system, our FS tools
>make/read the file system exactly as a kernel with DEV_BSIZE==that disk
>size would.

I'm sorry; trying to figure that one is giving me a headache again. :^P

>
>So in a way we define a compatability ourselves. :-)
>
>> By the way, I've tried FreeBSD, but couldn't get my DVD-RAM drive to work
>> at all (which also has 2048-byte sectors).  I _was_ able to get DVD-RAM and
>> 640MO disks formatted and written using Linux with the ext2 file system,
>> BUT Linux (at least my version) has a nasty problem with its buffer cache
>> not always getting written out to disk!  I've had at least half a dozen
>> files irrecoverably corrupted before I discovered there was a problem.
>> Still, if anybody is interested, and can tell me what blocks are relevant,
>> I can send you a dump of some of the file system blocks on those disks.
>
>Doh, yea for linux!
>
>> >Oh, please be careful, your mailer mangled them (it wrapped lines for
>> >you).
>>
>> Oops!  Apologies.  Funny, though... I didn't think my mailer (Eudora) did
>> line wrapping... they aren't wrapped in the copy I saved in my Out box... I
>> can't find any options for it either.  Oh, well.  If I end up re-posting my
>> patches, I'll try a different approach next time.
>
>Weird. I checked my INBOX to make sure that pine wasn't messing me up, and
>the lines were wrapped even there. ??

No, it's definitely not you; I checked my mail backups at my ISP, and lines
were definitely wrapped after they left the tech-kern mailing list.  I sent
myself another message directly, and it turns out to be my mailer after all.
 :-(

>
>> [most patch lines omitted to get to the comments]
>> >> @@ -446,6 +459,12 @@
>> >>  	sblock.fs_npsect = nphyssectors;
>> >>
>> >> sblock.fs_postblformat = FS_DYNAMICPOSTBLFMT;
>> >>  	sblock.fs_sbsize =
>> >> fragroundup(&sblock, sizeof(struct fs));
>> >> +	/*
>> >> +	 * Rotational tables have
>> >> been removed in favor of simplicity.
>> >> +	 */
>> >
>> >Uhm, that's not something we should be seeing in a diff dealing with block
>> >sizes. While we may be intreested in dropping rotational table support,
>> >the change shouldn't just slip in.
>>
>> This is a case where I should have followed my own advice and given more
>> detailed documentation.  The reason I removed rotational table calculations
>> (which, BTW, would only affect newly created file systems) is that the code
>> contains an awful lot of dependency on the physical structure of the disk.
>> Just thinking about how to fix it to work with the actual sector size was
>> giving me a headache. :P
>
>It should work with the actual sector size now. It just won't transition
>to DEV_BSIZE too well.

It's just something I have to visualize and understand before I can trust
it.  Maybe if I tried to figure it out on paper...

>
>[snip]
>
>> >I can't quite tell, but I want to make sure you've only changed the
>> >comments; this structure is used on-disk, and if we tweak it we have to
>> >version the FS.
>>
>> Yes.  I figured since <ufs/ffs/fs.h> is a public data structure, I wouldn't
>> dare make any changes to either the data types or member names, for fear of
>> breaking existing problems.  But I did want to add / clarify some of the
>> comments that I felt were too vague.  (Okay, so that last line above
>> probably isn't any clearer.  You can bop me for that. ;-)
>
>That's fine.
>
>> >Overall, I think it'd be easier to leave the FS code in terms of
>> >underlying sectors, and just be clear when we're looking at a sector
>>
>> I strongly disagree here.  Note that in the above segment, I only modified
>> the comment, not the code; and I didn't even change the meaning of the
>> comment at that.
>
>Ahh. Sorry, ok.
>
>> After looking at the fs code again, it may not actually hurt anything if we
>> changed the implementation of fs_nspf to use sector size instead of
>> DEV_BSIZE, because it appears to be used only in calculations involving the
>> rotation tables.  *BUT* the mkfs program currently uses fs_nspf to
>> initialize fs_fsbtodb, and the latter *definitely* breaks the fs if set in
>> terms of sector size (as evidenced by my first failed attempt to create a
>> file system on a 2048-bps virtual disk).
>
>Actually, I think the problem you saw is different. There is another fix.
>
>The problem is fs_fsbtodb and its use when we have block offsets in units
>other than disk sectors. fs_fsbtodb (as I recall) is used both to
>translate file system blocks <-> physical sectors, and file system blocks
><-> buffer cache offsets.
>
>In the work I did, that was fine since the two were the same. But the
>solution we're looking at here they are different.
>
>I think the thing to do is find uses of fs_fsbtodb and separate them into
>uses to determine an offset to disk (i.e. where newfs put something) and
>uses to determine an offset in the buffer cache. The latter should use
>"fs_fstoDEV_BSIZEb" - an offset we will make up.

Ouch!  There's a *lot* of code (well, 49 lines in ffs) that uses fsbtodb or
dbtofsb, and you say they may not all convert to/from b_blkno??  Well, I'll
look into it -- carefully.  But since most of the calls to fsbtodb and
dbtofsb use DEV_BSIZE blocks, wouldn't it be easier just to leave those
alone and change any that expect sector sizes?

>
>> OTOH, as you pointed out, DEV_BSIZE isn't a constant when transporting a
>> disk between arch's, so the use of fs_fsbtodb will break in that case too.
>> A lot of my changes depended on DEV_BSIZE being constant.
>
>NetBSD defaults to DEV_BSIZE == 512 on all arches. It's more a question of
>moving disks to systems that have been tweaked.
>
>> But the reason I'd like to have the FS code use DEV_BSIZE units instead of
>> sectors is because, as you said earlier, all block I/O functions use units
>> of DEV_BSIZE.  Again, I think the 'right' thing to do would be to separate
>> the file system from the lower layers; in this case, ignore fs_fsbtodb as
>> read from the disk, and have its value computed on the fly instead as
>> [fs_fshift - log2(DEV_BSIZE)].  I think that would work in all cases.  (Let
>> me know if I overlooked something... it seems too simple.)  Most of the ffs
>> code in fact uses frags, not sectors, so these types of changes will
>> probably be minimal.
>
>Heh. That's what I suggested above. Yes, I think this is the best way.
>
>> >Since ffs (AFAICT) reads & writes in fsblock ("fragment") sizes, if we
>> >require fsblocks be at least as big as a disk block (quite reasonable),
>> >then it won't matter for most of the code.
>>
>> That's already enforced by mkfs. :)
>>
>> It made me think of something else, though: most of the code assumes that
>> the sector size is >= (if not ==) DEV_BSIZE.  I did spot one little section
>> of code that took another branch if the sector size < DEV_BSIZE, but
>> AFAIKT, using disks with 256 bytes per sector would break most of the other
>> code.  Would we want to support sector size < DEV_BSIZE at all?
>
>The only 256-byte disks we support fake access in terms of 512-byte
>sectors. :-)
>
>I think that code would more be fore making an ffs on a floppy (512-byte
>sectors) on a system using DEV_BSIZE > 512.

Thanks for the feedback!

-----------------------
Trevin Beattie          "Do not meddle in the affairs of wizards,
trevin@xmission.com     for you are crunchy and good with ketchup."
      {:->                                     --unknown