Subject: Re: 2.0 and >2T filesystems
To: None <tech-kern@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 11/24/2005 15:09:47
>> "fsbtodb(di->di_db[x])" which [...] needs to be something more like
>> "fsbtodb((u_int64_t)di->di_db[x])" to make it 32-bit-clean.
> (daddr_t) might be more correct -

Ummm...yes, I'd say you're right.

> but that doesn't look good...

Well, no, *if* that's what's wrong.
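
For concreteness, a minimal sketch of the kind of truncation such a cast
would guard against.  Everything named here is made up for illustration:
fsbtodb_narrow, fsbtodb_wide, and the shift of 3 (which assumes
4096-byte frags on 512-byte sectors) are stand-ins, not the real macro
or the code under suspicion.

```c
#include <stdint.h>

/* Assumed shift: 4096-byte frags, 512-byte sectors (hypothetical value). */
#define SKETCH_FSBTODB_SHIFT 3

/*
 * Shift done in 32 bits: any bits pushed past bit 31 are lost before
 * the result is widened to 64 bits.
 */
static uint64_t
fsbtodb_narrow(uint32_t fsb)
{
	return (uint32_t)(fsb << SKETCH_FSBTODB_SHIFT);
}

/*
 * Widen first, then shift: the full sector number survives.  This is
 * what casting the argument to a 64-bit type achieves.
 */
static uint64_t
fsbtodb_wide(uint32_t fsb)
{
	return (uint64_t)fsb << SKETCH_FSBTODB_SHIFT;
}
```

With these assumptions, frag number 0x20000000 comes out as sector 0
from the narrow version and as sector 4294967296 from the wide one;
small frag numbers give the same answer either way, which is why this
sort of bug stays hidden until the disk is big enough.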

>> I'm doing more tests.  But they're slow - slinging multiple
>> terabytes around is - so it'll be a while before I have any results.
> Since the issue is accessing sectors above 4T, maybe this can be done
> by creating a filesystem with a small number of inodes, and filling
> it with small files.  IIRC the data blocks are preferentially
> allocated from the 'cylinder group' that contains the inode.

Probably.  But first, I want to make sure the disk subsystem itself is
clean, so I'm running a test that writes a distinctive data pattern to
each data block; I'll then rerun it in check mode, where it reads and
makes sure it gets what it would have written.  The data pattern is
such that a block's contents identify the block; for example,

% dd if=/dev/rraid0d bs=512 count=1 skip=12345 | hexdump -C
00000000  3c 2e 2e 2e 2e 2e 2e 2e  2e 2e 2e 2e 2e 2e 2e 2e  |<...............|
00000010  2e 2e 31 32 33 34 35 2e  2e 2e 2e 2e 2e 2e 2e 2e  |..12345.........|
00000020  2e 2e 2e 2e 2e 2e 2e 2e  2e 31 32 33 34 35 2e 2e  |.........12345..|
[...]
000001c0  2e 2e 2e 2e 2e 2e 2e 31  32 33 34 35 2e 2e 2e 2e  |.......12345....|
000001d0  2e 2e 2e 2e 2e 2e 2e 2e  2e 2e 2e 2e 2e 2e 31 32  |..............12|
000001e0  33 34 35 2e 2e 2e 2e 2e  2e 2e 2e 2e 2e 2e 2e 2e  |345.............|
000001f0  2e 2e 2e 2e 2e 31 32 33  34 35 2e 2e 2e 2e 2e 3e  |.....12345.....>|
00000200

So, if the disk subsystem folds blocks onto one another, I can tell
which block landed on top of the block I read.  (I see no point in even
trying to look for issues in the filesystem code if the disk subsystem
isn't free of this kind of error.)  I did just now try reading block
4294967296 with dd skip=, and I did not get the data I get with skip=0,
so it's not *totally* busted.
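
A sketch of how a self-identifying pattern like the one above could be
generated and checked.  This is not the actual test program; the first
offset (18) and the stride (23) are simply read off the sample dump for
block 12345, and how they behave for numbers of other lengths is an
assumption.  The names fill_block, block_identity and check_block are
hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECSIZE   512
#define FIRST_OFF 18	/* offset of the first number in the sample dump */
#define STRIDE    23	/* spacing between numbers in the sample dump */

/* Fill buf (SECSIZE bytes) with a pattern that identifies block blkno. */
static void
fill_block(unsigned char *buf, uint64_t blkno)
{
	char num[32];
	int len, off;

	len = snprintf(num, sizeof(num), "%llu", (unsigned long long)blkno);
	memset(buf, '.', SECSIZE);
	buf[0] = '<';
	buf[SECSIZE - 1] = '>';
	/* Repeat the decimal block number, never touching the final '>'. */
	for (off = FIRST_OFF; off + len <= SECSIZE - 1; off += STRIDE)
		memcpy(buf + off, num, (size_t)len);
}

/* Return the block number a buffer claims to be (first copy only). */
static uint64_t
block_identity(const unsigned char *buf)
{
	uint64_t n = 0;
	int off;

	for (off = FIRST_OFF; off < SECSIZE - 1 &&
	    buf[off] >= '0' && buf[off] <= '9'; off++)
		n = n * 10 + (uint64_t)(buf[off] - '0');
	return n;
}

/* Check mode: nonzero if buf holds exactly what blkno should contain. */
static int
check_block(const unsigned char *buf, uint64_t blkno)
{
	unsigned char want[SECSIZE];

	fill_block(want, blkno);
	return memcmp(buf, want, SECSIZE) == 0;
}
```

When check_block fails for a block, block_identity on the same buffer
says which block's data landed there instead, which is the point of
making the pattern self-identifying.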

I know there were issues at the 1T point, the point at which signed
32-bit sector numbers become negative, and mycroft's patch made those
go away.  My current test run is to see if there are any issues at the
2T point, where *un*signed 32-bit sector numbers wrap.  If this passes,
I'll (provisionally) consider the disk subsystem clean and move on to
tests on the filesystem layer.  You say ffs1 is supposed to work with
large enough frags, but that's known to break; I'll try ffs2, which
should help narrow down where the error is (i.e., either in code that's
shared or code that's not, depending on whether it works).
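
The arithmetic behind the 1T and 2T points, as a small sketch.  It
assumes 512-byte sectors (matching the bs=512 used with dd above); the
function names are hypothetical.

```c
#include <stdint.h>

#define SECTOR_BYTES 512ULL	/* assumed sector size */

/* Byte offset of the first sector a signed 32-bit number can't hold. */
static uint64_t
signed32_limit_bytes(void)
{
	return ((uint64_t)INT32_MAX + 1) * SECTOR_BYTES;
}

/* Byte offset of the first sector an unsigned 32-bit number can't hold. */
static uint64_t
unsigned32_limit_bytes(void)
{
	return ((uint64_t)UINT32_MAX + 1) * SECTOR_BYTES;
}

/* Sector that 'sector' would alias if truncated to unsigned 32 bits. */
static uint64_t
fold_u32(uint64_t sector)
{
	return (uint32_t)sector;
}
```

Under that assumption the signed limit is 2^40 bytes (1T) and the
unsigned limit is 2^41 bytes (2T), and sector 4294967296 folds onto
sector 0, which is why comparing skip=4294967296 against skip=0 is a
reasonable quick check.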

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B