Subject: >2T filesystems, redux
To: tech-kern@netbsd.org
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 12/20/2005 12:44:40
Some of you may recall that I've been playing with filesystems over 2T
recently.

I can now report that it doesn't work right. :(

RAIDframe works; it provides a "disk" of some 2.73 TB, and while
disklabel seems incapable of dealing with it, it works fine when I test
/dev/rraid0d.

I can newfs it, under either ffsv1 or ffsv2.  But then when I use my
tester program (which writes data blocks whose contents uniquely
identify the block in question) on a >2T file in the filesystem, it
doesn't work.
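
For concreteness, the write phase does roughly the following - this is
a sketch, not the real code; the names are invented for illustration,
and the real program also embeds a per-run identifying tag, like the
ones visible in the dd output further down:

/* Sketch only, not the real tester: write N 512-byte blocks to a file,
 * each block filled with repeated text naming its own block number, so
 * that any fragment of a block identifies where it belongs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
 char buf[512];
 char chunk[17];
 long long blkno;
 long long nblocks;
 int off;
 int fd;

 if (argc != 3) {
    fprintf(stderr,"usage: %s file nblocks\n",argv[0]);
    exit(1);
 }
 nblocks = strtoll(argv[2],0,0);
 fd = open(argv[1],O_WRONLY|O_CREAT,0644);
 if (fd < 0) {
    perror(argv[1]);
    exit(1);
 }
 for (blkno=0;blkno<nblocks;blkno++) {
    for (off=0;off<512;off+=16) {
       /* Six dots, then the block number in a 10-column field. */
       snprintf(chunk,sizeof(chunk),"......%10lld",blkno);
       memcpy(buf+off,chunk,16);
    }
    if (write(fd,buf,512) != 512) {
       perror("write");
       exit(1);
    }
 }
 close(fd);
 return(0);
}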

With ffsv1, I wrote the file, unmounted the filesystem, and upon
running fsck it had large numbers of errors; I saw lots of "DUP/BAD"
block numbers.  Taking the block numbers and converting them to 4-byte
character strings, they were obviously extracts from data blocks
(though not long enough extracts to identify which blocks they were
from).  So I assumed that an indirect block got stomped on by a data
block (or an inode block, but that would almost certainly have produced
a bunch more errors).
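
(To be clear about that conversion: I just reinterpreted each 32-bit
block number as the four bytes of text it spells, something like the
sketch below - not the exact code I used, and the byte order is the
host's, so the string could come out reversed on the other
endianness.)

/* Sketch: print a 32-bit block number as the four characters its
 * bytes spell, in host byte order.  Non-printable bytes come out
 * raw. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char **argv)
{
 uint32_t n;
 char s[5];
 int i;

 for (i=1;i<argc;i++) {
    n = (uint32_t)strtoul(argv[i],0,0);
    memcpy(s,&n,4);
    s[4] = '\0';
    printf("%s -> \"%s\"\n",argv[i],s);
 }
 return(0);
}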

With ffsv2, I wrote the file, unmounted the filesystem, and fsck was
happy.  So I remounted and ran my program in check mode, in which,
rather than writing, it reads and compares against what it would have
written.

Everything was fine for the first 11.424+ GB; then it started showing
errors.  Looking at the first erroneous block of the test file - block
#23958144 - I find that it contains a data block from the ffsv1 test!
(The program
can also be made to put identifying strings into the blocks, so I can
tell which "write" run a given block came from.)
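
(For reference: 23958144 blocks of 512 bytes is 12,266,569,728 bytes,
which is where the 11.424+ GB figure above comes from.)  The check
phase is essentially the write phase run backwards - again a sketch,
not the real code:

/* Sketch of the check phase: regenerate what each 512-byte block
 * should contain (same scheme as the write sketch above) and compare
 * it with what the file actually holds. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

static void mkblock(char *buf, long long blkno)
{
 char chunk[17];
 int off;

 for (off=0;off<512;off+=16) {
    snprintf(chunk,sizeof(chunk),"......%10lld",blkno);
    memcpy(buf+off,chunk,16);
 }
}

int main(int argc, char **argv)
{
 char want[512];
 char got[512];
 long long blkno;
 int fd;

 if (argc != 2) {
    fprintf(stderr,"usage: %s file\n",argv[0]);
    exit(1);
 }
 fd = open(argv[1],O_RDONLY);
 if (fd < 0) {
    perror(argv[1]);
    exit(1);
 }
 for (blkno=0;read(fd,got,512)==512;blkno++) {
    mkblock(want,blkno);
    if (memcmp(want,got,512)) printf("block %lld differs\n",blkno);
 }
 close(fd);
 return(0);
}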

Here's what I see:

[Backup - root] 78> dd bs=512 count=1 if=/mnt/testfile skip=23958143
raid0d-ffsv2-2005-12-18...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143...............23958143.....>1+0 records in
1+0 records out
512 bytes transferred in 0.022 secs (23272 bytes/sec)
[Backup - root] 79> dd bs=512 count=1 if=/mnt/testfile skip=23958144
raid0d-file-20051216640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.............1608736640.....>1+0 records in
1+0 records out
512 bytes transferred in 0.001 secs (512000 bytes/sec)
[Backup - root] 80> 

The first block (skip=23958143) is correct; the second is the one I
refer to above, which the "raid0d-file-20051216" tag indicates came
from the ffsv1 test.

This strikes me as very odd, because it means that the on-disk blocks
written by the write phase are not the same ones read back by the read
phase for the same file offsets.  I can only conjecture that there's a
rather strange bug somewhere in one of those code paths but not the
other.

And yes, I was careful to use large block and frag sizes (for both
filesystems, even though I think only ffsv1 has inherent issues with
more than 2^32 frags in a filesystem).  I believe I used 8K frags and
64K blocks; I certainly did for the ffsv2 run, and I think I copied the
values from the ffsv1 filesystem when I made the ffsv2 filesystem.
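
(From memory, the ffsv2 newfs was something along the lines of
"newfs -O 2 -b 65536 -f 8192 /dev/rraid0d", with the ffsv1 one the
same except -O 1; I won't swear to the exact flags I typed.)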

On rereading, I see my terminology above is a little confusing; "block"
refers to too many things.  Where my test program is concerned, a
"block" is 512 bytes, and this has nothing to do with the fs_bsize or
fs_fsize of any filesystem.

Any thoughts on how to make this work?  I can probably use the
filesystem for more testing for a few days, but if I can't resolve this
within a week or so it'll probably have to go back into production as
two smaller filesystems - the space we're using in its stead right now
is a bit too small, and is getting full.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B