Subject: >2T and 3.0: no joy
To: tech-kern@netbsd.org
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 01/03/2006 11:48:30
Okay, prompted by a note here, I tried 3.0 on the machine with that >2T
filesystem.  No success yet.

ffsv1 fails.  I made the filesystem, mounted it, and created a test file:

# newfs -s 5860701440 -F -f 8192 -b 65536 -i 1048576 /dev/rraid0d
...
# mount /dev/raid0d /mnt2
# touch /mnt2/testfile
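
For scale, and in case anyone wants to check my arithmetic: assuming
512-byte sectors, that -s value is about 2.7T, and -b/-f give the
usual 8 frags per block.

#include <stdio.h>

int
main(void)
{
	unsigned long long sectors = 5860701440ULL;	/* newfs -s */
	unsigned long long bytes = sectors * 512ULL;	/* 512-byte sectors */

	printf("%llu bytes = %.2f TiB\n", bytes,
	    (double)bytes / (1ULL << 40));
	printf("%d frags per block\n", 65536 / 8192);	/* -b / -f */
	return 0;
}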

Then I ran my checker program in write mode on /mnt2/testfile.  (/mnt2
rather than /mnt because, the way I booted 3.0, /mnt was already
busy.)  When the write pass completed, I unmounted the filesystem and
ran fsck, which was not happy:

# fsck_ffs -f /dev/rraid0d
** /dev/rraid0d
** File system is already clean
** Last Mounted on /mnt2
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=3 (1205368576 should be 5500335872)
CORRECT? [yn] 

This despite my having created the filesystem with 8K frags and 64K
blocks.  I even checked with dumpfs, and it agreed:

# dumpfs /dev/rraid0d | head
file system: /dev/rraid0d
endian  little-endian
magic   11954 (UFS1)    time    Sat Dec 31 01:26:09 2005
superblock location     8192    id      [ 43b4009b 8732de2 ]
cylgrp  dynamic inodes  4.4BSD  sblock  FFSv2   fslevel 4
nbfree  2808990 ndir    1       nifree  2533884 nffree  13
ncg     707     size    366293840       blocks  366242926
bsize   65536   shift   16      mask    0xffff0000
fsize   8192    shift   13      mask    0xffffe000
frag    8       shift   3       fsbtodb 4
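
Incidentally, the two block counts in that fsck complaint differ by
exactly 2^32, which smells like a 32-bit count wrapping once somewhere
on the v1 side (if I remember the v1 dinode right, its block count
field is only 32 bits wide).  Trivial check:

#include <stdio.h>

int
main(void)
{
	unsigned long long found = 1205368576ULL;	/* count in the inode */
	unsigned long long want = 5500335872ULL;	/* count fsck computed */

	printf("difference %llu, 2^32 is %llu\n", want - found, 1ULL << 32);
	return 0;
}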

Then I tried ffsv2:

# newfs -O 2 -s 5860701440 -F -f 8192 -b 65536 -i 1048576 /dev/rraid0d

and the same mount-and-test drill.  This time, fsck was happy with the
filesystem, but when I (re)mounted it and did the read phase of the
test, 2560 sectors failed the check.  (I used -F on both newfs runs
because disklabel, with its 32-bit limits, does not deal well with
partitions this large, and I didn't want newfs to get confused.)

They are blocks of 128 sectors starting at (sector) offsets 159659904,
385602944, 411266944, 853581056, 1073089536, 1552911744, 1692143744,
1870314624, 2073145984, 3861888768, 3890336768, 3899723008, 3908017408,
4377027328, 4655586176, 4810056576, 4850685056, 4918631296, 5129675136,
and 5378495360.  128 sectors is one fs_bsize block but only 1/5 of the
RAIDframe stripe size, so I find it much more plausible that the fault
lies in the filesystem code than in RAIDframe.  (Besides, under 2.0,
/dev/rraid0d tested clean, though admittedly I didn't repeat those
tests under 3.0.)
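
Spelling out the size arithmetic that argument leans on (fs_bsize is
from the dumpfs output above; the stripe size follows from 128 being
1/5 of it):

#include <stdio.h>

int
main(void)
{
	int bad = 128;			/* sectors per failing block */
	int fs_bsize = 65536;		/* from dumpfs */
	int stripe = bad * 5;		/* 128 sectors is 1/5 of a stripe */

	printf("128 sectors = %d bytes (fs_bsize %d)\n", bad * 512, fs_bsize);
	printf("stripe = %d sectors = %dK\n", stripe, stripe * 512 / 1024);
	return 0;
}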

The really interesting thing is that one of those 128-sector blocks
reads back data that doesn't come from the write phase at all; those
sectors return data left over from a previous run, when I was testing
/dev/rraid0d directly.  This means that either the writes didn't make
it to the disk or the read and write code paths ended up operating on
different disk sectors.  Moreover, they're the sectors from 5129675136
through 5129675263, out of 5500000000, and thus over 150G from the end
of the writing; that's too far for it to be plausible that they could
have remained in unflushed buffers (as an explanation for the writes
not making it to the disk).
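
For concreteness, here is a sketch of the sort of write/verify test I
mean.  It is not the actual checker, and the pattern details are
invented, but it shows how offset- and run-keyed data makes the "left
over from a previous run" diagnosis possible: each 512-byte sector is
stamped with its own sector number plus a per-run tag.

/*
 * Sketch only, not the real checker.  The write phase stamps every
 * sector with a pattern derived from its sector number and a per-run
 * tag; the read phase regenerates the pattern and compares.
 */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECSIZE 512

static uint64_t buf[SECSIZE / sizeof(uint64_t)];
static uint64_t want[SECSIZE / sizeof(uint64_t)];

static void
fill(uint64_t *p, uint64_t sec, uint64_t tag)
{
	size_t i;

	/* sector number in the high bits, run tag in the middle, word
	   index in the low bits, so runs with different (small) tags
	   never write identical sector contents */
	for (i = 0; i < SECSIZE / sizeof(*p); i++)
		p[i] = (sec << 16) ^ (tag << 8) ^ i;
}

int
main(int argc, char **argv)
{
	uint64_t nsec, sec, tag;
	int fd, writing;

	if (argc != 5)
		errx(1, "usage: %s write|read file nsectors tag", argv[0]);
	writing = (strcmp(argv[1], "write") == 0);
	nsec = strtoull(argv[3], NULL, 0);
	tag = strtoull(argv[4], NULL, 0);

	fd = open(argv[2], writing ? (O_WRONLY | O_CREAT) : O_RDONLY, 0600);
	if (fd < 0)
		err(1, "%s", argv[2]);

	for (sec = 0; sec < nsec; sec++) {
		if (writing) {
			fill(buf, sec, tag);
			if (write(fd, buf, SECSIZE) != SECSIZE)
				err(1, "write, sector %llu",
				    (unsigned long long)sec);
		} else {
			if (read(fd, buf, SECSIZE) != SECSIZE)
				err(1, "read, sector %llu",
				    (unsigned long long)sec);
			fill(want, sec, tag);
			if (memcmp(buf, want, SECSIZE) != 0)
				printf("mismatch at sector %llu\n",
				    (unsigned long long)sec);
		}
	}
	(void)close(fd);
	return 0;
}

Run the write phase with one tag and the read phase with the same tag;
a sector that comes back carrying a decodable but different tag is old
data from an earlier run, not corruption of this run's writes.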

Since live use would not involve single files over 2T, it's possible
ffsv1 will do.  I'm going to rerun the ffsv1 tests with multiple
smaller files instead of one huge file; if that passes, we can use
ffsv1.  (I never tried a read phase with ffsv1, since fsck was unhappy.)

I have saved the first 30 lines of dumpfs output from the ffsv2
filesystem, in case anyone wants them.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B