Subject: >2T and 3.0: no joy
To: tech-kern@netbsd.org
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 01/03/2006 11:48:30
Okay, prompted by a note here, I tried 3.0 on the machine with that >2T
filesystem. No success yet.
ffsv1 fails. I made the filesystem, mounted it, and created a test file:
# newfs -s 5860701440 -F -f 8192 -b 65536 -i 1048576 /dev/rraid0d
...
# mount /dev/raid0d /mnt2
# touch /mnt2/testfile
Then I ran my checker program in write mode on /mnt2/testfile. (/mnt2
rather than /mnt because of the way I booted 3.0, /mnt was already
busy.) When it completed, I unmounted the filesystem and ran fsck, and
it failed:
# fsck_ffs -f /dev/rraid0d
** /dev/rraid0d
** File system is already clean
** Last Mounted on /mnt2
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=3 (1205368576 should be 5500335872)
CORRECT? [yn]
This despite my having created the filesystem with 8K frags and 64K
blocks. I even checked with dumpfs, and it agreed:
# dumpfs /dev/rraid0d | head
file system: /dev/rraid0d
endian little-endian
magic 11954 (UFS1) time Sat Dec 31 01:26:09 2005
superblock location 8192 id [ 43b4009b 8732de2 ]
cylgrp dynamic inodes 4.4BSD sblock FFSv2 fslevel 4
nbfree 2808990 ndir 1 nifree 2533884 nffree 13
ncg 707 size 366293840 blocks 366242926
bsize 65536 shift 16 mask 0xffff0000
fsize 8192 shift 13 mask 0xffffe000
frag 8 shift 3 fsbtodb 4
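One thing I notice, staring at that fsck complaint: the "incorrect"
count is exactly the "should be" count minus 2^32, which smells like a
32-bit truncation somewhere; my guess would be the 32-bit di_blocks
field in the UFS1 dinode, though I haven't chased the code to confirm
that. The arithmetic, spelled out as a trivial program:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* the two numbers from the fsck_ffs complaint above */
	uint64_t should_be = 5500335872ULL;
	uint64_t reported  = 1205368576ULL;

	/* what survives if the count gets squeezed into 32 bits */
	printf("should-be mod 2^32 = %u\n", (uint32_t)should_be);
	printf("difference         = %llu (2^32 is %llu)\n",
	    (unsigned long long)(should_be - reported),
	    (unsigned long long)(1ULL << 32));
	return 0;
}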
Then I tried ffsv2:
# newfs -O 2 -s 5860701440 -F -f 8192 -b 65536 -i 1048576 /dev/rraid0d
and the same mount-and-test drill. This time, fsck was happy with the
filesystem, but when I (re)mounted it and did the read phase of the
test, 2560 sectors failed the check. (I used -F because disklabel does
not deal well with partitions this large, since it has 32-bit limits,
and I didn't want newfs to get confused.)
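To put numbers on that parenthetical (purely illustrative arithmetic,
not anything disklabel itself prints): the partition's sector count no
longer fits in 32 bits, which with 512-byte sectors is exactly where
the 2T boundary comes from.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t nsect = 5860701440ULL;	/* the -s argument to newfs */
	uint64_t limit = 1ULL << 32;	/* largest count a 32-bit field holds */

	printf("partition: %llu sectors = %llu bytes\n",
	    (unsigned long long)nsect, (unsigned long long)(nsect * 512));
	printf("32-bit limit: %llu sectors = %llu bytes (the 2T mark)\n",
	    (unsigned long long)limit, (unsigned long long)(limit * 512));
	return 0;
}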
The failing sectors come in blocks of 128 sectors, starting at (sector)
offsets 159659904,
385602944, 411266944, 853581056, 1073089536, 1552911744, 1692143744,
1870314624, 2073145984, 3861888768, 3890336768, 3899723008, 3908017408,
4377027328, 4655586176, 4810056576, 4850685056, 4918631296, 5129675136,
and 5378495360. 128 sectors is fs_bsize (64K), but it is only 1/5 of
the RAIDframe stripe size, so I find it much more plausible that the
fault lies in the filesystem code than in RAIDframe. (Besides, under
2.0, /dev/rraid0d tested clean, though admittedly I didn't repeat those
tests under 3.0.)
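For anyone who wants to check that alignment claim, something like the
following will do; it's a throwaway sketch, and it assumes the stripe
is 5 x 128 = 640 sectors (per the 1/5 remark above) with stripes
starting at sector 0.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* the failing offsets listed above, in sectors */
	static const uint64_t off[] = {
		159659904, 385602944, 411266944, 853581056, 1073089536,
		1552911744, 1692143744, 1870314624, 2073145984, 3861888768,
		3890336768, 3899723008, 3908017408, 4377027328, 4655586176,
		4810056576, 4850685056, 4918631296, 5129675136, 5378495360,
	};
	const uint64_t bsize = 128;		/* fs_bsize in sectors (64K) */
	const uint64_t stripe = 5 * 128;	/* assumed stripe size in sectors */
	size_t i;

	for (i = 0; i < sizeof(off) / sizeof(off[0]); i++)
		printf("%10llu: fs block boundary: %s, stripe boundary: %s\n",
		    (unsigned long long)off[i],
		    off[i] % bsize == 0 ? "yes" : "no",
		    off[i] % stripe == 0 ? "yes" : "no");
	return 0;
}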
The really interesting thing is that one of those 128-sector blocks
reads with data that doesn't come from the write phase at all; those
sectors return data left over from a previous run, when I was testing
/dev/rraid0d directly. This means that either the writes didn't make
it to the disk or the read and write code paths ended up operating on
different disk sectors. The stale sectors are 5129675136 through
5129675263, out of roughly 5500000000 written, which puts them over
150G before the end of the write phase, too far back for it to be
plausible that they were still sitting in unflushed buffers (as an
explanation for the writes not making it to the disk).
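In case it helps to know what "the check" actually is: conceptually,
the checker is nothing fancier than the sketch below. (This is not the
real program; the option handling and the exact stamp layout are made
up for illustration.) The write phase stamps every 64K block with its
own byte offset plus a per-run tag; the read phase regenerates the
expected stamp and compares, which is how data left over from an
earlier run is recognizable as such rather than looking like random
garbage.

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 65536			/* one fs block = 128 sectors */

int
main(int argc, char **argv)
{
	static unsigned char buf[BLK], want[BLK];
	uint64_t off, run, nblk, i, j, gotoff, gotrun;
	int fd, writing;

	if (argc != 5)
		errx(1, "usage: checker w|r file run-tag nblocks");
	writing = (argv[1][0] == 'w');
	run = strtoull(argv[3], NULL, 0);
	nblk = strtoull(argv[4], NULL, 0);
	fd = open(argv[2], writing ? O_RDWR | O_CREAT : O_RDONLY, 0600);
	if (fd < 0)
		err(1, "%s", argv[2]);
	for (i = 0; i < nblk; i++) {
		off = i * BLK;
		/* expected contents: block offset and run tag, repeated */
		for (j = 0; j < BLK; j += 16) {
			memcpy(want + j, &off, sizeof(off));
			memcpy(want + j + 8, &run, sizeof(run));
		}
		if (writing) {
			if (pwrite(fd, want, BLK, (off_t)off) != BLK)
				err(1, "pwrite at %llu", (unsigned long long)off);
		} else if (pread(fd, buf, BLK, (off_t)off) != BLK) {
			err(1, "pread at %llu", (unsigned long long)off);
		} else if (memcmp(buf, want, BLK) != 0) {
			/* report whatever stamp the block does carry */
			memcpy(&gotoff, buf, sizeof(gotoff));
			memcpy(&gotrun, buf + 8, sizeof(gotrun));
			printf("miscompare at sector %llu (stamp: off %llu, run %llu)\n",
			    (unsigned long long)(off / 512),
			    (unsigned long long)gotoff,
			    (unsigned long long)gotrun);
		}
	}
	if (writing && fsync(fd) < 0)
		err(1, "fsync");
	(void)close(fd);
	return 0;
}

A miscompare whose stamp carries a plausible offset but a different run
tag is exactly the "data left over from a previous run" case above.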
Since live use would not involve single files over 2T, it's possible
ffsv1 will do. I'm going to rerun the ffsv1 tests with multiple
smaller files instead of one huge file; if that passes, we can use
ffsv1. (I never tried a read phase with ffsv1, since fsck was unhappy.)
I have saved the first 30 lines of dumpfs output from the ffsv2
filesystem, in case anyone wants them.
/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML mouse@rodents.montreal.qc.ca
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B