tech-kern: Re: NetBSD, apple fibre-channel card & 2.8TB Xserve-RAID

Subject: Re: NetBSD, apple fibre-channel card & 2.8TB Xserve-RAID
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 12/04/2004 03:00:04
>> Is anyone using filesystems over 1TB successfully?
> As far as I can tell it is simply not possible on NetBSD, at least
> not on any released version.   :-)

:-(

>> How about with FFSv1?
> Did you mean "v2"?

No.  That follow-on question was written assuming the answer to the
former would be "yes", based on this text (which I found in
http://www.netbsd.org/Misc/features.html):

   NetBSD has shipped with 64-bit filesystems since the 1.0 release in
   October 1994. Under NetBSD berkeley fast filesystems can be up to 4TB
   (4096GB) in size, on both 64 and 32 bit machines. Files and user file
   quotas can also reach terabytes. Many other systems limit filesize to
   4GB on 32bit machines.

   An ffs can have up to 2^31 fragment blocks - the maximum filesystem
   size is dependent on the fragment size:
   Frag size fs size
   512 bytes 1 TB
   1kB 2 TB
   2kB 4 TB

Ignoring the mangled table (which is mostly because that was formatted
by code that doesn't really understand tables), it's fairly clear that
filesystems over 1TB are _supposed_ to work.
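
For what it's worth, the arithmetic behind that table is easy to check.
The following is only a sketch in Python, assuming the 2^31 fragment
limit quoted above is the sole constraint; the names MAX_FRAGS and
max_fs_bytes are mine, not anything from the FFS code.

```python
# Maximum FFS size per fragment size, assuming the only limit is the
# 2^31 fragment count quoted from the features page.
MAX_FRAGS = 2 ** 31   # assumed limit, taken from the quoted text
TB = 2 ** 40          # "TB" here means 2^40 bytes, as in the quoted table


def max_fs_bytes(frag_size):
    """Largest filesystem, in bytes, for a given fragment size in bytes."""
    return MAX_FRAGS * frag_size


if __name__ == "__main__":
    for frag_size in (512, 1024, 2048):
        print(frag_size, "byte frags:", max_fs_bytes(frag_size) // TB, "TB")
```

With 512-byte fragments this comes out at exactly 2^40 bytes, which is
where the 1TB figure comes from.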

> Either way there's still the issue of the disklabel, both in-kernel
> and on-disk as far as I can tell at the moment, only being able to
> represent 2^31-1 sectors,

I saw no obvious problems with disks and partitions over 1T with
2.0RC4.  I am not able to try going above 2T, though.

> I tried ignoring the fact that the disklabel userland code reported
> negative numbers

This sounds like my own experiences with the 1.6.2 disklabel.  Did you
try any of the 2.0 RCs?

>> we suspect there is something going wonky at the 1TB mark, either
>> it's reading/writing the wrong sectors, or the buffer cache is
>> getting confused, or some such.
> At, or beyond, the 2^31-1 sector mark, do you think?

Yes, I suspect the breaking point is 1TB because it's 2^31 half-K
sectors.  I have no reason to think that a filesystem of 2^31-1 sectors
differs significantly from a filesystem of 2^31 sectors (assuming a
fragsize over 512; at 512, exactly 2^31 sectors may well run into a
sign-extension bug for all I know).
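
To illustrate the kind of failure I mean (and only to illustrate; I am
not claiming this is where any actual bug lives), here is a small
Python sketch of what happens when a sector number passes through a
signed 32-bit variable before being widened and scaled to a byte
offset.  The function names are hypothetical.

```python
# What a sector number looks like after a trip through a signed 32-bit
# variable, and the byte offset computed from it afterwards.
SECTOR_SIZE = 512


def as_signed32(n):
    """Reinterpret the low 32 bits of n as a two's-complement int32."""
    return (n + 2 ** 31) % 2 ** 32 - 2 ** 31


def offset_sign_extended(sector):
    """Byte offset if the sector number is sign-extended from 32 bits."""
    return as_signed32(sector) * SECTOR_SIZE


def offset_correct(sector):
    """Byte offset computed in full-width arithmetic."""
    return sector * SECTOR_SIZE


if __name__ == "__main__":
    for sector in (2 ** 31 - 1, 2 ** 31):
        print(sector, offset_correct(sector), offset_sign_extended(sector))
```

Sector 2^31-1 gives the same offset both ways; sector 2^31 gives 2^40
done correctly and -2^40 when sign-extended, which is the sort of thing
that would make trouble start right at the 1TB mark.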

>> The most likely candidate [for the bug] to my mind is some kind of
>> 32-/64-bit bug, possibly sign-extending when it should be
>> zero-extending, or maybe using a 32-bit datatype (maybe
>> inadvertently) where a 64-bit type is called for.
> Perhaps such a bug is avoided on alpha and other LP64 systems?

Maybe.  I'd be more confident on an ILP64 system, but I don't think
NetBSD has any of those yet.  I'm not at all sure it is a 32-/64-bit
bug, though.

>> But much of the weirdness I recently reported on tech-kern with
>> directories that appear fine to fsck but the kernel acts bizarre
>> with can be rather neatly explained by assuming such bugs in the
>> block device code paths, or perhaps the filesystem disk-interface
>> code paths.
> If you can suggest any way to reproduce such weirdness

(1) Create a big file.  In my case, I ran "btoa < /netbsd > z", and
then catted enough copies of z together to exceed 10G.

(2) Compress this file.  I used gzip --fast; someone else I've been
corresponding with says the choice of compression program is
irrelevant, as long as it has sanity checks on uncompression.

(3) Uncompress the file to /dev/null.  Do you get an error?  I do.
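
The three steps above could be scripted roughly as follows.  This is a
sketch in Python rather than what I actually ran (I used btoa, cat and
gzip from the shell); the function names, the chunk size and the seed
data are all assumptions of mine, and to be a meaningful test the file
has to be created on the suspect filesystem at a size like the 10G
mentioned above.

```python
import gzip
import shutil
import zlib

CHUNK = 1 << 20  # copy/read granularity; an arbitrary choice


def build_file(path, seed, min_bytes):
    """Step 1: write copies of seed (bytes) until path holds min_bytes."""
    if not seed:
        raise ValueError("seed must be non-empty")
    written = 0
    with open(path, "wb") as out:
        while written < min_bytes:
            out.write(seed)
            written += len(seed)
    return written


def compress(src, dst):
    """Step 2: gzip src into dst; compresslevel=1 matches gzip --fast."""
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=1) as fout:
        shutil.copyfileobj(fin, fout, CHUNK)


def decompress_ok(path):
    """Step 3: decompress and discard; False if the sanity checks trip."""
    try:
        with gzip.open(path, "rb") as fin:
            while fin.read(CHUNK):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

If decompress_ok() returns False for a file that compress() has just
written, something between the write and the read has damaged the data.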

I first tried this with 1.5G instead of 10G, and it didn't error.  The
compressed file was slightly smaller than the machine's RAM, though (it
has 1G of RAM, and the file was 1.01e+9 bytes, some tens of megs less
than the RAM).  I now suspect that what matters is that cg 0 fill up,
rather than that the compressed file be larger than RAM.

However, I have in mind a much simpler test: fill the entire "disk"
(it's a "hardware" (presumably really firmware) RAID array) with data
such that by examining a block's contents you can tell what block it
is.  Then read it all back and see if all the blocks' contents are
correct. Repeat using the block device if the raw device passes this
test (I expect the raw device to pass and the cooked device to fail).
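
A minimal sketch of that test, in Python for illustration only.  The
block size, the stamp format and the function names are my own
assumptions; I have only thought about this against a regular file, and
a real run would point it at the raw and block device nodes and would
want much larger I/Os for speed.

```python
import struct

BLOCK = 512  # assumed block size in bytes


def stamp(blkno):
    """Contents for block blkno: its number as 8 bytes, repeated."""
    return struct.pack(">Q", blkno) * (BLOCK // 8)


def fill(path, nblocks):
    """Write nblocks self-identifying blocks to path, in order."""
    with open(path, "wb") as f:
        for blkno in range(nblocks):
            f.write(stamp(blkno))


def verify(path, nblocks):
    """Read the blocks back; return the numbers of any that are wrong."""
    bad = []
    with open(path, "rb") as f:
        for blkno in range(nblocks):
            if f.read(BLOCK) != stamp(blkno):
                bad.append(blkno)
    return bad
```

Because every block carries its own number, a bad block's contents show
not just that it is wrong but which block's data landed there, which
should make a wrong-sector bug easy to tell apart from plain
corruption.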

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B