Subject: Re: NetBSD, apple fibre-channel card & 2.8TB Xserve-RAID
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 12/04/2004 09:44:59
On Sat, Dec 04, 2004 at 03:00:04AM -0500, der Mouse wrote:
> >> Is anyone using filesystems over 1TB successfully?
> > As far as I can tell it is simply not possible on NetBSD, at least
> > not on any released version.   :-)
> 
> :-(
> 
> >> How about with FFSv1?
> > Did you mean "v2"?
> 
> No.  That follow-on question was written assuming the answer to the
> former would be "yes", based on this text (which I found in
> http://www.netbsd.org/Misc/features.html):
> 
>    NetBSD has shipped with 64-bit filesystems since the 1.0 release in
>    October 1994. Under NetBSD berkeley fast filesystems can be up to 4TB
>    (4096GB) in size, on both 64 and 32 bit machines. Files and user file
>    quotas can also reach terabytes. Many other systems limit filesize to
>    4GB on 32bit machines.
> 
>    An ffs can have up to 2^31 fragment blocks - the maximum filesystem
>    size is dependent on the fragment size:
>    Frag size   fs size
>    512 bytes   1 TB
>    1kB         2 TB
>    2kB         4 TB
> 
> Ignoring the mangled table (which is mostly because that was formatted
> by code that doesn't really understand tables), it's fairly clear that
> filesystems over 1TB are _supposed_ to work.

I think the documentation is more an indication that someone did the math
to determine the sizes supported by the on-disk data structures than an
indication that netbsd's implementation ever worked correctly with file
systems that large.

the comment was added in November 2000, which was long before our daddr_t
type became 64 bits.
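
as a back-of-the-envelope check (this is just the arithmetic, not code from
the tree): the on-disk structures do allow the sizes in that table, but with
the old 32-bit signed daddr_t, device block numbers in DEV_BSIZE units top
out at 1TB, so anything larger couldn't have worked regardless of the
on-disk limits.

#include <stdio.h>
#include <stdint.h>

#define DEV_BSIZE	512

int
main(void)
{
	int64_t fsize;

	/* on-disk limit: 2^31 fragment blocks times the fragment size */
	for (fsize = 512; fsize <= 2048; fsize *= 2)
		printf("%4lld-byte frags: %lld TB\n", (long long)fsize,
		    (long long)(((int64_t)1 << 31) * fsize >> 40));

	/* but a 32-bit signed daddr_t only addresses about 2^31 DEV_BSIZE
	 * (512-byte) blocks, i.e. 1TB, no matter what the frag size is */
	printf("32-bit daddr_t limit: %lld TB\n",
	    (long long)(((int64_t)1 << 31) * DEV_BSIZE >> 40));
	return 0;
}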


> >> The most likely candidate [for the bug] to my mind is some kind of
> >> 32-/64-bit bug, possibly sign-extending when it should be
> >> zero-extending, or maybe using a 32-bit datatype (maybe
> >> inadvertently) where a 64-bit type is called for.

while investigating PR 28291, I saw evidence of unwanted sign-extension.
a buffer had

  b_blkno = 0xffffffff800750a0,

which triggers this clause in ufs_strategy():

	if (bp->b_blkno < 0) { /* block is not on disk */
		biodone(bp);
		return (0);
	}

the buffer had a softdep dependency attached, so biodone() tried to process
that, but since the buffer didn't go through spec_strategy(), the i/o-start
part of the dependency processing never happened, and the i/o-done softdep
processing ends up setting b_data to NULL, which we trip over later.

I think it's due to this (and similar) code in ufs_bmaparray():

			*bnp = blkptrtodb(ump,
			    (int32_t)ufs_rw32(ip->i_ffs1_db[bn],
			    UFS_MPNEEDSWAP(vp->v_mount)));

the check-in comments indicate these casts were added for LFS:

----------------------------
revision 1.24
date: 2003/07/23 13:36:17;  author: yamt;  state: Exp;  lines: +11 -8
cast UFS1 on-disk block pointers to int32_t before assign it to daddr_t.
it's needed for LFS because UNWRITTEN is a negative number.
----------------------------
...
----------------------------
revision 1.20
date: 2003/03/21 15:46:32;  author: fvdl;  state: Exp;  lines: +3 -3
LFS likes to store negative values in the dinode block pointers, so
make sure to cast the value back to int32_t after it was changed
by ufs_rw32, before passing it to blkptrtodb.
----------------------------


I tried removing all those casts to see if that fixed the problem for UFS1,
but it didn't.  I didn't pursue it further.
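
for what it's worth, here's a minimal userland sketch of one way a b_blkno
like the one above can arise: if a block-number calculation such as the
blkptrtodb() shift is done in 32-bit arithmetic, the result can go negative
and then gets sign-extended when it's assigned to the 64-bit daddr_t.  the
block pointer and shift below are made-up values chosen to land on the same
bit pattern, not numbers taken from the PR:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* made-up values: a UFS1 block pointer and a fsbtodb shift of 2
	 * (2k frags over 512-byte device blocks) */
	int32_t blkptr = 0x2001d428;
	int64_t bnp;

	/* shift done in 32-bit arithmetic: the result no longer fits in
	 * 31 bits, goes negative, and sign-extends on assignment to the
	 * 64-bit daddr_t.  (the unsigned shift just keeps this sketch
	 * free of signed-overflow issues.) */
	int32_t db = (int32_t)((uint32_t)blkptr << 2);
	bnp = db;
	printf("%#llx\n", (unsigned long long)bnp);
	/* prints 0xffffffff800750a0 */

	/* widening to 64 bits before the shift gives a sane block number */
	bnp = (int64_t)blkptr << 2;
	printf("%#llx\n", (unsigned long long)bnp);
	/* prints 0x800750a0 */

	return 0;
}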


> > If you can suggest any way to reproduce such weirdness
> 
> (1) Create a big file.  In my case, I ran "btoa < /netbsd > z", and
> then catted enough copies of z together to exceed 10G.
> 
> (2) Compress this file.  I used gzip --fast; someone else I've been
> corresponding with says exactly what the compression program is is
> irrelevant, as long as it has sanity checks on uncompression.
> 
> (3) Uncompress the file to /dev/null.  Do you get an error?  I do.

you could just fill up the fs with copies of a file and see if they
all have the same cksum.  less CPU-intensive than gzip.
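
something along these lines would do it.  this is only a rough sketch
(file names, counts and sizes are made up), and it checks each chunk's
contents directly instead of comparing cksums, which also tells you where
a bad block landed:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK		65536
#define CHUNKS_PER_FILE	1024		/* 64MB per file */

int
main(void)
{
	static char wbuf[CHUNK], rbuf[CHUNK];
	char path[64];
	int i, j, fd, nfiles = 0;

	/* write numbered files until the file system fills up, stamping
	 * each chunk with its file number and offset */
	for (i = 0; ; i++) {
		snprintf(path, sizeof(path), "fill.%06d", i);
		if ((fd = open(path, O_WRONLY|O_CREAT|O_TRUNC, 0644)) == -1)
			err(1, "%s", path);
		for (j = 0; j < CHUNKS_PER_FILE; j++) {
			memset(wbuf, 0, CHUNK);
			snprintf(wbuf, CHUNK, "file %d chunk %d", i, j);
			if (write(fd, wbuf, CHUNK) != CHUNK)
				goto full;	/* presumably ENOSPC */
		}
		close(fd);
		nfiles = i + 1;
	}
full:
	close(fd);

	/* read every complete file back and check its contents */
	for (i = 0; i < nfiles; i++) {
		snprintf(path, sizeof(path), "fill.%06d", i);
		if ((fd = open(path, O_RDONLY)) == -1)
			err(1, "%s", path);
		for (j = 0; j < CHUNKS_PER_FILE; j++) {
			memset(wbuf, 0, CHUNK);
			snprintf(wbuf, CHUNK, "file %d chunk %d", i, j);
			if (read(fd, rbuf, CHUNK) != CHUNK ||
			    memcmp(wbuf, rbuf, CHUNK) != 0)
				errx(1, "mismatch in %s chunk %d", path, j);
		}
		close(fd);
	}
	printf("%d files verified\n", nfiles);
	return 0;
}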


> I first tried this with 1.5G instead of 10G, and it didn't error.  The
> compressed file was slightly less than the machine's RAM, though (it
> has 1G ram, and the file was 1.01e+9 bytes, some tens of megs less than
> the RAM).  I now suspect that it is important that cg 0 fill up more
> than that the compressed file be larger than RAM.
> 
> However, I have in mind a much simpler test: fill the entire "disk"
> (it's a "hardware" (presumably really firmware) RAID array) with data
> such that by examining a block's contents you can tell what block it
> is.  Then read it all back and see if all the blocks' contents are
> correct. Repeat using the block device if the raw device passes this
> test (I expect the raw device to pass and the cooked device to fail).

the bugs are likely in the UFS/FFS code, so exercising the specfs code
via block devices from userland won't expose them.

-Chuck