Subject: Re: NetBSD, apple fibre-channel card & 2.8TB Xserve-RAID
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 12/05/2004 15:03:29
[ On Saturday, December 4, 2004 at 18:41:48 (-0500), der Mouse wrote: ]
> Subject: Re: NetBSD, apple fibre-channel card & 2.8TB Xserve-RAID
>
> >> (1) Create a big file. [...]
> >> (2) Compress this file. I used gzip --fast [...]
> >> (3) Uncompress the file to /dev/null. Do you get an error? I do.
> > Yikes! Indeed I do!
>
> > a: 2147483647 0 4.2BSD 2048 16384 27 # (Cyl. 0 - 131071*)
>
> > [console]<@> # gzcat < biggest-file.gz > /dev/null
> >
> > gzcat: stdin: invalid compressed data--length error
> > [console]<@> #
>
> I find this _very_ disturbing. Your partition is actually below the
> 1TB mark, albeit not by much, which means that most of my theories -
> and I think the theory put forth about it being a sign-extension bug in
> the FFS code - are now out the window.

If it weren't for some of the other symptoms I think I remember you
mentioning, I would say this is just a gzip problem.

You said it worked with 1.5GB, and I see that it does too.

I realized just as I was going to sleep last night (er, this morning)
that my *.gz input file is over 2^32 bytes:

# BLOCKSIZE=1048576 ls -ls
total 22703
 1111 -rw-r--r--  1 root  wheel   1163765834 Nov 27 07:06 bigger-file
13454 -rw-r--r--  1 root  wheel  14103459216 Nov 27 07:14 biggest-file
 4548 -rw-r--r--  1 root  wheel   4767516559 Nov 27 07:50 biggest-file.gz
   12 -rw-r--r--  1 root  wheel     11522434 Nov 27 06:59 little-file
 3581 -rw-r--r--  1 root  wheel   3753902080 Oct  4 22:57 test.zero.dd
I'm betting that gzip is unable to read input from files more than 4GB
in size. Indeed I seem to remember hearing rumours about this kind of
problem before too.
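
For anyone who wants to check the arithmetic, here is a rough sketch that tests the sizes from the listing above against the 2^32 boundary. The sizes are copied from the ls output; the helper name over_32bit is made up for illustration and has nothing to do with gzip itself.

```python
# Sizes in bytes, copied from the "ls -ls" output above.
SIZES = {
    "bigger-file": 1163765834,
    "biggest-file": 14103459216,
    "biggest-file.gz": 4767516559,
    "little-file": 11522434,
    "test.zero.dd": 3753902080,
}

LIMIT_32 = 2 ** 32  # 4294967296: first value an unsigned 32-bit count cannot hold


def over_32bit(sizes):
    """Return the names whose size does not fit in an unsigned 32-bit count."""
    return sorted(name for name, size in sizes.items() if size >= LIMIT_32)


if __name__ == "__main__":
    for name in over_32bit(SIZES):
        # Show what a 32-bit counter would hold after wrapping around.
        print(name, SIZES[name], "wraps to", SIZES[name] % LIMIT_32)
```

Both the compressed input and the original uncompressed file come out over the limit, so the listing alone does not say which of the two counts is the one going wrong.
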
(note first that my gzip is from the 1.6.x userland -- I suppose I
should test Mathew's version just to be sure it works though....)
Ah yes, here's an example:
http://lists.freebsd.org/pipermail/freebsd-questions/2004-February/037428.html
Ah ha! Indeed there's a relevant change in the FreeBSD gzip sources:
in unzip.c:
----------------------------
revision 1.7
date: 2004/05/02 02:54:37; author: tjr; state: Exp; lines: +14 -10
Apply patch from gzip web page to correctly decompress files larger than
4GB on architectures with 64-bit long integers.
----------------------------
and this one in gzip.h also seems related:
----------------------------
revision 1.4
date: 2004/05/02 23:07:49; author: obrien; state: Exp; lines: +6 -3
Gzip assumes 'unsigned long' is 32-bits wide and depends on this.
One thing Gzip does is implicitly by store the size of a file into an
'unsigned long' rather than explicitly compute the remainder modulo 2^32
(see RFC 1952 section 2.3.1 "ISIZE"). Thus an extracted file size is
does not equal the original size (mod 2^32) for files larger than 4GB.
This manifests itself in errors such as:
zcat: bigfile.gz: invalid compressed data--length error
PR: 66008, 66009
Submitted by: Peter Losher <Peter_Losher@isc.org>
Patch by: tjr
----------------------------
There's another possibly relevant fix in their inflate.c:1.9 too.....
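
The length check described in the gzip.h log message above can be modelled roughly as follows. This is only a sketch of the idea, not the gzip code: the function names are invented, and it assumes the decompressor counts output bytes in a type wider than 32 bits (as a 64-bit 'unsigned long' would be) while the ISIZE field in the trailer is the original size modulo 2^32, stored little-endian in the last four bytes of the member (RFC 1952).

```python
import gzip
import struct

MOD_32 = 2 ** 32


def trailer_isize(gz_bytes):
    """Read ISIZE: the last 4 bytes of a single gzip member, little-endian."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


def length_ok_unmasked(bytes_out, isize):
    """Hypothetical check that compares a wide byte count directly to ISIZE."""
    return bytes_out == isize


def length_ok_masked(bytes_out, isize):
    """Hypothetical check that first reduces the count modulo 2^32."""
    return (bytes_out % MOD_32) == isize


if __name__ == "__main__":
    # A small member: ISIZE is simply the uncompressed length.
    small = gzip.compress(b"x" * 1000)
    print("small ISIZE:", trailer_isize(small))

    # The size of biggest-file from the listing, and the ISIZE that a
    # correct compressor would store for it.
    original = 14103459216
    stored = original % MOD_32
    print("unmasked check passes:", length_ok_unmasked(original, stored))
    print("masked check passes:  ", length_ok_masked(original, stored))
```

With a 32-bit counter the wrap-around happens for free and the two checks agree; with a wider counter only the masked comparison matches the stored ISIZE once the file is over 4GB, which would fit the "length error" message.
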

If I make a copy of "biggest-file" and then compare the copy and the
original with "cmp", I expect them to be reported as identical.
Unfortunately the damn system hung solid and I had to hit the virtual
halt button via the remote management controller in order to dump it
into the debugger. It looks like this might be some kind of SMP
deadlock, but since this is not a LOCKDEBUG kernel it's harder for me to
tell what's wrong. (LOCKDEBUG is antithetical to ethernet driver
throughput benchmarking :-)

Note that this was running a recent -current kernel, but one without
some of the recent not-yet-committed SMP fixes for alpha, so I can well
believe that heavy I/O through the filesystem would cause exactly this
kind of lockup, just as it used to on my as4000 running 1.6.x before I
applied those fixes.

[console]<@> # cp biggest-file biggest-file.copy
^[^[rmc
RMC>halt in
Returning to COM port
halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted
halt code = 1
operator initiated halt
PC = fffffc0000451f38
P00>>>^[^[rmc
RMC>
RMC>halt out
Returning to COM port
P00>>>cont
continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 446.1 (cp) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> where
No such command
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x208
console_restart() at netbsd:console_restart+0x78
XentRestart() at netbsd:XentRestart+0x90
--- console restart (from ipl 6) ---
schedclock() at netbsd:schedclock+0x88
interrupt() at netbsd:interrupt+0x1a8
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
pool_do_put() at netbsd:pool_do_put+0x70
pool_put() at netbsd:pool_put+0x48
scsipi_put_xs() at netbsd:scsipi_put_xs+0x4c
scsipi_complete() at netbsd:scsipi_complete+0x1c8
scsipi_done() at netbsd:scsipi_done+0x210
isp_done() at netbsd:isp_done+0xc4
isp_intr() at netbsd:isp_intr+0x54c
isp_pci_intr() at netbsd:isp_pci_intr+0x90
alpha_shared_intr_dispatch() at netbsd:alpha_shared_intr_dispatch+0x6c
dec_6600_iointr() at netbsd:dec_6600_iointr+0x4c
interrupt() at netbsd:interrupt+0x334
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x190
pmap_remove_mapping() at netbsd:pmap_remove_mapping+0x13c
pmap_do_remove() at netbsd:pmap_do_remove+0x4fc
pmap_remove() at netbsd:pmap_remove+0x1c
ubc_alloc() at netbsd:ubc_alloc+0x48c
ffs_read() at netbsd:ffs_read+0x39c
vn_read() at netbsd:vn_read+0x128
dofileread() at netbsd:dofileread+0xb4
sys_read() at netbsd:sys_read+0xac
syscall_plain() at netbsd:syscall_plain+0xdc
XentSys() at netbsd:XentSys+0x60
--- syscall (3) ---
--- user mode ---
db{0}>
--
Greg A. Woods
+1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>