raidframe oddity, take three (kernel panic!)

To: tech-kern%netbsd.org@localhost
Subject: raidframe oddity, take three (kernel panic!)
From: der Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Sat, 6 Dec 2008 13:53:33 -0500 (EST)

>>>> [RAIDframe parity rebuild finished at 87%]
>>> Any ideas [...]
>> [...]
> But never mind; I ran raidctl -s on it and saw something unexpected.
> It turns out one of the underlying disks threw an I/O error, [...]

Now, I'm having trouble rebuilding.  (Context: 4.0 i386; the disks are
931G - 1 disk-maker TB - SATA, on a 12-port twe configured JBOD.)

I did some tests, which convinced me the underlying disk had problems.
The RAID 5 was built atop nine RAID1s each with only one member.

So I replaced the failed drive and reconfigured the corresponding
RAID1.  raid0, of course, still thinks raid10e is sick and is running
degraded.  So I did "raidctl -R /dev/raid10e raid0" and boom!

raid0: initiating in-place reconstruction on column 6
panic: malloc: out of space in kmem_map

This has now happened four times, each time instantly upon my running
raidctl -R, so I am convinced there is a direct causal relation between
the reconstruction start and the panic.  The stack trace according to
ddb (scribbled down, not cut-and-pasted) goes panic, free,
rf_MakeReconMap, rf_MakeReconControl, rf_ContinueReconstructFailedDisk,
rf_ReconstructInPlace, rf_ReconstructInPlaceThread.

Obviously, RAID is of minimal value if it's impossible to reconstruct
onto a failed-and-replaced drive.  Is this a bug, is it a kernel config
parameter I need to tweak, is it just effectively impossible to use
RAIDframe RAID5 for a RAID this big (7.27+TB) with as little RAM as the
machine has (1G), should I lose the "RAID 5 atop RAID 1" trick, what?
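
For scale, the 7.27+TB figure can be checked with a little arithmetic. The sketch below assumes each component is roughly one disk-maker TB (10^12 bytes) and that RAID 5 spends one component's worth of space on parity. The function names are illustrative only, and the small disklabel offsets are ignored.

```python
# Back-of-the-envelope check of the array size quoted above.
# Assumptions: 9 components of ~10**12 bytes each; offsets ignored.

COMPONENTS = 9                 # nine single-member RAID1s under the RAID 5
COMPONENT_BYTES = 10**12       # "1 disk-maker TB"
GIB = 2**30
TIB = 2**40


def component_gib(component_bytes: int = COMPONENT_BYTES) -> float:
    """Size of one component in binary gigabytes."""
    return component_bytes / GIB


def raid5_usable_tib(components: int = COMPONENTS,
                     component_bytes: int = COMPONENT_BYTES) -> float:
    """Usable RAID 5 space: all but one component's worth holds data."""
    return (components - 1) * component_bytes / TIB


if __name__ == "__main__":
    print(f"one component: {component_gib():.0f} GiB")
    print(f"usable RAID 5: {raid5_usable_tib():.2f} TiB")
```

This reproduces the 931G per-disk figure and comes out a little above 7.27 TiB, consistent with the "7.27+TB" in the text.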

I have the kernel coredump and the corresponding kernel for the last
crash; the kernel was configured with `makeoptions DEBUG="-g"', so it
has debugging symbols, and I've saved the netbsd.gdb that goes with
that coredump.

raidctl -G on the RAID5 says

# raidctl config file for /dev/rraid0d

START array
# numRow numCol numSpare
1 9 0

START disks
/dev/raid4e
/dev/raid5e
/dev/raid6e
/dev/raid7e
/dev/raid8e
/dev/raid9e
/dev/raid10e
/dev/raid11e
/dev/raid12e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
16 1 1 5

START queue
fifo 100
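
The layout numbers above imply a fairly fine-grained geometry, which may matter for the out-of-space panic. The sketch below is only an estimate under loud assumptions: 512-byte sectors, components of about 10^12 bytes, and, hypothetically, one pointer-sized bookkeeping entry per reconstruction unit. Nothing in this message confirms that RAIDframe allocates that way; the helper names are invented for illustration.

```python
# Geometry implied by the raidctl -G output, under stated assumptions.

SECTOR_BYTES = 512          # assumption: 512-byte sectors
SECT_PER_SU = 16            # sectPerSU from the layout section
SUS_PER_RECON_UNIT = 1      # SUsPerReconUnit from the layout section
NUM_COL = 9                 # numCol from the array section
COMPONENT_BYTES = 10**12    # assumption: ~1 disk-maker TB per component
POINTER_BYTES = 4           # pointer size on i386


def stripe_unit_bytes() -> int:
    """Bytes in one stripe unit on one component."""
    return SECT_PER_SU * SECTOR_BYTES


def full_stripe_data_bytes() -> int:
    """Data bytes in one full stripe (one column's worth is parity)."""
    return (NUM_COL - 1) * stripe_unit_bytes()


def recon_units_per_component() -> int:
    """Number of reconstruction units on one component."""
    sectors = COMPONENT_BYTES // SECTOR_BYTES
    return sectors // (SECT_PER_SU * SUS_PER_RECON_UNIT)


def hypothetical_table_bytes() -> int:
    """Size of a table with one pointer per reconstruction unit.

    Hypothetical: assumes such a table exists at all.
    """
    return recon_units_per_component() * POINTER_BYTES


if __name__ == "__main__":
    print("stripe unit:", stripe_unit_bytes(), "bytes")
    print("full-stripe data:", full_stripe_data_bytes(), "bytes")
    print("recon units per component:", recon_units_per_component())
    print("hypothetical table:", hypothetical_table_bytes() / 2**20, "MiB")
```

If the per-unit assumption held, the table alone would run to several hundred megabytes on a machine with 1G of RAM; the saved coredump would be the place to check whether that is what actually happened.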

All the disklabels in question (on the drives and on raid4 through
raid12) have just one partition, of type RAID, which covers all but a
tiny sliver at the beginning of the disk.  (On the real disks, the
offset is 128 sectors; on raid4 through raid12, it's 64 sectors.)

Since the RAID does not yet have any real data in it, I'm just wiping
it and doing a parity reinit (raidctl -i), on the assumption that the
problem with raidctl -R is something relatively easy to fix.  (And even
if it's not, starting a parity reinit doesn't _hurt_ anything.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

References:
- raidframe oddity
  - From: der Mouse
- Re: raidframe oddity
  - From: Matthias Scheler
- Re: raidframe oddity
  - From: der Mouse
