tech-kern archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
raidframe oddity, take three (kernel panic!)
>>>> [RAIDframe parity rebuild finished at 87%]
>>> Any ideas [...]
>> [...]
> But never mind; I ran raidctl -s on it and saw something unexpected.
> It turns out one of the underlying disks threw an I/O error, [...]
Now, I'm having trouble rebuilding. (Context: 4.0 i386; the disks are
931G - 1 disk-maker TB - SATA, on a 12-port twe configured JBOD.)
I did some tests, which convinced me the underlying disk had problems.
The RAID 5 was built atop nine RAID1s each with only one member.
So I replaced the failed drive and reconfigured the corresponding
RAID1. raid0, of course, still thinks raid10e is sick and is running
degraded. So I did "raidctl -R /dev/raid10e raid0" and boom!
raid0: initiating in-place reconstruction on column 6
panic: malloc: out of space in kmem_map
This has now happened four times, each time instantly upon my running
raidctl -R, so I am convinced there is a direct causal relation between
the reconstruction start and the panic. The stack trace according to
ddb (scribbled down, not cut-and-pasted) goes panic, free,
rf_MakeReconMap, rf_MakeReconControl, rf_ContinueReconstructFailedDisk,
rf_ReconstructInPlace, rf_ReconstructInPlaceThread.
Obviously, RAID is of minimal value if it's impossible to reconstruct
onto a failed-and-replaced drive. Is this a bug, is it a kernel config
parameter I need to tweak, is it just effectively impossible to use
RAIDframe RAID5 for a RAID this big (7.27+TB) with as little RAM as the
machine has (1G), should I lose the "RAID 5 atop RAID 1" trick, what?
I have the kernel coredump and the corresponding kernel for the last
crash; the kernel was configured with `makeoptions DEBUG="-g"', so it
has debugging symbols, and I've saved the netbsd.gdb that goes with
that coredump.
raidctl -G on the RAID5 says
# raidctl config file for /dev/rraid0d
START array
# numRow numCol numSpare
1 9 0
START disks
/dev/raid4e
/dev/raid5e
/dev/raid6e
/dev/raid7e
/dev/raid8e
/dev/raid9e
/dev/raid10e
/dev/raid11e
/dev/raid12e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
16 1 1 5
START queue
fifo 100
All the disklabels in question (on the drives and on raid4 through
raid12) have just one partition, of type RAID, which covers all but a
tiny sliver at the beginning of the disk. (On the real disks, the
offset is 128 sectors; on raid4 through raid12, it's 64 sectors.)
Since the RAID does not yet have any real data in it, I'm just wiping
it and doing a parity reinit (raidctl -i), on the assumption that the
problem with raidctl -R is something relatively easy to fix. (And even
if it's not, starting a parity reinit doesn't _hurt_ anything.)
/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mouse%rodents-montreal.org@localhost
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Home |
Main Index |
Thread Index |
Old Index