tech-kern archive


Re: raidframe oddity, take three (kernel panic!)

        Hello.  What version of NetBSD-4 are you running?  Are you running the
straight sources as of 4.0-release?  If so, then raidframe won't work with
such large disks.  You need the 4.0-stable release after June 1, 2008 or
so.  We ran into similar problems running with large disks with the
4.0-release tree.
You want rf_reconstruct.c or later.
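
As a rough illustration of why disk size could matter at the moment
reconstruction starts, the sketch below counts the reconstruction units on
one column, using the layout quoted further down (16 sectors per stripe
unit, 1 stripe unit per reconstruction unit) and a 931G component.  The
512-byte sector size, the reading of "931G" as 931 GiB, and above all the
idea that the kernel keeps one 4-byte (i386 pointer-sized) entry per
reconstruction unit are assumptions made for the arithmetic, not facts
stated in this thread.

```python
# Back-of-the-envelope estimate; not taken from the RAIDframe sources.
SECTOR_BYTES = 512   # assumption: 512-byte sectors
GIB = 2 ** 30        # assumption: "931G" means 931 GiB


def recon_units_per_column(disk_bytes, sect_per_su, sus_per_recon_unit,
                           sector_bytes=SECTOR_BYTES):
    """Number of reconstruction units on a single component."""
    su_bytes = sect_per_su * sector_bytes        # bytes in one stripe unit
    stripe_units = disk_bytes // su_bytes        # stripe units on the disk
    return stripe_units // sus_per_recon_unit


def table_bytes(units, entry_bytes=4):
    """Size of a hypothetical table holding one entry per unit."""
    return units * entry_bytes


units = recon_units_per_column(931 * GIB, sect_per_su=16,
                               sus_per_recon_unit=1)
size = table_bytes(units)
print(f"{units} reconstruction units, {size / 2**20:.1f} MiB of entries")
```

Under those assumptions the table alone would approach half of the
machine's 1G of RAM, which would at least be consistent with a kmem_map
exhaustion as soon as raidctl -R is run.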

On Dec 6,  1:53pm, der Mouse wrote:
} Subject: raidframe oddity, take three (kernel panic!)
} >>>> [RAIDframe parity rebuild finished at 87%]
} >>> Any ideas [...]
} >> [...]
} > But never mind; I ran raidctl -s on it and saw something unexpected.
} > It turns out one of the underlying disks threw an I/O error, [...]
} Now, I'm having trouble rebuilding.  (Context: 4.0 i386; the disks are
} 931G - 1 disk-maker TB - SATA, on a 12-port twe configured JBOD.)
} I did some tests, which convinced me the underlying disk had problems.
} The RAID 5 was built atop nine RAID1s each with only one member.
} So I replaced the failed drive and reconfigured the corresponding
} RAID1.  raid0, of course, still thinks raid10e is sick and is running
} degraded.  So I did "raidctl -R /dev/raid10e raid0" and boom!
} raid0: initiating in-place reconstruction on column 6
} panic: malloc: out of space in kmem_map
} This has now happened four times, each time instantly upon my running
} raidctl -R, so I am convinced there is a direct causal relation between
} the reconstruction start and the panic.  The stack trace according to
} ddb (scribbled down, not cut-and-pasted) goes panic, free,
} rf_MakeReconMap, rf_MakeReconControl, rf_ContinueReconstructFailedDisk,
} rf_ReconstructInPlace, rf_ReconstructInPlaceThread.
} Obviously, RAID is of minimal value if it's impossible to reconstruct
} onto a failed-and-replaced drive.  Is this a bug, is it a kernel config
} parameter I need to tweak, is it just effectively impossible to use
} RAIDframe RAID5 for a RAID this big (7.27+TB) with as little RAM as the
} machine has (1G), should I lose the "RAID 5 atop RAID 1" trick, what?
} I have the kernel coredump and the corresponding kernel for the last
} crash; the kernel was configured with `makeoptions DEBUG="-g"', so it
} has debugging symbols, and I've saved the netbsd.gdb that goes with
} that coredump.
} raidctl -G on the RAID5 says
} # raidctl config file for /dev/rraid0d
} START array
} # numRow numCol numSpare
} 1 9 0
} START disks
} /dev/raid4e
} /dev/raid5e
} /dev/raid6e
} /dev/raid7e
} /dev/raid8e
} /dev/raid9e
} /dev/raid10e
} /dev/raid11e
} /dev/raid12e
} START layout
} # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
} 16 1 1 5
} START queue
} fifo 100
} All the disklabels in question (on the drives and on raid4 through
} raid12) have just one partition, of type RAID, which covers all but a
} tiny sliver at the beginning of the disk.  (On the real disks, the
} offset is 128 sectors; on raid4 through raid12, it's 64 sectors.)
} Since the RAID does not yet have any real data in it, I'm just wiping
} it and doing a parity reinit (raidctl -i), on the assumption that the
} problem with raidctl -R is something relatively easy to fix.  (And even
} if it's not, starting a parity reinit doesn't _hurt_ anything.)
} /~\ The ASCII                           Mouse
} \ / Ribbon Campaign
}  X  Against HTML    
} / \ Email!         7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
>-- End of excerpt from der Mouse
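
The quoted figures can be cross-checked with a little arithmetic: a RAID 5
set gives up one column's worth of space to parity, so nine 931G components
should yield about eight components of usable space.  The sketch below is
only that arithmetic; the function name is made up for illustration, sizes
are taken as GiB, and the small label offsets (128 and 64 sectors) are
ignored.

```python
def raid5_usable(columns, component_size):
    """Usable capacity of a RAID 5 set, in the same unit as component_size.

    One column's worth of space goes to parity.
    """
    if columns < 3:
        # RAID 5 conventionally needs at least three components.
        raise ValueError("RAID 5 needs at least three columns")
    return (columns - 1) * component_size


# numCol = 9 from the raidctl -G output; each component is about 931 GiB.
usable_gib = raid5_usable(9, 931)
usable_tib = usable_gib / 1024
print(f"{usable_gib} GiB usable, about {usable_tib:.2f} TiB")
```

The result agrees with the "7.27+TB" quoted above.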
