Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 03/13/2003 07:53:58
Brian Buhrow writes:
> Hello Greg. Here's another data point. I believe I've found another
> condition which can cause a similar hang. Because of the panics I've been
> getting, and writing about on another thread on the list, I've been running
> the parity checker a lot. This evening, when the machine panicked, it
> rebooted, and began to run normally, but after just a few minutes, it hung,
> just like when paging was enabled to the raid 5 device. There was no
> paging to the raid 5 device, however, so I wondered what it might be.
Where were you paging to at the time of the hang?
> Then
> I remembered that when this machine starts up, it runs bind, which fires up
> about 200 zone transfers for domains I secondary. So, I suspect that the
> combination of creating, modifying and deleting a large number of small
> files while the parity checker is running can lead to the same kind of
> starvation condition.
Hmm... I don't suppose you could run a kgdb-enabled kernel, hook up another
machine to it, and then be able to see what the processes are waiting on at
the time of the hang?
> My setup:
> /dev/rraid0 with 11 partitions, 5 of them mounted simultaneously.
> Softdep is disabled on all filesystems. The raid is a 3-drive raid5 set.
Can you send me:
a) a copy of 'cat /var/run/dmesg.boot'
b) a copy of the output of 'raidctl -s' for each of the RAID sets
(or the corresponding raid*.conf files)
> Guess on how to repeat:
>
> 1. Write a script which creates a new file, puts a few hundred bytes in
> it, renames it, and then deletes it.
>
> 2. Start the parity checker -- I don't know how to force a check if one
> isn't needed, but I bet there's a way. :)
'raidctl -i' will do that :)
> 3. Run about 20 instances of your script, possibly more. I've not counted
> the number of named-xfer's going on at once on this machine, but I believe
> it's more than 20, less than 100.
This is on your 128MB machine, right? I wonder how much kernel memory is
being used up by network buffers in this case...
> My guess, before long, you'll get a hang.
That wouldn't be good :(
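For anyone wanting to try the steps above, here is a minimal sketch of such a script in POSIX sh. It is only an illustration of the create/rename/delete loop Brian describes, not a tested reproducer: the function names (churn, run_churn), the file naming scheme, and the 300-byte file size are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical stress sketch: churn small files in a scratch directory.
# The names churn and run_churn, the w<ID>.<N> file names, and the
# 300-byte size are illustrative assumptions, not from the report.

# churn DIR ID COUNT
# COUNT times: create a small file, rename it, then delete it.
churn() {
    dir=$1
    id=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        tmp="$dir/w$id.$i.tmp"
        new="$dir/w$id.$i.dat"
        # 300 bytes stands in for "a few hundred bytes"
        dd if=/dev/zero of="$tmp" bs=300 count=1 2>/dev/null
        mv "$tmp" "$new"
        rm -f "$new"
        i=$((i + 1))
    done
}

# run_churn DIR WORKERS COUNT
# Start WORKERS background instances of churn and wait for all of them.
run_churn() {
    dir=$1
    workers=$2
    count=$3
    mkdir -p "$dir" || return 1
    w=0
    while [ "$w" -lt "$workers" ]; do
        churn "$dir" "$w" "$count" &
        w=$((w + 1))
    done
    wait
}
```

To follow the steps, one would start the parity rewrite first (for example 'raidctl -i raid0', assuming the set is raid0) and then call something like 'run_churn /path/on/the/raid/scratch 20 1000', where the scratch path and the iteration count are placeholders to adjust.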
Later...
Greg Oster