Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 03/13/2003 07:53:58
Brian Buhrow writes:
> 	Hello Greg.  Here's another data point.  I believe I've found another
> condition which can cause a similar hang.  Because of the panics I've been
> getting, and writing about on another thread on the list, I've been running
> the parity checker a lot.  This evening, when the machine panicked, it
> rebooted, and began to run normally, but after just a few minutes, it hung,
> just like when paging was enabled to the raid 5 device.  There was no
> paging to the raid 5 device, however, so I wondered what it might be. 

Where were you paging to at the time of the hang?

> Then
> I remembered that when this machine starts up, it runs bind, which fires up
> about 200 zone transfers for domains I secondary.  So, I suspect that the
> combination of creating, modifying and deleting a large number of small
> files while the parity checker is running can lead to the same kind of
> starvation condition.

Hmm...  I don't suppose you could run a kgdb-enabled kernel, hook up another
machine to it, and then see what the processes are waiting on at the time
of the hang?
 
> My setup:
> /dev/rraid0 with 11 partitions, 5 of them mounted simultaneously.
> Softdep is disabled on all filesystems.  The raid is a 3-drive raid5 set.

Can you send me:
 a) a copy of 'cat /var/run/dmesg.boot'
 b) a copy of the output of 'raidctl -s' for each of the RAID sets 
(or the corresponding raid*.conf files)
 
> Guess on how to repeat:
> 
> 1.  Write a script which creates a new file, puts a few hundred bytes in
> it, renames it, and then deletes it.
> 
> 2.  Start the parity checker -- I don't know how to force a check if one
> isn't needed, but I bet there's a way. :)

'raidctl -i' will do that :)
 
> 3.  Run about 20 instances of your script, possibly more.  I've not counted
> the number of named-xfer's going on at once on this machine, but I believe
> it's more than 20, less than 100.

This is on your 128MB machine, right?  I wonder how much kernel memory is 
being used up by network buffers in this case...

> 	My guess is that, before long, you'll get a hang.

That wouldn't be good :(  
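
For anyone wanting to try this, steps 1 and 3 above might look roughly
like the following.  This is only a sketch, written in Python purely for
illustration: the file names, the 300-byte payload, and the worker and
iteration counts are all assumptions, not anything taken from the actual
machine.  churn() and run_workers() are hypothetical helper names.

```python
import os
from multiprocessing import Process

# "A few hundred bytes"; the exact size is an assumption.
PAYLOAD = b"x" * 300


def churn(directory, iterations):
    """Create, fill, rename, and delete one small file per iteration."""
    pid = os.getpid()
    for i in range(iterations):
        first = os.path.join(directory, "churn.%d.%d.tmp" % (pid, i))
        second = os.path.join(directory, "churn.%d.%d.dat" % (pid, i))
        with open(first, "wb") as f:
            f.write(PAYLOAD)
        os.rename(first, second)
        os.unlink(second)
    return iterations


def run_workers(directory, workers=20, iterations=1000):
    """Start `workers` processes, each running churn(), and wait for all."""
    procs = [Process(target=churn, args=(directory, iterations))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

Point run_workers() at a directory on one of the mounted RAID 5
filesystems while the parity rewrite is in progress.  Whether 20 workers
is enough to trigger the hang is a guess, as in the original report.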

Later...

Greg Oster