Subject: Re: problems with nmbcluster (?)
To: Stephen Jones <smj@cirr.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-net
Date: 01/11/2007 12:35:21
On Wed, Jan 10, 2007 at 05:35:09PM -0800, Stephen Jones wrote:
> Manuel -
> 
> Why is there such black magic to this?  Is this something that could
> be handled more gracefully with kernel warnings prior to actually
> hanging?

I think it will print messages about it. Also, it usually doesn't hang,
but recovers from this situation.

> Could it be set to increase (or decrease) dynamically?

This is just a limit on a memory pool. We could remove the limit, but then
this would make DoS attacks easier. If it's properly tuned for the system's
usage it should be safe. The default value is fine for most uses; I usually
needed to tune it only on systems with a lot of outgoing connections.

> 
> Nearly all the NetBSD crashes I experience are related to this, or so
> I am told, and over the years I've never gotten it figured out.  I've
> cited this as a 'vnlock deadlock' issue, but that's just a symptom.
> The real issue is resource starvation .. but is NMBCLUSTER a spectre
> or the real ghost?
> 
> One of the big problems is that you might not even get a clue before
> a system hangs.  So for me, I see about 18-24 days of uptime prior to
> the inevitable silent hang.  No warning, no panic .. just a hang on
> the NFS server which causes all of the clients to cascade vnlock
> deadlocks.
> 
> Just a few days ago I had a fortunate clue.  I awoke to my phone
> beeping at me telling me of a problem, and when I got to the console
> I was able to break to a debugger and kill init to get the NFS server
> to drop to single user mode.  I was being patient, hoping that it
> would eventually recover and give me a shell so I could bring it back
> up, when:
> 
> mclpool limit reached: increase NMBCLUSTERS
> 
> spewed down the screen 50 or so times.  Finally, a real clue and
> confirmation!  So what's the history of this?
> 
> I tried 8192, 16k, 24k, 32k, 64k .. now I'm at 92k, yet still .. I
> need to increase NMBCLUSTERS.

Ouch, there's a problem here. With that many NMBCLUSTERS it's possible
that you're running into other limits, depending on how much RAM
your system has (92k NMBCLUSTERS is 46MB RAM, non-pageable).

I suspect you're experiencing an mbuf leak here. To help debug this,
please rebuild a kernel with
	options MBUFTRACE
and provide the outputs of
	netstat -m
	netstat -n
	vmstat -m

after a few days of use (or, better, once the network is hung). You can also
get a core dump from the kernel once the limit is reached: reboot -d,
or enter ddb and type reboot(0x104)
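
To make collecting the reports less error-prone, something like the
following sketch could be used. The function name collect_mbuf_stats and
the report path in the usage line are made up for illustration; only the
netstat and vmstat invocations are the ones requested above.

```shell
#!/bin/sh
# Hypothetical helper, not part of NetBSD: gather the three requested
# reports into a single file that can be attached to a mail.

collect_mbuf_stats() {
    out="$1"
    : > "$out"    # create or truncate the report file
    for cmd in "netstat -m" "netstat -n" "vmstat -m"; do
        printf '=== %s ===\n' "$cmd" >> "$out"
        # $cmd is deliberately unquoted so it splits into command and flag.
        # Keep going if a command fails, so that a partly hung system
        # still yields whatever output is available.
        $cmd >> "$out" 2>&1 || printf '(command failed: %s)\n' "$cmd" >> "$out"
    done
}

# Example use (the path is an assumption, pick any writable location):
#   collect_mbuf_stats /var/tmp/mbuf-report.txt
```

Running it once shortly after boot and again after a few days should
give two reports that can be compared side by side.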

-- 
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
     NetBSD: 26 ans d'experience feront toujours la difference
--