Subject: Re: 3.1_STABLE and SMP
To: NetBSD/i386 Discussion List <port-i386@NetBSD.org>
From: Stephen Borrill <netbsd@precedence.co.uk>
List: port-i386
Date: 04/20/2007 12:46:53
On Fri, 30 Mar 2007, Greg A. Woods wrote:
> At Fri, 30 Mar 2007 13:46:21 -0700, Brian Buhrow wrote:
> Subject: Re: 3.1_STABLE and SMP
>>
>> 	Hello.  I've found the NetBSD-3 branch to be stable on a number of MP
>> based systems, running in MP mode.  However, there is one case where I can
>> reliably cause the behavior you describe on any hardware with more than one
>> processor running in MP mode.  If you run sendmail on the box, and sendmail
>> is doing any amount of work at all, it will hang hard at some point,
>> whether it be a few hours, or a couple of days.  I'm not sure what it is
>> about sendmail that hangs the box, but it's the only scenario I've been
>> able to introduce on an MP based system which has this effect.  I'm sure
>> it's not sendmail which is the problem, but whatever it does that causes
>> it to be a symptom seems to be fairly unique in the environments I run MP
>> based systems in.

Interesting. Usually the servers in question run a proxy, NAT and 
sendmail and so any hang kills Internet access for the customers, so they 
tend to want to reboot rather than help us investigate. However, we've 
just installed one which is running sendmail and webmail only at a 
friendly customer's site so hopefully we'll be able to look into it more.

> Do you have a DDB stack backtrace, assuming you can get into DDB from
> the hung state?

I got emailed a screenshot of one:
http://projects.precedence.co.uk/netbsd/ddb1.jpg

(Sorry it's 364K, but it was easier than transcribing it).

> Could you build your kernel with LOCKDEBUG defined too?  (I prefer
> building all of userland with LOCKDEBUG defined too, just for those few
> kmem grovelers that need to know how LOCKDEBUG affects kernel
> structures, but if the problem is easy to reproduce then you should only
> need to run with the LOCKDEBUG kernel for a short while)

I've not done that yet, but I will do. What extra debug info does this 
give? i.e. what do I tell the customer to do next time it crashes when 
they are running a LOCKDEBUG kernel?
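
For the record, my understanding is that the kernel config additions 
would be roughly as below. This is only a sketch: the file name 
MYKERNEL.MP is made up, and I am assuming DIAGNOSTIC and DDB are wanted 
alongside LOCKDEBUG.

    # sys/arch/i386/conf/MYKERNEL.MP (hypothetical name)
    include 	"arch/i386/conf/GENERIC.MP"

    options 	LOCKDEBUG	# what Greg asked for
    options 	DIAGNOSTIC	# assumption: extra kernel sanity checks
    options 	DDB		# assumption: so we can still drop into DDB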

-- 
Stephen