tech-kern archive


Re: Problems with hangs under NetBSD-5.x

I think I'm being hit by the same problem on some of our servers
but not others (or at least much more frequently on some of our
servers than others).  Two machines are especially prone to it
(more than once a week).  One is our primary web server, which now
has a watchdog timer running to reboot it when this happens.  The
other runs rt, mysql and a few other things.

On Thu, 30 Jul 2009, Sverre Froyen wrote:
> > - Can you try to compile a latest kernel from netbsd-5 branch?
> > There is one deadlock fix in VFS layer.
> I have a new netbsd-5 kernel and userland ready and will reboot the
> server tomorrow.

The two machines mentioned above certainly have kernels that
postdate that fix, and it still happens to them.

> > - Can you get into DDB and get a backtrace of LWPs which are
> > stuck?
> I can get into DDB but I will have limited time to poke around as
> this is a production server.  Is there a specific set of data that
> would be useful.  As you can see from the ps output, many processes
> are hung (I count 57).  Surely we do not need backtraces from all
> of those.  A set of commands to run would be most helpful.

Same here: I generally have limited time to poke around, as I need
to get the machines running again, so a precise set of commands
would help.

> > - Have you tried to leave only one CPU online and see if problem
> > still occurs? See cpuctl(8) man page.
> I am reluctant to do this as it will impact the server performance.

I booted the rt box with "boot -1" after the most recent occurrence,
so we'll see if that makes a difference.

> I'm using neither at the moment.  If I see no hangs with the new
> netbsd-5 kernel, I may try WAPBL again.  With WAPBL I would get
> hangs maybe once per week.

All these boxes use WAPBL, but I think the same problem existed back
when they used softdeps (they just took longer to reboot :-()

It's just occurred to me that both of these boxes, and also the
third box most prone to the problem, are all running apache (but
they are all actually doing lots of other things too).

Unfortunately this is all a bit vague, and difficult to reproduce, 
which is one of the reasons I hadn't mentioned it until now.

