Subject: Re: NULLFS locking problem?
To: Feico Dillema <feico@pasta.cs.uit.no>
From: Bill Studenmund <wrstuden@zembu.com>
List: current-users
Date: 10/03/2000 14:33:15
On Thu, 7 Sep 2000, Feico Dillema wrote:

> For the past half year I have had problems on our server which seem to
> indicate a locking problem in NULLFS. I cannot reliably reproduce it,
> but now and then a filesystem (that is remounted with NULLFS
> elsewhere) completely locks up. And my only solution has been to
> reboot the machine when this happens (and now by getting rid of the
> NULLFS mounts). Time between manifestations of this problem ranges
> from a few hours to several weeks. We've ruled out hardware as the
> problem source, and have seen the problem on various NetBSD kernel
> versions (from -current as of end of last year to NetBSD-1.5_ALPHA and
> NetBSD-1.5_ALPHA2).

Hmmm.... I'm way behind on my EMail. :-)

What mounts were causing problems? What was under the NULL mounts?

See if you can get the filesystem locked up, and then do a ps -l. The
important thing is to see what the processes are waiting on. If they are
sleeping on a vnode lock, then the WCHAN will be vnlock. Those are the
interesting ones.
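
If the ps -l listing is long, a small filter can pick out the vnlock
sleepers. The following is only a rough sketch, not a tested tool: the
function name vnlock_waiters and the SAMPLE text are made up for
illustration (SAMPLE is invented data, not output from a real machine),
and it assumes the header row names PID and WCHAN columns and that
COMMAND is the last column.

```python
def vnlock_waiters(ps_output, wchan="vnlock"):
    """Return (pid, command) pairs for processes whose WCHAN equals `wchan`.

    Assumes the first non-blank line is a header naming the columns, and
    that the last column (COMMAND) is the only one that may contain spaces.
    """
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")
    wchan_col = header.index("WCHAN")
    waiters = []
    for line in lines[1:]:
        # Split into at most len(header) fields so the command stays whole.
        fields = line.split(None, len(header) - 1)
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        if fields[wchan_col] == wchan:
            waiters.append((int(fields[pid_col]), fields[-1]))
    return waiters


# Invented sample in the general shape of "ps -l" output (hypothetical data).
SAMPLE = """\
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
0 1 0 0 10 0 100 50 wait Is ?? 0:00.01 init
1000 312 1 0 -5 0 200 80 vnlock D ?? 0:00.10 ls -l /mnt
1000 415 1 0 -5 0 300 90 vnlock D p0 0:00.02 cat file
1000 420 1 0 28 0 300 90 - R+ p0 0:00.00 ps -l
"""

if __name__ == "__main__":
    for pid, command in vnlock_waiters(SAMPLE):
        print(pid, command)
```

To use it on a live system you would feed it the text of the ps listing
instead of SAMPLE. The column layout on your machine may differ, so check
the header line first.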

The best thing would be to build a kernel with debugging symbols (so
you'll get a netbsd.gdb), then get a core dump when the machine hangs.
From digging through that, we can find out what the errant processes were
doing.

As a second-best option, when the machine is hung, get to the console and
drop into ddb. You can do a ps there, and I think ps/w will show you all
the processes. Find which ones are in vnlock wait, and trace them.
"trace/t 0t<pid>" will show you the stack trace of process <pid>. Note
that ps shows pids in decimal, while ddb by default expects hex. The 0t
prefix makes ddb read the number as decimal.
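
To make the decimal/hex point concrete, here is a tiny sketch. The helper
names ddb_trace_command and pid_if_read_as_hex are made up for this
example; they only format and reinterpret numbers and do not talk to ddb.

```python
def ddb_trace_command(pid):
    """Build the ddb trace command for a decimal pid as printed by ps."""
    return "trace/t 0t%d" % pid


def pid_if_read_as_hex(pid):
    """Return the pid you would actually name if the 0t prefix were left
    off and the decimal digits were read as a hex number."""
    return int(str(pid), 16)


if __name__ == "__main__":
    pid = 312  # hypothetical pid taken from a ps listing
    print(ddb_trace_command(pid))
    print(pid_if_read_as_hex(pid))
```

So for a pid that ps shows as 312, leaving off the 0t would have the digits
read as hex 312, which is 786 in decimal, a different process (if one
exists at all).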

Take care,

Bill