Subject: Re: vnode usage and reclaimation - feels like deadlocking
To: Stephen M. Jones <smj@cirr.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: netbsd-users
Date: 01/20/2004 18:48:06

On Fri, Jan 16, 2004 at 06:14:29PM -0600, Stephen M. Jones wrote:
> I've experienced vnlock deadlockish behaviour twice today since increase
> kern.maxvnodes to ~25% (250000) of system memory (1GB).  Both clients
> locked up just about the same time, although only one had few complaints
> about the fileserver not responding.  The interesting thing is that
> one had 77122 vnodes used while the other had about 65000 .. still, there
> was much delay from vnlocks, so much that the both clients had to be
> dropped to the debugger which showed a majority of the processes (under
> 300 processes total) in a 'vnlock' state.  This particular lock is
> initialised on line 537 or vfs_subr.c:

Knowing which lock it is isn't good enough. _All_ vnode locks go through that
path. :-)

>         vp->v_type = VNON;
>         vp->v_vnlock = &vp->v_lock;
>         lockinit(vp->v_vnlock, PVFS, "vnlock", 0, 0);
>         cache_purge(vp);
>         vp->v_tag = tag;
>         vp->v_op = vops;
>         insmntque(vp, mp);
>         *vpp = vp;
>         vp->v_usecount = 1;
>         vp->v_data = 0;
>         simple_lock_init(&vp->v_uobj.vmobjlock);
>
> I was able to get crash dumps of both clients and have since rebooted
> them.  (anyone have a software watchdog that would crash dump a system
> when it hangs like this?)

I think you ran into one of two scenarios. Or perhaps both. You said you
have about 300 processes in that state. Your real problem is that the bug
involves at most one of them. :-|

The key problem is that you ran into a deadlock situation, and one of the
deadlockees had a vnode locked while this was going on. All the other
processes are, one way or another, blocked waiting for that process to
release the vnode lock. One scenario is that you have a web server, and
one of the files being served (a file already open in the server)
deadlocked. The other threads that then try to read said file (read(2),
pread(2), etc.) have to grab the vnode lock, so they block as well.

A second scenario is that after the initial deadlock, some other process
tried to do a name lookup (i.e. an open(2)) on the deadlocked file. That
will then lock the parent directory. The next lookup of any file in that
directory will lock the grandparent directory. The third will lock the
great-grandparent. This process will continue until the root vnode is
locked, and all new name lookups will deadlock. The system's really wedged
at that point. If you're seeing different processes have issues, chances
are this has happened.

You could have both things happening at once.
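
To make the second scenario concrete, here is a rough userland model of
the cascade. It is only a sketch under stated assumptions: model_lookup(),
the owner[] array and the FREE marker are made up for illustration and are
not kernel interfaces. It assumes a lookup takes the child's lock before
it drops the parent's, and that a lookup which finds the next component
locked goes to sleep still holding the lock on the component above it.

```c
#include <assert.h>
#include <stddef.h>

#define FREE (-1)	/* hypothetical marker: nobody holds this lock */

/*
 * owner[i] is the pid holding the lock on path component i
 * (0 is the root, depth - 1 is the file itself), or FREE.
 *
 * Returns the index of the component this lookup gets stuck on,
 * or -1 if the lookup runs to completion.
 */
int
model_lookup(int *owner, size_t depth, int pid)
{
	size_t i;

	assert(depth > 0);

	if (owner[0] != FREE)
		return 0;		/* root already locked: stuck, holding nothing */
	owner[0] = pid;

	for (i = 1; i < depth; i++) {
		if (owner[i] != FREE)
			return (int)i;	/* sleeps while still holding owner[i - 1] */
		owner[i] = pid;		/* got the child ... */
		owner[i - 1] = FREE;	/* ... so the parent can be released */
	}

	owner[depth - 1] = FREE;	/* finished with the file */
	return -1;
}
```

With a file three levels down (root, a, b, file) and the file's lock held
by the deadlocked process, the first lookup sticks holding b, the second
holding a, the third holding the root, and every lookup after that blocks
on the root itself.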

To track it down, I'd suggest looking at which process owns the locks the
processes are waiting for. If you're using ddb, ps/w will show you the
wait channel, which I think is the vnode lock you're alluding to above. If
you're looking at a core dump in gdb, the vnode's address will be in the
stack trace. I think lk_sleep_lockholder in the lock structure is the pid
of the lock owner. Look and see what it's waiting on.
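
Following the owners by hand amounts to walking a chain: from a stuck
process to the holder of the lock it sleeps on, then to whatever that
holder is waiting for, and so on. A minimal sketch of that walk, assuming
you have already copied the "who waits for whom" information out of the
dump into a plain array (waits_for[], NOT_WAITING and follow_lock_owners()
are hypothetical names for this example, not anything in the kernel):

```c
#include <assert.h>
#include <stddef.h>

#define NOT_WAITING (-1)	/* hypothetical: process is not asleep on a lock */

/*
 * waits_for[p] is the index of the process holding the lock that
 * process p is sleeping on, or NOT_WAITING.
 *
 * Returns the process at the end of the chain starting at 'start'.
 * If the chain never ends, *is_cycle is set to 1 and the returned
 * process is one member of the cycle, i.e. of the actual deadlock.
 */
int
follow_lock_owners(const int *waits_for, size_t nproc, int start,
    int *is_cycle)
{
	size_t steps;
	int p = start;

	assert(start >= 0 && (size_t)start < nproc);
	*is_cycle = 0;

	for (steps = 0; steps < nproc; steps++) {
		if (waits_for[p] == NOT_WAITING)
			return p;	/* end of the chain: look at this one */
		p = waits_for[p];
	}

	/* nproc hops without reaching an end means we are going in circles. */
	*is_cycle = 1;
	return p;
}
```

If the walk ends at a process that is not waiting on a vnode lock, that
process is the one worth a closer look. If it reports a cycle, the
processes on the cycle are the ones deadlocked against each other.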

One of your emails showed a lot of cron jobs. Overall, I'd say start by
looking at the five or ten oldest processes that are stuck. Once the root
vnode is deadlocked, everything else will be too, and those later
processes won't help find the problem. So you can save yourself a lot of
grief by not trying to figure out things that won't help.
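
If you have the stuck processes in a list, picking out the oldest ones is
just a sort on start time. A small sketch, assuming you have extracted a
pid and a start time for each stuck process yourself (struct stuck_proc
and oldest_stuck() are invented for this example):

```c
#include <stdlib.h>

/* Hypothetical record for one process found asleep on "vnlock". */
struct stuck_proc {
	int	pid;
	long	start_time;	/* e.g. seconds since boot; smaller is older */
};

static int
by_start_time(const void *a, const void *b)
{
	const struct stuck_proc *pa = a;
	const struct stuck_proc *pb = b;

	/* Avoids the overflow a plain subtraction could cause. */
	return (pa->start_time > pb->start_time) -
	    (pa->start_time < pb->start_time);
}

/*
 * Sorts procs[] oldest first and returns how many entries at the
 * front of the array are worth inspecting (at most 'want').
 */
size_t
oldest_stuck(struct stuck_proc *procs, size_t nproc, size_t want)
{
	qsort(procs, nproc, sizeof(procs[0]), by_start_time);
	return want < nproc ? want : nproc;
}
```

Calling oldest_stuck(procs, nproc, 10) leaves the ten oldest stuck
processes at the front of the array, which is where I'd start reading
stack traces.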

Take care,

Bill