tech-kern: Re: PR 32535

Subject: Re: PR 32535
To: SODA Noriyuki <soda@sra.co.jp>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 10/24/2006 11:13:49

On Tue, Oct 24, 2006 at 07:06:08PM +0900, SODA Noriyuki wrote:
> >>>>> On Mon, 23 Oct 2006 20:29:12 -0700,
>       Bill Studenmund <wrstuden@NetBSD.org> said:
>
> > I have a proposed fix for PR 32535, and I'd like other folks to look it
> > over.
>
> There is another PR about vnlock with nullfs, kern/32409.
> And it seems this patch didn't fix that case, at least.
> The vnlock problem happened on the machine with this fix.

Ok.

Unfortunately I can't tell much from the PR so far.

It looks like something's going wrong with a vnode while the vnode lock is
held. Then directory lookup & such piles up on the locks, and we race for
root. My guess is that 0x70e91e50, the vnode most processes are piled up
on, is the root vnode.

How easy is this to reproduce? Is this what's taking ftp.n.o down?

I wish we had gdb. I have a script that is supposed to walk a vnode chain.
So you could point it at that vnode, it would find the process owning it,
and see what it's sleeping on. And so on until we find a vnode owned by a
process not sleeping on a vnode. _That_ is the source of the problem.
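
The walk described above can be sketched roughly as follows. This is not
the gdb script mentioned in the message, only a small stand-alone model of
the same idea. The Proc and Vnode records, their field names, the pids,
the process names, and every address other than 0x70e91e50 are made up for
illustration; real kernel structures look different.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Proc:
    pid: int
    name: str
    wchan: Optional[int]  # address the process sleeps on; None if not sleeping


@dataclass
class Vnode:
    addr: int
    lock_holder: Optional[int]  # pid holding the vnode lock; None if unlocked


def walk_lock_chain(
    start: int, vnodes: Dict[int, Vnode], procs: Dict[int, Proc]
) -> Tuple[List[Tuple[int, int]], Optional[Proc]]:
    """Follow vnode -> lock holder -> wait channel, starting at `start`.

    Returns (chain, culprit). `chain` lists (vnode address, holder pid)
    pairs in the order visited. `culprit` is the first holder that is not
    sleeping on a vnode, or None if the chain ends at an unlocked vnode,
    an unknown holder, or loops back on itself.
    """
    chain: List[Tuple[int, int]] = []
    seen = set()
    addr = start
    while addr in vnodes and addr not in seen:
        seen.add(addr)
        holder_pid = vnodes[addr].lock_holder
        if holder_pid is None:
            return chain, None  # nobody holds this lock
        holder = procs.get(holder_pid)
        if holder is None:
            return chain, None  # holder not in the process table
        chain.append((addr, holder_pid))
        if holder.wchan is None or holder.wchan not in vnodes:
            return chain, holder  # holder is not waiting on a vnode
        addr = holder.wchan  # holder waits on another vnode: keep walking
    return chain, None  # start is not a vnode, or the chain is a cycle


# Hypothetical snapshot: only 0x70e91e50 comes from the PR discussion.
vnodes = {
    0x70E91E50: Vnode(0x70E91E50, lock_holder=101),
    0x70E92000: Vnode(0x70E92000, lock_holder=102),
}
procs = {
    101: Proc(101, "proc-a", wchan=0x70E92000),  # sleeps on the second vnode
    102: Proc(102, "proc-b", wchan=0x12345678),  # sleeps on a non-vnode channel
}

chain, culprit = walk_lock_chain(0x70E91E50, vnodes, procs)
for vaddr, pid in chain:
    print(f"vnode {vaddr:#x} locked by pid {pid}")
if culprit is not None:
    print(f"end of chain: pid {culprit.pid} ({culprit.name})")
```

The cycle check is there because a chain of lock holders can loop back on
itself in a true deadlock, in which case there is no single process at the
end to blame.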

The other option I see is we can extend ddb's print routines so that it
will print the process that holds the lock on a vnode. Then you can dump a
vnode, see a proc, look at that proc's wait channel, and iterate.

Take care,

Bill
