netbsd-users: Re: Nfs clients get frozen when NFS server crashes...

Subject: Re: Nfs clients get frozen when NFS server crashes...
To: None <netbsd-users@netbsd.org, tech-kern@netbsd.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: netbsd-users
Date: 10/04/2000 08:27:29
    Date:        Tue, 3 Oct 2000 17:05:28 +0000
    From:        Sam <sam@epita.fr>
    Message-ID:  <20001003170528.A5555@epita.fr>

  | When this happens all the netbsd clients that try to access the exported
  | file systems which are not accessible, get stuck, and sometimes get
  | completely frozen.

I see this as well.

  | Here is how they are mounted:

This is my standard mount ...

munkora-mmlab:/local/tech /local/tech nfs rw,nosuid,-b,-P,-c,-i,-s

which (aside from nodev which certainly isn't relevant, but I should add)
translates into much the same as you are using.
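
For anyone not used to the short flags, here is a rough sketch that spells out
what each entry in that option field means. The flag descriptions are my
reading of the 4.4BSD-derived mount_nfs(8) and should be checked against the
man page of the release in use; the names FLAG_MEANINGS and explain_options
are made up for this illustration.

```python
# Meanings of the short mount_nfs flags used in the fstab line above.
# Assumption: descriptions follow the 4.4BSD-derived mount_nfs(8); verify
# them against the manual page of the release you actually run.
FLAG_MEANINGS = {
    "-b": "retry the mount in the background if the first attempt fails",
    "-P": "use a reserved (privileged) source port",
    "-c": "do not connect(2) the UDP socket",
    "-i": "intr: a signal can interrupt a blocked NFS call",
    "-s": "soft: calls fail after the retry limit instead of retrying forever",
}


def explain_options(field):
    """Split a comma-separated fstab option field and describe each entry."""
    described = []
    for opt in field.split(","):
        # Anything not in the table (rw, nosuid, ...) is a generic mount option.
        described.append((opt, FLAG_MEANINGS.get(opt, "generic mount option")))
    return described


if __name__ == "__main__":
    for opt, meaning in explain_options("rw,nosuid,-b,-P,-c,-i,-s"):
        print(f"{opt:8} {meaning}")
```

The two entries of interest for this thread are -i and -s, i.e. the
"soft,intr" combination.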

  | We then have to reboot the 500 clients each time the servers crash...
  | It's really annoying and shouldn't happen.

No, it shouldn't.   And it is annoying.   And it has been this way in
NetBSD forever (well, since 1.3 or before).

It is not impossible that the combination of "soft,intr" (-i,-s in my case)
is what is causing the problem - I can't say that I have seen this problem
on the odd mount that is using regular hard mounts (I don't think I have any
that are just soft or intr alone).

This has been on my "I absolutely am going to debug this problem" list for
about 2 years now ... I wanted to use this to get some kgdb experience,
actually figure out just what the processes were doing, but then I discovered
that kgdb doesn't work on sparc (it is possible to get into the debugger,
but never to get out again - some register is being clobbered by the low
level debugger interface).  At that point my efforts stalled...

Manuel - these are certainly all UDP mounts.  I think the clients simply
stop sending to the server (it is a long time ago now, but I am pretty sure
I have monitored the net and seen nothing).

Rick - it isn't just impatience - I have had clients hung for weeks in this
state (we have labs full of NetBSD clients for student use - there are times
during the year when there is little usage, if they're hung, they can easily
stay that way for a month or so with no-one noticing - that has happened).

Frank - Sam sent ps output.  I often can't - we tend to have so many
cron-based nfs accesses that by the time anyone notices this is happening the
process table on the clients is full, and it isn't possible to do anything
at all (in particular, to log in...).

Even though (as Sam showed) the processes are nominally in 'D' state, and
hence should be unkillable, I have found that a "kill -9" will often kill
them, if I'm lucky enough to get to them before the system wedges completely.
It is also the case that once processes get into this state, any more NFS to
the server (perhaps any server, I don't know) hangs as well, even if the server
has previously recovered.   That is, something gets exhausted on the
client (perhaps nfsiod processes), so everything after gets blocked as well.
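
If you can still get a shell, something like the sketch below would pick the
'D' state processes out of ps output so they can be looked at (or sent a
"kill -9") before the process table fills. It assumes output in the shape of
"ps -ax -o pid,stat,command"; the SAMPLE text and the function name stuck_pids
are invented for illustration and are not real output from our clients.

```python
# Hypothetical ps output, shaped like "ps -ax -o pid,stat,command".
SAMPLE = """\
  PID STAT COMMAND
    1 Is   init
  212 D    cron
  340 D+   ls /local/tech
  355 S    sh
"""


def stuck_pids(ps_text):
    """Return the PIDs whose STAT column starts with 'D' (disk wait)."""
    pids = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)       # pid, stat, rest of the command
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].startswith("D"):
            pids.append(int(fields[0]))
    return pids


if __name__ == "__main__":
    print(stuck_pids(SAMPLE))
```

This only lists candidates; whether a signal actually gets rid of them is, as
noted above, a matter of luck and timing.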

Our NFS servers have become much more reliable in the past year or so (no,
not NetBSD) so this is less of an issue now than it used to be.    But it is
still on my "I will debug this unless someone else fixes it first" list,
I just need to find a group of systems I can use where kgdb works (I assume
it does work in i386 NetBSD?   ... if so, I will try a setup using a group
of them one day).

kre