Subject: Re: implementation: NetBSD on AS1200s
To: Stephen M Jones <smj@cirr.com>
From: Robert Elz <kre@munnari.OZ.AU>
List: port-alpha
Date: 03/07/2002 16:31:54
    Date:        Wed, 6 Mar 2002 13:28:30 -0600 (CST)
    From:        Stephen M Jones <smj@cirr.com>
    Message-ID:  <200203061928.g26JSVf19627@egsner.cirr.com>

  | Unfortunately I've never been able to find out.  Though I'm positive that
  | its a cronjob getting upset that it can't access a file on an NFS'ed 
  | system.

Yes, they will be processes hung on client NFS.

Get rid of the -i option on your NFS mounts.  From your description of
your environment it isn't something that you'd actually want anyway,
and I am fairly sure it simply does not work on NetBSD: it causes clients
to stop retrying once an NFS server has failed to reply for any reason.
The ethernet driver problems mentioned in other messages on this thread
could easily cause that (I mostly saw it when the server would crash,
but that's a different issue).   If you can get to the clients quickly
enough when the problem happens, it is possible to kill the hung
processes, but you have to be lucky (or extremely vigilant) for this to
work.

Note that it is not only the applications doing NFS accesses at the time
of the outage that hang.   Either the kernel exhausts its NFS client
contexts, or some other resource gets locked, and no NFS accesses (to the
server in question, perhaps only to the filesystem in question; I never
saw a difference) get processed until all the hung processes are killed
(or the system reboots).   If you can kill things, though, it will all
recover OK (until the next time).

  | Simply put, that is probably my fault for not having a check in place to
  | be sure that nfs is okay before running the job.

Such a check helps: at least the system doesn't run out of processes.
But it still needs a reboot to recover after it gets into this state.
Without -i, NFS recovers all by itself, and unless your NFS server is out
for a long time, even fairly frequent cron jobs are unlikely to overflow
the process table (and even if they do, once the server recovers they
will eventually all clean themselves up, unless the lack of processes
causes deadlock elsewhere).
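
For what it's worth, a check of the kind you describe could look
something like the sketch below.  This is only an illustration, not
anything taken from your setup: the function name nfs_path_responds and
the 10 second default are made up, and it assumes Python is available on
the client.  The probe runs in a child process so that the job itself can
give up after the timeout even if the stat never returns.

```python
import subprocess
import sys


def nfs_path_responds(path, timeout_s=10.0):
    """Return True if a stat of `path` completes within timeout_s seconds.

    Hypothetical helper: the name and the default timeout are
    illustrative assumptions, not part of any existing tool.
    """
    # Probe in a child process.  A stat on a dead NFS mount can block
    # for a long time, and the caller must stay free to give up.
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat returned successfully in time.
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck inside NFS may ignore the
        # signal, so we deliberately do not wait for it afterwards.
        probe.kill()
        return False
```

The cron job would call this first and exit quietly when it returns
False.  Note that each failed probe can itself leave one stuck child
behind, so this limits the pile-up rather than eliminating it.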

I think there might be a quite old PR of mine about the NFS problems; they
certainly date back to (at least) NetBSD 1.3 vintage systems.   I suspect
that there simply aren't enough people who mount with -i (and then have NFS
outages long enough to trigger the problem) for anyone to care enough to
really look into it (for a while I tried, but nothing reached out and struck
me).

kre