Subject: Re: NetBSD-1.5 and NFS - any suggestions?
To: Frank van der Linden <fvdl@wasabisystems.com>
From: Artem Belevich <art@riverstonenet.com>
List: netbsd-users
Date: 01/25/2002 16:27:18
> Wait.. is this on a NetBSD *client* that is temporarily stuck?
> Is it also serving as an NFS server? Ret-Failed is the number
> of RPC replies that signal an error condition.

These machines get most of their traffic as NFS clients but they also
export some local filesystems.

It's unclear what exactly causes the system lockups. So far there is a
strong correlation between a system getting stuck and compilation on an
NFS-mounted filesystem. Lockups do not happen during the night, so they
must be related to some user activity.

> If reducing the write size doesn't help, it's hard to tell what's
> going on. Maybe you can send me a tcpdump output from when a
> hang occurs, which shows before-during-after traffic. Also,

It may take a while, but I'll get it.

> if you see some kind of a chance to run ps while it's hanging
> (maybe start up top(1) before the hiccup occurs and watch the
> window), I'd like to see the output of it. The STATE (for top)
> or WCHAN (for ps) columns are important.

I did catch the pause in top before, but it didn't give a lot of
clues. Before the pause everything was normal (most processes in
"sleep" state, 1 run, 1 onproc). Then everything gets stuck, with top
still showing a picture of a quiet upscale neighbourhood with everybody
sleeping. A 1-second refresh appears too slow to catch the glitch. :-(

I'll try running nice'd top with constant updates. Maybe that will help.
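
A rough sketch of an alternative to watching top by hand: sample ps
several times a second, log each snapshot with a timestamp, and count
how many processes sit in each wait channel. The ps keyword list, the
log path, the interval and the function names are all assumptions made
for illustration, not something tested on these machines. Keyword
names and column headers can differ between systems.

```python
import subprocess
import time
from collections import Counter

# Assumed ps invocation; adjust the keywords to whatever the local ps accepts.
PS_CMD = ["ps", "-axo", "pid,state,wchan,command"]


def summarize_wchan(ps_output):
    """Count processes per WCHAN value in ps output that has a WCHAN column."""
    lines = ps_output.strip().splitlines()
    if not lines:
        return Counter()
    header = lines[0].split()
    idx = header.index("WCHAN")  # raises ValueError if the column is missing
    counts = Counter()
    for line in lines[1:]:
        # Limit the split so a command containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        if len(fields) > idx:
            counts[fields[idx]] += 1
    return counts


def sample_loop(logfile, interval=0.2, count=50, cmd=PS_CMD):
    """Append `count` timestamped ps snapshots to `logfile`."""
    with open(logfile, "a") as log:
        for _ in range(count):
            started = time.time()
            out = subprocess.run(cmd, capture_output=True, text=True).stdout
            log.write("=== %.3f ===\n%s\n" % (started, out))
            log.flush()  # keep what we have on disk if the box stalls
            time.sleep(interval)


# Example (hypothetical log path on a local, non-NFS disk):
#   sample_loop("/var/tmp/ps-samples.log", interval=0.2, count=3000)
```

The log should go to a local disk rather than an NFS mount, otherwise
the sampler may block along with everything else. A jump in the
timestamps between consecutive snapshots also gives a rough measure of
how long a pause lasted.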

Oh! I've just got one of the systems stuck. Here's top's output (pretty
useless, though). Three nfsd daemons are close to the top, though.
Maybe pauses *are* related to the NFS server on the boxes.
OK, the box just got unstuck. This one was short - about 20 seconds.
I've seen pauses as long as 10 minutes. 

load averages:  0.12,  0.29,  0.69                                                                                                 16:07:08
60 processes:  59 sleeping, 1 on processor
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 32M Act, 12M Inact, 4060K Wired, 357M Free, 2049M Swap free

  PID USERNAME PRI NICE   SIZE   RES STATE     TIME   WCPU    CPU COMMAND
 4086 ???????    2    0  9156K   16M sleep     0:43  0.00%  0.00% xemacs-21.1.
 3276 root       2    0    48K  248K sleep     0:29  0.00%  0.00% nfsd
  248 root       2    0   644K 1508K sleep     0:19  0.00%  0.00% sshd
 3277 root       2    0    48K  248K sleep     0:03  0.00%  0.00% nfsd
 1401 root       2    0   644K 1508K sleep     0:01  0.00%  0.00% sshd
 4346 root      28    0   276K  776K onproc    0:00  0.00%  0.00% top
  209 root      18  -12   652K 3472K sleep     0:00  0.00%  0.00% ntpd
  709 root      18    0   452K  316K sleep     0:00  0.00%  0.00% csh
 4093 ???????   18    0   452K  312K sleep     0:00  0.00%  0.00% csh
  890 ????????  18    0   448K  304K sleep     0:00  0.00%  0.00% csh
 3854 ???????   18    0   436K  296K sleep     0:00  0.00%  0.00% csh
  372 ??????    10    0   656K 1036K sleep     0:00  0.00%  0.00% bash
  250 ???       10    0   620K 1008K sleep     0:00  0.00%  0.00% bash
 3927 ???       10    0   788K  928K sleep     0:00  0.00%  0.00% perl
 3213 root      10    0   524K  852K sleep     0:00  0.00%  0.00% bash
  240 root      10    0   216K  428K sleep     0:00  0.00%  0.00% cron
    1 root      10    0   312K  192K sleep     0:00  0.00%  0.00% init
 3861 ???????    3    0  1016K 1348K sleep     0:00  0.00%  0.00% tcsh
 4098 ???????    3    0   972K 1256K sleep     0:00  0.00%  0.00% tcsh
 1600 root       3    0   988K 1248K sleep     0:00  0.00%  0.00% vi
 2577 ??????     3    0   900K 1244K sleep     0:00  0.00%  0.00% tcsh
  650 ??????     3    0   684K 1064K sleep     0:00  0.00%  0.00% bash
  356 ??????     3    0   656K 1036K sleep     0:00  0.00%  0.00% bash
 4426 ??????     3    0   648K  996K sleep     0:00  0.00%  0.00% vi
 1402 ???        3    0   608K  992K sleep     0:00  0.00%  0.00% bash
  242 root       3    0    48K  436K sleep     0:00  0.00%  0.00% getty
  245 root       3    0    48K  428K sleep     0:00  0.00%  0.00% getty
  244 root       3    0    48K  428K sleep     0:00  0.00%  0.00% getty
  243 root       3    0    48K  428K sleep     0:00  0.00%  0.00% getty
  892 ????????   3    0   460K  320K sleep     0:00  0.00%  0.00% ksh
  141 root       2    0   564K  960K sleep     0:00  0.00%  0.00% rpcbind
  191 root       2    0   436K  788K sleep     0:00  0.00%  0.00% amd
  232 root       2    0   240K  728K sleep     0:00  0.00%  0.00% sshd
  165 root       2    0   300K  708K sleep     0:00  0.00%  0.00% mountd
  355 root       2    0   148K  620K sleep     0:00  0.00%  0.00% rlogind
 2575 root       2    0   148K  620K sleep     0:00  0.00%  0.00% rlogind
  707 root       2    0   148K  620K sleep     0:00  0.00%  0.00% rlogind
  888 root       2    0   148K  620K sleep     0:00  0.00%  0.00% rlogind
  649 root       2    0   148K  620K sleep     0:00  0.00%  0.00% rlogind


> I thought NetApp toasters did do TCP.. hm. Oh yes, one final
> thing: do other clients work?

Other clients seem to be perfectly happy with the NetApp filers. These are
mostly Sun boxes. My FreeBSD box seems to be working fine with the NetApps
too.  They (NetApps) are pretty good when they don't crash.

--Artem