Subject: Re: NetBSD-1.5 and NFS - any suggestions?
To: Frank van der Linden <fvdl@wasabisystems.com>
From: Artem Belevich <art@riverstonenet.com>
List: netbsd-users
Date: 01/25/2002 16:27:18
> Wait.. is this on a NetBSD *client* that is temporarily stuck?
> Is it also serving as an NFS server? Ret-Failed is the number
> of RPC replies that signal an error condition.
These machines get most of their traffic as NFS clients but they also
export some local filesystems.
It's unclear what exactly causes the system lockups. So far there is a
strong correlation between a system getting stuck and compilation on an
NFS-mounted filesystem. Lockups do not happen during the night, so they
must be related to some user activity.
> If reducing the write size doesn't help, it's hard to tell what's
> going on. Maybe you can send me a tcpdump output from when a
> hang occurs, which shows before-during-after traffic. Also,
It can take a while, but I'll get it.
> if you see some kind of a chance to run ps while it's hanging
> (maybe start up top(1) before the hiccup occurs and watch the
> window), I'd like to see the output of it. The STATE (for top)
> or WCHAN (for ps) columns are important.
I did catch the pause in top before, but it didn't give many
clues. Before the pause everything was normal (most processes in
"sleep" state, 1 run, 1 onproc). Then everything gets stuck, with top
still showing a picture of a quiet upscale neighbourhood with everybody
sleeping. A 1-second refresh appears too slow to catch the glitch. :-(
I'll try running a nice'd top with constant updates. Maybe that will help.
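A minimal sketch of that kind of constant sampling, done in Python rather
than with top itself: it dumps ps output including the WCHAN column a few
times a second, and afterwards the recorded timestamps can be scanned for
gaps. The function names, the 0.25-second interval and the slack factor are
assumptions of mine, and the ps keywords (pid, state, wchan, command) should
be checked against ps(1) on the box in question. If the whole machine
freezes, the sampler freezes with it, so a gap in the timestamps marks the
pause and the last sample before the gap is the interesting one.

```python
import subprocess
import sys
import time


def find_stalls(timestamps, interval, slack=3.0):
    """Return (start, gap) pairs where two consecutive samples are
    further apart than interval * slack seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval * slack:
            stalls.append((prev, gap))
    return stalls


def sample_ps():
    """Take one ps snapshot with the wait channel of every process.
    The keyword list is an assumption; adjust it to the local ps(1)."""
    proc = subprocess.run(
        ["ps", "-axo", "pid,state,wchan,command"],
        capture_output=True, text=True, check=False,
    )
    return proc.stdout


def run_sampler(interval=0.25, out=sys.stdout):
    """Loop until interrupted, writing a timestamp header and a ps
    snapshot on every pass."""
    while True:
        out.write("=== %.3f ===\n" % time.time())
        out.write(sample_ps())
        out.flush()
        time.sleep(interval)
```

The log should go to a local disk, not to an NFS mount, or the sampler
will block on the very thing it is trying to observe. Afterwards, the
timestamps from the "===" header lines can be fed to find_stalls().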
Oh! I've just got one of the systems stuck. Here's top's output (pretty
useless). Three nfsd daemons are close to the top, though.
Maybe the pauses *are* related to the NFS server on the boxes.
OK, the box just got unstuck. This one was short - about 20 seconds.
I've seen pauses as long as 10 minutes.
load averages: 0.12, 0.29, 0.69 16:07:08
60 processes: 59 sleeping, 1 on processor
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 32M Act, 12M Inact, 4060K Wired, 357M Free, 2049M Swap free
  PID USERNAME PRI NICE   SIZE    RES STATE   TIME   WCPU    CPU COMMAND
 4086 ???????    2    0  9156K    16M sleep   0:43  0.00%  0.00% xemacs-21.1.
 3276 root       2    0    48K   248K sleep   0:29  0.00%  0.00% nfsd
  248 root       2    0   644K  1508K sleep   0:19  0.00%  0.00% sshd
 3277 root       2    0    48K   248K sleep   0:03  0.00%  0.00% nfsd
 1401 root       2    0   644K  1508K sleep   0:01  0.00%  0.00% sshd
 4346 root      28    0   276K   776K onproc  0:00  0.00%  0.00% top
  209 root      18  -12   652K  3472K sleep   0:00  0.00%  0.00% ntpd
  709 root      18    0   452K   316K sleep   0:00  0.00%  0.00% csh
 4093 ???????   18    0   452K   312K sleep   0:00  0.00%  0.00% csh
  890 ????????  18    0   448K   304K sleep   0:00  0.00%  0.00% csh
 3854 ???????   18    0   436K   296K sleep   0:00  0.00%  0.00% csh
  372 ??????    10    0   656K  1036K sleep   0:00  0.00%  0.00% bash
  250 ???       10    0   620K  1008K sleep   0:00  0.00%  0.00% bash
 3927 ???       10    0   788K   928K sleep   0:00  0.00%  0.00% perl
 3213 root      10    0   524K   852K sleep   0:00  0.00%  0.00% bash
  240 root      10    0   216K   428K sleep   0:00  0.00%  0.00% cron
    1 root      10    0   312K   192K sleep   0:00  0.00%  0.00% init
 3861 ???????    3    0  1016K  1348K sleep   0:00  0.00%  0.00% tcsh
 4098 ???????    3    0   972K  1256K sleep   0:00  0.00%  0.00% tcsh
 1600 root       3    0   988K  1248K sleep   0:00  0.00%  0.00% vi
 2577 ??????     3    0   900K  1244K sleep   0:00  0.00%  0.00% tcsh
  650 ??????     3    0   684K  1064K sleep   0:00  0.00%  0.00% bash
  356 ??????     3    0   656K  1036K sleep   0:00  0.00%  0.00% bash
 4426 ??????     3    0   648K   996K sleep   0:00  0.00%  0.00% vi
 1402 ???        3    0   608K   992K sleep   0:00  0.00%  0.00% bash
  242 root       3    0    48K   436K sleep   0:00  0.00%  0.00% getty
  245 root       3    0    48K   428K sleep   0:00  0.00%  0.00% getty
  244 root       3    0    48K   428K sleep   0:00  0.00%  0.00% getty
  243 root       3    0    48K   428K sleep   0:00  0.00%  0.00% getty
  892 ????????   3    0   460K   320K sleep   0:00  0.00%  0.00% ksh
  141 root       2    0   564K   960K sleep   0:00  0.00%  0.00% rpcbind
  191 root       2    0   436K   788K sleep   0:00  0.00%  0.00% amd
  232 root       2    0   240K   728K sleep   0:00  0.00%  0.00% sshd
  165 root       2    0   300K   708K sleep   0:00  0.00%  0.00% mountd
  355 root       2    0   148K   620K sleep   0:00  0.00%  0.00% rlogind
 2575 root       2    0   148K   620K sleep   0:00  0.00%  0.00% rlogind
  707 root       2    0   148K   620K sleep   0:00  0.00%  0.00% rlogind
  888 root       2    0   148K   620K sleep   0:00  0.00%  0.00% rlogind
  649 root       2    0   148K   620K sleep   0:00  0.00%  0.00% rlogind
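
For comparing a snapshot like this one against one taken before a pause,
the STATE column can be tallied mechanically. The following is only a rough
sketch: the function name is made up, and it assumes the eleven-column
layout shown in the header line above (PID through COMMAND), with STATE as
the seventh whitespace-separated field and no spaces inside usernames.

```python
from collections import Counter


def tally_states(top_lines):
    """Count processes per STATE in a captured top(1) listing.
    Summary and header lines are skipped because they either do not
    start with a numeric PID or have fewer than eleven fields."""
    states = Counter()
    for line in top_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        states[fields[6]] += 1
    return states
```

Run over two snapshots, a difference in the counts (or the lack of one, as
here) shows quickly whether anything changed state across the pause.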
> I thought NetApp toasters did do TCP.. hm. Oh yes, one final
> thing: do other clients work?
Other clients seem to be perfectly happy with the NetApp filers. These are
mostly Sun boxes. My FreeBSD box seems to be working fine with the NetApps
too. They (the NetApps) are pretty good when they don't crash.
--Artem