port-alpha: Re: NFS writes NetBSD vs FreeBSD

Subject: Re: NFS writes NetBSD vs FreeBSD
To: None <port-alpha@netbsd.org>
From: Stephen Jones <smj@cirr.com>
List: port-alpha
Date: 07/16/2004 10:59:16
Well, both machines hung hard late last night with no signs of why, and
I unfortunately do not have access to the halt switches on either to try
to continue from SRM into a debugger.

But, my tests ran for about an hour before the FreeBSD CS20 hung hard
first.  While I was power cycling it to reboot, the NetBSD CS20 hung
hard (no doubt the loss of its NFS mount triggered some problems).

My goals are simple .. reliability and user experience... I'm not really
looking at getting the fastest chunk of data across an NFS mount as long
as speed is decent and other processes and operations running on that
mount aren't negatively affected.  To simulate my users, I did the
following on both machines:

1. Allocated a 520mb md/mfs and had dd continuously write 512mb of zeros
   to it (both CS20s use the same type of memory)
2. ftp a 500 files of zeros from a local disk on the remote to a local
   disk locally, repeatedly, on the primary (fxp1)
3. write 200mb of zeros repeatedly on each other's nfs mounted
   filesystem (via fxp0)
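
The repeated zero-write in steps 1 and 3 can be sketched roughly as
below.  This is only an illustration, not the commands actually used
(those were dd-based); the function names and the /mnt/nfs path are
made up, and the fsync call is an assumption about wanting the data
pushed out before the timer stops.

```python
import os
import time


def timed_zero_write(path, size_mb, chunk_mb=1):
    """Write size_mb megabytes of zeros to path and return elapsed seconds."""
    chunk = bytes(chunk_mb * 1024 * 1024)  # a buffer of zero bytes
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # assumption: include flush-to-storage in the timing
    return time.monotonic() - start


def run_iterations(path, size_mb, iterations):
    """Repeat the write, overwriting the same file, and collect the timings."""
    return [timed_zero_write(path, size_mb) for _ in range(iterations)]


if __name__ == "__main__":
    # Hypothetical NFS mount point; 200 MB matches step 3 above.
    for i, secs in enumerate(run_iterations("/mnt/nfs/zeros.bin", 200, 10), 1):
        print(f"iteration {i}: {secs:.1f}s")
```

Logging one timing per iteration makes a gradual slowdown visible
before a hang occurs.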

What was interesting is that the FreeBSD load report was strange... I
saw 0.19 at the height of it, yet the system felt very sluggish and
local and remote operations had a significant delay measurable in
seconds .. sometimes the system would appear to be hung with no response
even from a ^T and then suddenly come back to life with no errors
reported.  Writing 512mb to the memory disk initially took about 6.7
seconds but, as you'd expect, slowly moved up to about 22.4 seconds over
257 iterations before hanging.

NetBSD started out at 4.3 seconds and slowly made its way up to 22.4
over 268 iterations before hard hanging shortly after FreeBSD hung.
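
For a rough sense of scale, the timings above can be turned into
throughput.  This is plain arithmetic on the figures quoted above,
assuming "512mb" means 512 megabytes written per iteration; the names
in the snippet are made up for the illustration.

```python
def throughput_mb_s(size_mb, seconds):
    """Average throughput in MB/s for one write of size_mb taking seconds."""
    return size_mb / seconds


SIZE_MB = 512  # assumption: "512mb" in the text means 512 megabytes

# Seconds per iteration as reported in the text.
timings = {
    "FreeBSD, first iteration": 6.7,
    "NetBSD, first iteration": 4.3,
    "both, last iterations": 22.4,
}

rates = {label: round(throughput_mb_s(SIZE_MB, secs), 1)
         for label, secs in timings.items()}

if __name__ == "__main__":
    for label, rate in rates.items():
        print(f"{label}: {rate} MB/s")
```

That works out to about 76.4 MB/s (FreeBSD) and 119.1 MB/s (NetBSD) at
the start, with both ending near 22.9 MB/s.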

The transfer numbers again aren't super important to me, but rather how
the systems felt during that hour.  I had a loop with a 20 second sleep
doing ls operations on each NFS mounted filesystem and I found that the
NetBSD CS20 completed using fewer real seconds (by a few, sometimes
more) than the same operations running on the FreeBSD CS20 ..  I also
had top running on both to monitor process states for vnlock or
nfsrcvlk deadlocks.  I'm guessing I'd have to get something like this
into production to actually see if vnlock deadlocks even occur.  As we
had suspected, once the fxp driver was straightened out this would
improve.
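
A loop like the ls one above might look roughly like this.  It is a
sketch, not the script that was actually run: it uses os.listdir
rather than ls, the function names and the /mnt/nfs path are
hypothetical, and only the 20 second sleep comes from the description.

```python
import os
import time


def time_listing(path):
    """List a directory and return (elapsed wall-clock seconds, entry count)."""
    start = time.monotonic()
    entries = os.listdir(path)
    return time.monotonic() - start, len(entries)


def monitor(path, interval=20, rounds=3, sleep=time.sleep):
    """Time a listing of path `rounds` times, sleeping `interval` seconds between."""
    results = []
    for i in range(rounds):
        results.append(time_listing(path))
        if i < rounds - 1:
            sleep(interval)
    return results


if __name__ == "__main__":
    # Hypothetical NFS mount point.
    for elapsed, count in monitor("/mnt/nfs"):
        print(f"{count} entries in {elapsed:.2f}s")
```

Running the same loop on both clients at the same time gives directly
comparable wall-clock numbers for how responsive each mount feels.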

Starting top on FreeBSD usually takes several seconds, but during the
high load and activity it took several minutes.  On NetBSD top started
up within a few seconds, and ps, vmstat, netstat, pstat .. all responded
faster than on the FreeBSD CS20.

After both hung hard, I rebooted them and just started #3 up, which has
run continuously all night (at least for the past 11 hours).

No NFS timeout / not responding/alive messages have been reported by
either, but we assumed that would go away once the fxp driver was sorted
out.

There was no evidence of what caused the hard hangs and I think that is
the toughest problem.  Michael Hitch pointed out that it is possible to
get to a debugger, but you need some sort of minion to hit the halt
switch for you.  I've got a scrap CS20 here I'm going to try to wire up
an APC remote to (MP or not, I hate power cycling CS20s or any
computer).

I suppose what I could do is try running the same abusive test with
single CPU kernels and hopefully get the panic message.