port-sparc: Re: Problems with nfsd on NetBSD 1.2.1/sparc (some useful info)

Subject: Re: Problems with nfsd on NetBSD 1.2.1/sparc (some useful info)
To: Frank van der Linden <frank@wins.uva.nl>
From: Brian Buhrow <buhrow@cats.ucsc.edu>
List: port-sparc
Date: 04/28/1997 15:32:48

OK. After debugging the problems I was having with gdb (you need to
be running in 8-bit mode to use NetBSD gdb), I think I have a way to
reproduce the problem, an idea of where the problem might be (though I'll
admit that I don't understand exactly what is going on here), and a
possible fix.

The problem.
I have a NetBSD/sparc system (an IPC with 24MB of memory) running
NetBSD 1.2.1 (no changes from the 1.2.1 release) acting as a file server
for both a VAX running Ultrix 4.2 and an i386 NetBSD 1.2 system. During
periods of heavy i/o, i.e. tarring huge trees onto the nfs partition, or
removing huge trees from the nfs partition, the nfsds on the IPC would get
stuck in disk wait and the wchan from ps would give the address: 143748.
The only way to get things going again was to reboot the system, at which
point things would run fine until the process repeated itself.

How to reproduce.
1. Set up a Sparc with NetBSD 1.2.1 and export a ccd filesystem via
nfs.

2. Use a NetBSD/i386 nfs client to read and write, mostly write, large
amounts of data to the nfs exported filesystem. When the nfsd with the
lowest process id uses somewhere between 37 seconds and 1 minute of CPU time, it
will lock down in the kernel in the indicated location. All other nfsds
will follow suit.
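
For anyone who wants to try step 2 without hunting for a big tree to tar,
here is a rough sketch of a write-then-remove load generator. It is not
what I actually ran (I used tar and rm); the function name churn, the
directory name nfs-churn and the default sizes are made up for
illustration, and the temporary directory is only a stand-in for the
client's nfs mount point.

```python
import os
import shutil
import tempfile

def churn(root, dirs=4, files_per_dir=8, file_size=64 * 1024):
    """Write a tree of files under root, then remove it; return bytes written."""
    tree = os.path.join(root, "nfs-churn")   # hypothetical scratch directory
    payload = b"\0" * file_size
    written = 0
    for d in range(dirs):
        subdir = os.path.join(tree, f"d{d}")
        os.makedirs(subdir)
        for f in range(files_per_dir):
            with open(os.path.join(subdir, f"f{f}"), "wb") as fh:
                fh.write(payload)
            written += file_size
    shutil.rmtree(tree)                       # the "removing huge trees" half
    return written

if __name__ == "__main__":
    # Assumption: replace the temporary directory with the nfs mount point
    # on the client, and raise the counts until the server is kept busy.
    with tempfile.TemporaryDirectory() as mount_point:
        print(churn(mount_point), "bytes written and removed")
```

Running it in a loop with much larger counts should approximate the
mostly-write load described above, though I can't promise it triggers the
hang as reliably as tar does.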

NOTES AND THEORIES
Upon investigation of the address given in ps, it looks like the nfsds
are getting stuck somewhere that is trying to update values in kmemstats,
which is an array of structures where each element is the type of memory
being allocated. According to my calculations, each structure is 32
(0x20) bytes in size, and the distance between the start of the kmemstats
array, 0xf8142e48, and the location of interest, 0xf8143748, is 0x900 bytes.
0x900/0x20 is 0x48, or 72. kmemstats[72] is the M_NFSRVDESC memory bucket,
which is allocated in nfs_socket.c and freed at various points in
nfs_socket.c and nfs_syscalls.c.
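
The arithmetic above can be checked with a few lines of Python. The two
addresses and the 32-byte entry size are the ones quoted in this message;
the entry size is my own estimate of sizeof(struct kmemstats), and the
helper name kmemstats_index is only for illustration.

```python
KMEMSTATS_BASE = 0xf8142e48   # start of the kmemstats array
WCHAN_ADDR     = 0xf8143748   # address the nfsds appear to be sleeping on
ENTRY_SIZE     = 0x20         # assumed size of one kmemstats entry (32 bytes)

def kmemstats_index(addr, base=KMEMSTATS_BASE, size=ENTRY_SIZE):
    """Return (array index, byte offset inside that entry) for an address."""
    return divmod(addr - base, size)

index, remainder = kmemstats_index(WCHAN_ADDR)
print(f"offset = {WCHAN_ADDR - KMEMSTATS_BASE:#x}, "
      f"index = {index}, remainder = {remainder}")
```

A remainder of 0 means the address falls exactly on the start of an entry,
which fits the idea that the nfsds are sleeping on the kmemstats entry
itself. If the entry size on your kernel differs, the index will too.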
At this point, my understanding of what is going on is growing very
sketchy. The place where the bucket is allocated is nfs_socket.c, line
2122 in nfsrv_dorec(). The version of the file is:
/* $NetBSD: nfs_socket.c,v 1.27.4.4 1997/03/04 18:06:29 mycroft Exp $ */
There seems to be some sort of race condition between the kernel allocator
and the nfs rpc code, though what it is and where it is, I'm not sure. I
hope this is enough information to let people who know more find the
problem and squash it. Note that it might have to do with the fact that
I'm serving from a ccd, but I haven't tested this theory yet.

My fix
My fix was to install a generic NetBSD 1.2 kernel (NetBSD-1.2/sparc/extras/netbsd from ftp.netbsd.org).
Everything works fine in terms of nfs service now.

I'd be interested in any comments/fixes/things to try anyone might
have.
-thanks
-Brian