port-i386: Re: possible problem with NFS?

Subject: Re: possible problem with NFS?
To: Randy Terbush <randy@zyzzyva.com>
From: Tom Ivar Helbekkmo <tih@nhh.no>
List: port-i386
Date: 08/01/1995 22:09:00

On Mon, 31 Jul 1995, Randy Terbush wrote:

> The problem first manifested itself when doing large (>10MB) file
> reads over a PPP NFS link. Lots of dropped packets. I gave up trying
> to run that hourly cron process (which had been running without fail
> prior to July 3rd).
> 
> I can also lock the machine when attempting to sup an entire source
> tree.  Locks after about 3 hours.

Sounds like it might be the same problem I'm trying to figure out.  I
used to run a -current from around the middle of March, which was OK
until I recently started using the 386 box as an NFS server.  This
caused a number of crashes which were clearly related to NFS activity,
so I decided to upgrade.

I installed the July 7th tar file kit, and it seems that the NFS
problem had gone away at that stage, but been replaced by another one.
I now get crashes when I've got heavy serial port activity.  I've
since upgraded to the July 27th tar file kit, and have tried replacing
my 25MHz i386 CPU with a 33MHz i486, but neither of these experiments
made any difference, except that definite improvements have been made
from July 7th to July 27th; the strain of serial I/O on the rest of
the system is very much less after this upgrade.  With the July 7th
system, just switching virtual consoles would cause lost characters
and retransmissions.

The specific crash that I keep getting goes like this:  I start a
large file transfer with kermit, fetching a file in through a serial
port (with a 16550a) at 19200 bps.  Most of the time, between a half
and one and a half megabytes will be transferred before the crash,
when I get the following output on the console:

	vm_fault(f87ca400, f7fec000, 1, 0) -> 1
	kernel: page fault trap, code = 0
	stopped at: _comstart + 0x9c: movb 0(%ebx), %al

The "f87ca400" differs from crash to crash, but the "f7fec000" is
always the same.  The page fault is always at the same instruction in
the comstart routine, and the ebx register contains f7fec000.  A
traceback in the kernel debugger shows the call stack to be:

	_comstart
	_ttstart
	_ttwrite
	_comwrite
	_spec_write
	_ufsspec_write
	_vn_write
	_write
	_syscall

So, there's nothing surprising there, just a regular write to the
device, which somehow gets screwed up along the way.  And of course,
it only happens once in a very long while...  :-)
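
For comparing several crash reports of this kind, a small script can pull
the addresses out of the vm_fault line and check whether the faulting
address sits on a page boundary.  This is only a sketch: the field names
(map, va, ftype, wiring, rv) are assumed labels for the printed values,
and the 4096-byte page size is the usual i386 figure, not something
stated in the console output.

```python
import re

PAGE_SIZE = 4096  # assumed i386 page size

# Matches lines such as: vm_fault(f87ca400, f7fec000, 1, 0) -> 1
VM_FAULT_RE = re.compile(
    r"vm_fault\(([0-9a-f]+),\s*([0-9a-f]+),\s*([0-9a-f]+),\s*(\d+)\)"
    r"\s*->\s*([0-9a-f]+)"
)


def parse_vm_fault(line):
    """Return the vm_fault fields as integers, or None if no match.

    The key names are assumed labels, not taken from the console output.
    """
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "map": int(m.group(1), 16),     # differs from crash to crash
        "va": int(m.group(2), 16),      # faulting address, same each time
        "ftype": int(m.group(3), 16),
        "wiring": int(m.group(4)),
        "rv": int(m.group(5), 16),
    }


def is_page_aligned(addr, page_size=PAGE_SIZE):
    """True if addr is the first byte of a page."""
    return addr % page_size == 0


if __name__ == "__main__":
    fault = parse_vm_fault("vm_fault(f87ca400, f7fec000, 1, 0) -> 1")
    print(hex(fault["va"]), is_page_aligned(fault["va"]))
```

With the line quoted above, the faulting address f7fec000 comes out
page-aligned.  That would be consistent with a pointer being stepped
across the end of a mapped page, but it does not prove that this is what
happens in comstart.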

-tih