port-i386: Re: month long page fault traps

Subject: Re: month long page fault traps
To: Kent Vander Velden <graphix@iastate.edu>
From: Tom Ivar Helbekkmo <tih@nhh.no>
List: port-i386
Date: 08/07/1995 08:49:18

On Fri, 4 Aug 1995, Kent Vander Velden wrote:

>   For some time now my machine has crashed every couple of days with the
> same sort of message:
> 
>   vm fault(f8752700, f7fec000, 1, 0) -> 1
>   kernel: page fault trap, code=0
>   stopped at comstart+0x9c: movb 0(%ebx),%al
> 
>   This kernel has been staying fairly current.  The current kernel on
> the machine was current as of two days ago.  This is a i386 kernel.  Is
> there something that I can do to solve this problem?  Have other people
> seen this problem?

I've been trying to figure this one out for a while now.  I've been
getting the exact same crash while running the July 8th and July 29th
tar file distributions from ftp.iastate.edu.  The traceback at the
crash shows the call stack to be:

	_comstart
	_ttstart
	_ttwrite
	_comwrite
	_spec_write
	_ufsspec_write
	_vn_write
	_write
	_syscall

I'm running a 25MHz i386 with 16550A (buffered, that is) UART chips.
The crash typically occurs when I use the serial port heavily, and
seems to depend on there being network and/or disk activity by
other processes -- but I can't prove this.  My impression is that I
can load the machine down as much as I like with network and disk I/O
while a kermit transfer is going on, and it will hold up for a while
-- but when the crash does happen, it always seems to be in direct
response to some specific action.  For instance, if I do lots of
ftp'ing in and out of the NetBSD system while kermit is running, it'll
stand up to that for a while -- and then I might, after a pause, just
give a 'pwd' command from an ftp client on another machine, and POOF!
Instant crash.  I can't prove any connection between the action and
the crash, though.

The code in comstart looks like this:

	tp->t_state |= TS_BUSY;
	if (sc->sc_hwflags & COM_HW_FIFO) {
		u_char buffer[16], *cp = buffer;
		int n = q_to_b(&tp->t_outq, cp, sizeof buffer);
		do {
			outb(iobase + com_data, *cp++);
		} while (--n);
	} else
		outb(iobase + com_data, getc(&tp->t_outq));

The crash happens while dereferencing 'cp' to fetch the byte that
'outb' is to transmit inside the 'do' loop.  Robert Dobbs noted that
the call to q_to_b() might return 0 if, for some reason, there weren't
any characters to be transmitted.  In that case the 'do' loop still
runs once, '--n' goes negative, and the loop keeps reading past the end
of 'buffer'.  Following his suggestion, I changed this to:

	if (sc->sc_hwflags & COM_HW_FIFO) {
		u_char buffer[16], *cp = buffer;
		int n = q_to_b(&tp->t_outq, cp, sizeof buffer);
		if (n) {
			tp->t_state |= TS_BUSY;
			do {
				outb(iobase + com_data, *cp++);
			} while (--n);
		}
	} else {
		tp->t_state |= TS_BUSY;
		outb(iobase + com_data, getc(&tp->t_outq));
	}

(I've added a test for the return value, and moved the setting of the
busy flag in, so that it only gets set when something is transmitted.)
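
To make the n == 0 case concrete, here is a small user-space model of
the two loop shapes.  It is only a sketch, not kernel code: the
function names and the 'cap' parameter are made up for illustration,
and 'sent++' stands in for the outb() call.  The cap exists purely so
the model itself stops; the real unguarded loop has no such limit.

```c
#include <stddef.h>

/*
 * Model of the original loop: "do { ... } while (--n)" with no check
 * on n.  Returns how many bytes would be sent, giving up after `cap`
 * iterations (hypothetical limit, for the model only).
 */
size_t
unguarded_sends(int n, size_t cap)
{
	size_t sent = 0;

	do {
		sent++;			/* stands in for outb(..., *cp++) */
		if (sent == cap)
			break;		/* the real loop would keep going */
	} while (--n);

	return sent;
}

/* Model of the patched loop: the same body behind an "if (n)" guard. */
size_t
guarded_sends(int n, size_t cap)
{
	size_t sent = 0;

	if (n) {
		do {
			sent++;
			if (sent == cap)
				break;
		} while (--n);
	}
	return sent;
}
```

With n = 16 both versions send 16 bytes.  With n = 0 the guarded
version sends nothing, while the unguarded one runs until it hits the
cap, because '--n' yields -1 (nonzero) after the first pass.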

Sure enough, this avoids the crash in comstart.  However, the machine
still crashes, only now it stops at a "repe movsl (%esi), %es:(%edi)"
in bcopy().  The call traceback now shows bcopy() being called by
ttwrite(), which is clearly bogus, since there is no such call there.
In other words, the stack is screwed up.

There have been recent changes to "/sys/arch/i386/isa/isa_machdep.c"
to change the relative "interruptibility" of the various interrupt
handlers in the kernel.  As an experiment, I tried to mutually block
the NET, TTY, BIO and IMP stuff like this:

	imask[IPL_IMP] |= imask[IPL_TTY] | imask[IPL_NET] | imask[IPL_BIO];
	imask[IPL_TTY] |= imask[IPL_IMP];
	imask[IPL_NET] |= imask[IPL_IMP];
	imask[IPL_BIO] |= imask[IPL_IMP];

This did not help, so the problem is probably not one of relative
priorities between these splXXX() levels.
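
For anyone who wants to check what those four assignments do to the
masks, here is a stand-alone sketch.  The enum values and the function
name are assumptions made for this illustration and are not taken
from the kernel headers; only the four assignments are as above.

```c
/* Hypothetical level indices, for this model only. */
enum { IPL_BIO, IPL_NET, IPL_TTY, IPL_IMP, NIPL };

/*
 * Apply the four assignments from the experiment to a mask table.
 * IPL_IMP first collects the bits of the other three levels; each of
 * those then picks up the combined IPL_IMP mask.
 */
void
block_mutually(int imask[NIPL])
{
	imask[IPL_IMP] |= imask[IPL_TTY] | imask[IPL_NET] | imask[IPL_BIO];
	imask[IPL_TTY] |= imask[IPL_IMP];
	imask[IPL_NET] |= imask[IPL_IMP];
	imask[IPL_BIO] |= imask[IPL_IMP];
}
```

Because IPL_IMP is updated first, all four masks end up equal to the
union of the original four, so each of these levels blocks interrupts
belonging to any of the others.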

Any ideas, anyone?

-tih