tech-kern: Re: serial port silo overflow repair

Subject: Re: serial port silo overflow repair
To: None <tech-kern@NetBSD.ORG>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-kern
Date: 07/29/1997 00:34:05
"Erik E. Fair" (Time Keeper) <fair@clock.org> writes:

 >We've had a recurring problem with serial port "silo overflow" conditions
 >happening in various versions of NetBSD on various platforms which we've
 >not dealt with in a systematic way. A friend of mine who once worked on the
 >Amiga DOS suggested a way to find those places in our kernel that are
 >locking out serial interrupts for "too long", thus causing "silo overflows":
 >
 >Every time a silo overflow occurs, the serial interrupt handler should
 >report not only that it did happen, but also where the program counter was
 >when it did (i.e. pick the PC off the stack and print it). He also
 >recommended reporting the PC for "near full" silo conditions, to find the
 >marginal cases. Since the interrupt will be posted just after interrupts
 >are enabled again, the PC should be at the end of the offending code
 >section(s). Armed with this information, we can tune the kernel better (and
 >in the end, it should be more responsive to a whole host of events, not
 >just serial interrupts).


 >In principle, this information should be easy to get at, since the clock
 >interrupt routine does it to count CPU idle time (by comparing the PC
 >against the location in locore.s where the CPU will be spinning when there
 >is nothing to run).


Not to rain on your parade or your friend, but....

This is a nice idea, but it often doesn't work quite as suggested.

What almost always happens to *me* is that you discover the interrupt
is actually serviced in the tail of splx(), when the Bad Guy
lowered SPL down to where the serial interrupt went off.

I've seen this same thing ad nauseam when doing kernel profiling.
I regularly see that up to 30% of kernel time is (apparently) inside
splx().

The idle-time counter you mention is a special case: we _know_ the
idle loop runs at spl0 :).

To apply this on *BSD kernels where splx() isn't inlined, you really
need to do a stack traceback through several frames below splx() to
figure out where the offending routine is, and where it's called from;
and you probably need to do that often enough to get a good sample of
where the interrupts are being blocked out. (If you take just one
sample it really might be a short-lived splstatclock() or splhigh() that
pushes the serial port over the limit, when the problem is really
elsewhere.)
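
To make the "good sample" point concrete, here is a rough userland
sketch of the bookkeeping only. It assumes you already have some way to
dig the caller's return address out of the frames below splx(); that
part is port-specific and not shown. The names blame_record() and
blame_worst() and the table size are made up for illustration; nothing
like this exists in the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; a userland model, not NetBSD code. */
#define NSLOTS 64

struct blame_slot {
	uintptr_t caller;	/* return address found below splx() */
	unsigned  hits;		/* overflow samples attributed to it */
};

static struct blame_slot blame[NSLOTS];

/*
 * Record one sample. Returns the updated hit count for this caller,
 * or 0 if the table is full and the sample was dropped.
 */
unsigned
blame_record(uintptr_t caller)
{
	size_t i;

	assert(caller != 0);	/* 0 is reserved to mean "no caller" */
	for (i = 0; i < NSLOTS; i++) {
		/* Slots fill in order, so the first empty one ends the search. */
		if (blame[i].hits == 0 || blame[i].caller == caller) {
			blame[i].caller = caller;
			return ++blame[i].hits;
		}
	}
	return 0;
}

/* Return the caller with the most samples, or 0 if nothing was recorded. */
uintptr_t
blame_worst(void)
{
	uintptr_t worst = 0;
	unsigned best = 0;
	size_t i;

	for (i = 0; i < NSLOTS; i++) {
		if (blame[i].hits > best) {
			best = blame[i].hits;
			worst = blame[i].caller;
		}
	}
	return worst;
}
```

With a table like this, one stray splhigh() sample shows up as a single
hit, while the real offender accumulates hits over many overflows.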

On many ports, doing this stack traceback is infeasible without
debugging info; and you need to figure out where to put the traceback,
too. I suppose you can addlog() the stacktrace. (I hacked the pmax
port to do that.) But the volume of debugging output often causes yet
more overruns.


And, of course, on some (many?) ports, if the problem is really in a
device-driver or code called from a device-driver interrupt routine,
spl doesn't get lowered until the offending interrupt routine has
returned back to generic interrupt-dispatch code for the bus the Bad
Guy is on.

At which point it's just too late for this technique: the Bad Guy's
stack frames are *gone*.


One possible exception is the i386 port, where actually disabling
interrupts is too expensive, and SPLs are enforced with a software
mutex.  Since the interrupts are never actually disabled, it's
_relatively_ straightforward to add stack-traceback code there, to
trace when serial (or other) interrupts go off when they're blocked.
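
A toy userland model of that idea, under loud assumptions: this is not
the i386 code, every name in it (splraise_model(), splx_model(),
serial_intr_model(), blocked_at) is invented here, and a plain integer
stands in for the interrupted PC. It only shows where the hook would
go: because the mask is in software, the interrupt entry code still
runs while the level is raised, so it can note who was running before
deferring the handler.

```c
#include <assert.h>
#include <stdint.h>

/* Userland model of software-enforced SPLs. All names are hypothetical. */
enum { IPL_NONE = 0, IPL_TTY = 3, IPL_SERIAL = 5, IPL_HIGH = 7 };

static int       cpl;		/* current priority level */
static unsigned  ipending;	/* bitmask of deferred interrupt levels */
static uintptr_t blocked_at;	/* "PC" running when serial was first deferred */
static unsigned  serial_runs;	/* times the serial handler actually ran */

/* Raise the level; returns the old level for a later splx_model(). */
int
splraise_model(int level)
{
	int old = cpl;

	if (level > cpl)
		cpl = level;
	return old;
}

/* The hardware delivers a serial interrupt while `pc` is executing. */
void
serial_intr_model(uintptr_t pc)
{
	if (cpl >= IPL_SERIAL) {
		/* Masked in software: remember the culprit, then defer. */
		if ((ipending & (1u << IPL_SERIAL)) == 0)
			blocked_at = pc;
		ipending |= 1u << IPL_SERIAL;
		return;
	}
	serial_runs++;		/* not masked: handler runs right away */
}

/* Set the level and run the deferred handler if it is now unmasked. */
void
splx_model(int level)
{
	assert(level >= IPL_NONE && level <= IPL_HIGH);
	cpl = level;
	if ((ipending & (1u << IPL_SERIAL)) != 0 && cpl < IPL_SERIAL) {
		ipending &= ~(1u << IPL_SERIAL);
		serial_runs++;	/* a PC sampled here points into splx */
	}
}
```

Note that the deferred handler runs from inside splx_model(), which is
the same effect that makes profiles blame splx(); the PC saved at
deferral time is the one that points at the code holding the block.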

Now that Charles has rewritten com.c to use splserial() for serial
device interrupts and spltty() for `normal' tty-driver processing,
this should be less of a problem.

But really, the generic answer to this problem has been well-known for
decades: increase the width of the interface to the tty bottom-half
(to allow queues of chars in and out, not a call per char); and do
pseudo-DMA to accumulate a reasonable amount of characters per call.
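
For anyone who hasn't seen the shape of this, a minimal sketch follows.
It is a generic single-producer, single-consumer ring, not code from
any driver; struct rxring, rx_put(), rx_drain() and the 256-byte size
are all assumptions made up for the example. A real kernel version
would also need volatile or barrier discipline between the interrupt
side and the tty side, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

#define RB_SIZE 256u	/* power of two; the size is an arbitrary assumption */

struct rxring {
	unsigned char buf[RB_SIZE];
	unsigned head;		/* advanced only by the interrupt side */
	unsigned tail;		/* advanced only by the tty side */
	unsigned overflows;	/* chars dropped because the ring was full */
};

/* Interrupt side: stash one char and return. Returns 1 if it was stored. */
int
rx_put(struct rxring *r, unsigned char c)
{
	if (r->head - r->tail == RB_SIZE) {
		r->overflows++;
		return 0;
	}
	r->buf[r->head % RB_SIZE] = c;
	r->head++;
	return 1;
}

/* tty side: take up to `max` chars in one call. Returns the count copied. */
size_t
rx_drain(struct rxring *r, unsigned char *dst, size_t max)
{
	size_t n = 0;

	assert(dst != NULL);
	while (n < max && r->tail != r->head) {
		dst[n++] = r->buf[r->tail % RB_SIZE];
		r->tail++;
	}
	return n;
}
```

The point is that the interrupt side does a handful of instructions per
character, and the expensive tty processing happens once per batch at a
lower priority instead of once per character.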


Still, this -- on ports where the interrupt enable/disable structure
makes it feasible -- sounds like a great idea. I'd be interested to
see the results.