Subject: Re: measuring interrupt latency etc.
To: Eduardo Horvath <eeh@turbolinux.com>
From: Leo Weppelman <leo@wau.mis.ah.nl>
List: tech-kern
Date: 05/03/2000 09:57:58
On Tue, May 02, 2000 at 09:06:20AM -0700, Eduardo Horvath wrote:
> On Tue, 2 May 2000, Leo Weppelman wrote:
> 
> > Since 4ms is quite some time, even on a 64MHz 68060, I suspect that someone
> > is blocking interrupts for a bit too long.

Before this gets burried, does anybody have different ideas about the 4ms 
being a long time?

> This sort of problem usually manifests itself in two ways.  Either there's
> a critical section that blocks interrupts for too long, or there's a
> higher level interrupt handler that takes too long.  
> 
> Detecting the cause for the first case is quite simple.  Test for the
> delay in your interrupt handler and when you do break in to DDB.  The
> interrupt should have occured just after the instruction that re-enabled
> the interrupt, so all you need to do is find the appropriate trap frame
> and figure out what instruction that is.  The problem code should
> immediately precede that.

OK, tried to do this. What I did was to call a function toolate() when the
ISDN interrupt module found out that it had an overrun. The toolate() function
just called the debugger. Last night, all frames caught were of the form:

	_toolate(2680,0,0,80,0) + a6
	_isic_hscx_irq(268c00,80,1,0,2604) + 17a
	_isicintr(268c00,2000,4f,28dce0,2beddc4) + 4a
	_iifun(1,2000,0,1,1) + 120
	_intr_dispatch(20,0,154efc,15b3d4,20000000) + c0
	_intr_glue(?)
	_mi_switch(0,4,c,4006,0) + 4
	_tsleep(14d9cc,118,2ed98,0,dffeff0) + 12c
	_sys_select(2ae9db0,2bedf88,2bedf80) + 260
	_syscall(5d) + 126
	_trap0() + e

Only the arguments to _sys_select() were different each time. The test I
did was 'scp <remote>:/netbsd /dev/null' on  the console.
I have been doing something simular to this for a few weeks now and the
results seem to be consistent. About 90% of the problems originated from
mi_switch(). Something I don't really understand...

> The second case is more complicated.  You need to instrument the interrupt
> dispatch code to time each interrupt handler and break when it finds one
> that took too long.

I am brooding on this one, but shouldn't this be caught by the previous case?

> Finally, it may be an issue with acknowledging the interrupt to the H/W
> properly.  That I can't easily help you with.

Yes, true. I more or less came to the conclusion that I cannot prove right
or wrong without having additional equipment (analyser) that I just don't
have access to. But the fact that raising the IPL of the interrupt made
things better, I thought that to be (weak? )evidence of a software problem...

Leo.