Subject: Re: measuring interrupt latency etc.
To: Leo Weppelman <leo@wau.mis.ah.nl>
From: Eduardo Horvath <eeh@turbolinux.com>
List: tech-kern
Date: 05/03/2000 08:49:31
On Wed, 3 May 2000, Leo Weppelman wrote:

> On Tue, May 02, 2000 at 09:06:20AM -0700, Eduardo Horvath wrote:
> > On Tue, 2 May 2000, Leo Weppelman wrote:

> > This sort of problem usually manifests itself in two ways.  Either there's
> > a critical section that blocks interrupts for too long, or there's a
> > higher level interrupt handler that takes too long.  
> > 
> > Detecting the cause for the first case is quite simple.  Test for the
> > delay in your interrupt handler and when you do break in to DDB.  The
> > interrupt should have occured just after the instruction that re-enabled
> > the interrupt, so all you need to do is find the appropriate trap frame
> > and figure out what instruction that is.  The problem code should
> > immediately precede that.
> 
> OK, tried to do this. What I did was to call a function toolate() when the
> ISDN interrupt module found out that it had an overrun. The toolate() function
> just called the debugger. Last night, all frames caught were of the form:
> 
> 	_toolate(2680,0,0,80,0) + a6
> 	_isic_hscx_irq(268c00,80,1,0,2604) + 17a
> 	_isicintr(268c00,2000,4f,28dce0,2beddc4) + 4a
> 	_iifun(1,2000,0,1,1) + 120
> 	_intr_dispatch(20,0,154efc,15b3d4,20000000) + c0
> 	_intr_glue(?)
> 	_mi_switch(0,4,c,4006,0) + 4
> 	_tsleep(14d9cc,118,2ed98,0,dffeff0) + 12c
> 	_sys_select(2ae9db0,2bedf88,2bedf80) + 260
> 	_syscall(5d) + 126
> 	_trap0() + e
> 
> Only the arguments to _sys_select() were different each time. The test I
> did was 'scp <remote>:/netbsd /dev/null' on  the console.
> I have been doing something simular to this for a few weeks now and the
> results seem to be consistent. About 90% of the problems originated from
> mi_switch(). Something I don't really understand...

mi_switch() does not do any spl() stuff so it's not blocking
interrupts.  And it lools like you took the interrupt at the very
beginning of mi_switch() so cpu_switch() isn't the culprit either.

I think what's happening is that the user process is waiting for input at
the time the problem occurs and you're just seeing a statistical
anomaly.  This test indicates that the problem lies elsewhere.

> > The second case is more complicated.  You need to instrument the interrupt
> > dispatch code to time each interrupt handler and break when it finds one
> > that took too long.
> 
> I am brooding on this one, but shouldn't this be caught by the previous case?

No.  The first case is where some section of code prevents your interrupt
from pre-empting it by disabling interrupts.  The second case is where
your code starts running and itself is preempted by something else.

> > Finally, it may be an issue with acknowledging the interrupt to the H/W
> > properly.  That I can't easily help you with.
> 
> Yes, true. I more or less came to the conclusion that I cannot prove right
> or wrong without having additional equipment (analyser) that I just don't
> have access to. But the fact that raising the IPL of the interrupt made
> things better, I thought that to be (weak? )evidence of a software problem...

You don't need an analyzer, just instrument some code.  

The following patch should instrument regular interrupts (I have not tested
it, not having an Atari) but there are others that are dispatched directly
from locore.s that this will not catch.  I chose to use microtime() since it
should exist in all ports and have microsecond resolution.  Unfortunately it's
usually not the most efficient routine.  If you have a better (preferably
hardware) clock source, by all means use it.

For those someone who knows more about the architecture, primarily the system
clock, will need to do the coding.

Index: sys/arch/atari/atari/intr.c
===================================================================
RCS file: /cvsroot/syssrc/sys/arch/atari/atari/intr.c,v
retrieving revision 1.7
diff -c -r1.7 intr.c
*** intr.c      1999/09/17 19:59:40     1.7
--- intr.c      2000/05/03 15:40:25
***************
*** 309,317 ****
        ih->ih_intrcnt[0]++;
  
        /* Give all the handlers a chance. */
!       for ( ; ih != NULL; ih = ih->ih_link.le_next)
                handled |= (*ih->ih_fun)((ih->ih_type & ARG_CLOCKFRAME)
                                        ? &frame : ih->ih_arg, frame.cf_sr);
  
        if (handled)
            straycount = 0;
--- 309,324 ----
        ih->ih_intrcnt[0]++;
  
        /* Give all the handlers a chance. */
!       for ( ; ih != NULL; ih = ih->ih_link.le_next) {
!               struct timeval start, end;
! 
!               microtime(&start);
                handled |= (*ih->ih_fun)((ih->ih_type & ARG_CLOCKFRAME)
                                        ? &frame : ih->ih_arg, frame.cf_sr);
+               microtime(&end);
+               if (end.tv_usec - start.tv_usec < 1000 /* 1 milisec */)
+                       printf("Intr took too long: %p\n", ih->ih_fun);
+       }
  
        if (handled)
            straycount = 0;



Eduardo Horvath