Subject: Re: bge/ahd interrupt problems
To: Frank van der Linden <fvdl@netbsd.org>
From: Edgar Fuß <ef@math.uni-bonn.de>
List: port-amd64
Date: 03/25/2007 16:04:38
> Let me know what you see.
OK, here is what I see with DDB when the server is in the "locked up" state:

Lots of
  ahd1: Timedout SCB already complete. Interrupts may not be functioning.

cpuvar:
  pending: 40000000
  level: D
  depth: 1

ioapic RDRs (redirection registers, i.e. write 2*i+0x10 to REG, read DATA):
  ioapic1 (the one ahd1 is on):
    0: E063
    1: E064 (i.e., the receipt bit is on)
    2, 3 disabled (10000)
  ioapic2:
    0,1,2,3: disabled
  ioapic0:
    00: 0700
    01: 0090
    03: 00D1
    04: 00D0
    09: A0A0
    0E: 0061
    0F: 0062
    12: A070
    13: A060
    rest disabled.
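
For reference, here is a rough sketch of how the low dwords listed above can be decoded. It assumes the standard Intel 82093AA redirection entry layout applies to these IOAPICs, which I have not verified for this chipset. The function names are made up for illustration. Under that assumption, bit 14 (Remote IRR) is presumably the "receipt bit" mentioned above.

```python
# Low 32 bits of an I/O APIC redirection table entry, using the bit layout
# documented for the Intel 82093AA (assumed, not verified for this board).
DELIVERY_MODES = {
    0: "fixed", 1: "lowest-priority", 2: "SMI",
    4: "NMI", 5: "INIT", 7: "ExtINT",
}


def redir_low_index(pin):
    """Index to write to the select register (REG) for the low dword of a pin."""
    return 0x10 + 2 * pin


def decode_redir_low(value):
    """Split the low dword of a redirection entry into named fields."""
    return {
        "vector": value & 0xFF,
        "delivery_mode": DELIVERY_MODES.get((value >> 8) & 0x7, "reserved"),
        "logical_dest": bool(value & (1 << 11)),
        "delivery_pending": bool(value & (1 << 12)),
        "active_low": bool(value & (1 << 13)),
        "remote_irr": bool(value & (1 << 14)),
        "level_triggered": bool(value & (1 << 15)),
        "masked": bool(value & (1 << 16)),
    }


if __name__ == "__main__":
    # Values copied from the DDB readout in this mail.
    for label, value in [("ioapic1 pin 0", 0xE063),
                         ("ioapic1 pin 1", 0xE064),
                         ("ioapic1 pin 2", 0x10000)]:
        print(label, hex(value), decode_redir_low(value))
```

Running it on the ioapic1 values prints the vector, trigger mode, and Remote IRR state for each pin, which is easier to compare than the raw hex.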

Setting a breakpoint on ahd_intr:
  It looks like I am only getting interrupts for ahd0 and none for ahd1.

I can also inspect ci_isources, but that doesn't make sense as long as
either I misunderstand which one should be handling the interrupt or
there is indeed confusion wrt. multiple IOAPICs.

I still have the machine more or less untouched (i.e. it still complains
about ahd1 timeouts). But I will now leave the server cellar in favour
of a bicycle ride. I can return later today if someone wants me to inspect
further hardware registers. Otherwise, I'll try to save the RAID parity and
try to get a dump. Then, I'll probably run the torture test with a non-IOAPIC
kernel.

Thanks for any hints on what's going on.
I would really like this solved next week.