Subject: isp(4) with Q-Logic 2340 suffers "stray interrupts" under load
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.org>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 03/26/2005 17:05:46
[ On Saturday, March 26, 2005 at 12:05:56 (-0700), Michael L. Hitch wrote: ]
> Subject: Re: does "stray ... irq; stopped logging" also mean stopped interrupts?
>
> Oh, then what about this:
> 
>                 if (ALPHA_SHARED_INTR_DISABLE(dec_6600_pci_intr, irq))
>                         dec_6600_intr_disable(irq);
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yeah, I spotted that too just shortly after posting, but by then my
brain was too tired to do anything about it....  :-)


OK, so, as y'all probably know already, I'm working on an ES40 with,
amongst other things, dual Q-Logic 2340 cards (Q Logic product 0x2312,
dual-channel FC-AL 2Gbps HBA) hooked up to an Apple Xserve RAID box.
The system is running something approximating NetBSD-1.6.x.

Everything about these cards and the RAID array has been working very
well so far, with quite good performance given the circumstances.
(wish I could say the same about Gig-E cards! :-)

I have sometimes noticed the very occasional "stray 6600 irq 36" message
on the console though (where the IRQ number refers to one of the isp(4)
devices).

However when I started to do some rather intensive shuffling about of
many small files on the system then suddenly the frequency of these
stray IRQ messages went up until enough had happened to hit the default
maximum of 5, at which point this IRQ was turned off and as you can
expect the system came to a rather messy "halt" of sorts (with all sorts
of unrelated processes stuck in vnlock).

So hoping that I could avoid the inevitable "have you tried -current?"
kinds of questions, and because I saw a number of ISR related fixes
specific to "23xx" cards in the commit logs, but still wanting to stay
as close as possible to my soon-to-be-production environment, I
back-ported the -current isp(4) driver to 1.6.x (a simple matter of
updating sys/dev/ic/isp* and sys/dev/pci/isp* it seems, with the
exception of the expected CFATTACH_DECL() patch in the PCI attachment,
(why can't ethernet drivers be so easy to backport?)) and booted it up
this afternoon for another test, but as you'll see in the dmesg output
below the new driver did not solve the problem -- in fact it has made no
appreciable change at all.

Any clues or assistance would be very much appreciated.  Moving files
around into place is the last step to putting this machine into
production, and we're rather eager to do that ASAP.  Remote console
access can be made available if necessary, and I can even build and boot
and test -current kernels, provided they won't damage my filesystems!
(any more than my 1.6.x kernels might ;-)

My only idea at the moment is to "fix" the interrupt disabling logic so
that the stray counter is reset (and thus the maxstrays is never reached
and interrupts remain enabled), after any subsequent interrupt is
successfully handled by a driver.  In my way of thinking so far I don't
see any point to disabling an IRQ if the driver is handing most (or even
enough) of them.  I can certainly live with the occasional stray so long
as things keep working well enough, which seems to be the case with this
particular driver and device.  I don't even need to stop logging the
strays so long as they only happen occasionally under heavy load
(i.e. so long as they don't drag down the system with excessive console
and syslog I/O).  The other obvious alternative hack would be to just
remove the dec_6600_intr_disable() shown above.


NetBSD 1.6.2_STABLE (TSUNAMI.MP) #16: Sat Mar 26 15:48:09 EST 2005
    woods@whats:/build/woods/whats/NetBSD-1.6.x-alpha-alpha-21164a-obj/building/work/woods/m-NetBSD-1.6/sys/arch/alpha/compile/TSUNAMI.MP
AlphaServer ES40, 666MHz, s/n NI94900217
8192 byte page size, 4 processors.
total memory = 16384 MB
(7080 KB reserved for PROM, 16377 MB used by NetBSD)
avail memory = 14237 MB
using 12288 buffers containing 1637 MB of memory
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21264A-14
cpu0: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu1 at mainbus0: ID 1, 21264A-14
cpu1: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu2 at mainbus0: ID 2, 21264A-14
cpu2: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu3 at mainbus0: ID 3, 21264A-14
cpu3: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
tsc0 at mainbus0: 21272 Core Logic Chipset, Cchip rev 0
tsc0: 8 Dchips, 2 memory buses of 32 bytes
tsc0: arrays present: 4096MB (split), 4096MB (split), 4096MB (split), 4096MB (split), Dchip 0 rev 1
tsp0 at tsc0
pci0 at tsp0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
vga0 at pci0 dev 1 function 0: ATI Technologies 3D Rage II+ (rev. 0x9a)
pci_mem_find: void region
pci_mem_find: void region
pci_mem_find: void region
wsdisplay0 at vga0 (kbdmux ignored)
ahc0 at pci0 dev 2 function 0
ahc0: interrupting at dec 6600 irq 12
ahc0: aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc0: 16 targets, 8 luns per target
ahc1 at pci0 dev 2 function 1
ahc1: interrupting at dec 6600 irq 13
ahc1: aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
scsibus1 at ahc1: 16 targets, 8 luns per target
isp0 at pci0 dev 3 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp0: interrupting at dec 6600 irq 16
isp0: bad execution throttle of 0- using 16
scsibus2 at isp0: 256 targets, 8 luns per target
tlp0 at pci0 dev 4 function 0: DECchip 21143 Ethernet, pass 3.0
tlp0: interrupting at dec 6600 irq 20
tlp0: DEC DE500-BA, Ethernet address 08:00:2b:c4:b5:26
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
sio0 at pci0 dev 7 function 0: Acer Labs M1543 PCI-ISA Bridge (rev. 0xc3)
Acer Labs M5229 UDMA IDE Controller (IDE mass storage, interface 0xfa, revision 0xc1) at pci0 dev 15 function 0 not configured
Acer Labs M5237 USB 1.1 Host Controller (USB serial bus, interface 0x10, revision 0x03) at pci0 dev 19 function 0 not configured
isa0 at sio0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
tsp1 at tsc0
pci1 at tsp1 bus 0
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
siop0 at pci1 dev 2 function 0: Symbios Logic 53c895 (ultra2-wide scsi)
siop0: using on-board RAM
siop0: interrupting at dec 6600 irq 28
scsibus3 at siop0: 16 targets, 8 luns per target
wm0 at pci1 dev 3 function 0: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm0: interrupting at dec 6600 irq 32
wm0: Ethernet address 00:04:23:a8:79:28
wm0: 1000baseSX, 1000baseSX-FDX, auto
wm1 at pci1 dev 3 function 1: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm1: interrupting at dec 6600 irq 33
wm1: Ethernet address 00:04:23:a8:79:28
wm1: 1000baseSX, 1000baseSX-FDX, auto
isp1 at pci1 dev 4 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp1: interrupting at dec 6600 irq 36
isp1: bad execution throttle of 0- using 16
scsibus4 at isp1: 256 targets, 8 luns per target
tlp1 at pci1 dev 5 function 0: DECchip 21143 Ethernet, pass 3.0
tlp1: interrupting at dec 6600 irq 40
tlp1: DEC DE500-BA, Ethernet address 08:00:2b:c4:7a:70
tlp1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
bge0 at pci1 dev 6 function 0: Broadcom BCM5703X Gigabit Ethernet
bge0: interrupting at dec 6600 irq 44
bge0: ASIC BCM5703 A2, Ethernet address 00:08:02:91:89:ae
brgphy0 at bge0 phy 1: BCM5703 1000BASE-T media interface, rev. 2
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
scsibus0: waiting 2 seconds for devices to settle...
cd0 at scsibus0 target 4 lun 0: <TOSHIBA, CD-ROM XM-5701TA, 0557> SCSI2 5/cdrom removable
cd0: sync (100.0ns offset 8), 8-bit (10.000MB/s) transfers
scsibus1: waiting 2 seconds for devices to settle...
sd0 at scsibus1 target 0 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd0: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd0: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd1 at scsibus1 target 1 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd1: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd1: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd2 at scsibus1 target 2 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd2: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd2: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd3 at scsibus1 target 4 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd3: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd3: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
sd4 at scsibus1 target 5 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd4: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd4: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
scsibus2: waiting 2 seconds for devices to settle...
sd5 at scsibus2 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd5: 1168 GB, 149605 cyl, 128 head, 128 sec, 512 bytes/sect x 2451128320 sectors
scsibus2 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus3: waiting 2 seconds for devices to settle...
scsibus4: waiting 2 seconds for devices to settle...
sd6 at scsibus4 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd6: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
sd7 at scsibus4 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd7: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
scsibus4 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
Kernelized RAIDframe activated
RAIDframe: Searching for RAID components...
RAIDframe: Component on: sd0a: 71132000
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 1412893 Mod Counter: 186
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 71131904
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
RAIDframe: Component on: sd1a: 71132000
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 1412893 Mod Counter: 186
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 71131904
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
RAIDframe: the sd device sd2a has invalid RAID component label
RAIDframe: attempting to find a root-enabled RAID set...
RAIDframe: Found: sd1a at 0
RAIDframe: Found: sd0a at 1
RAIDframe autoconfigure
RAIDframe: Configuring raid0:
raid0: RAID Level 1
raid0: Components: /dev/sd1a /dev/sd0a
raid0: Total Sectors: 71131904 (34732 MB)
root on raid0a dumps on raid0b
IP Filter: v3.4.29 initialized.  Default = pass all, Logging = enabled
stray 6600 irq 36
stray 6600 irq 36
stray 6600 irq 36


Note that this time I stopped the file shuffling before the fifth stray
IRQ warning and thus saved myself the hassle of a hard reset and fsck.  :-)

BTW, those "offline not configured" messages about the additional LUNs
on the Xserve RAID targets appear to have first showed up after we
upgraded the array's firmware to the current 1.26 version.

-- 
						Greg A. Woods

H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>