Subject: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Greg A. Woods <woods@planix.com>
List: netbsd-bugs
Date: 04/10/2005 19:36:00
>Number:         29936
>Category:       kern
>Synopsis:       isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Apr 10 19:36:00 +0000 2005
>Originator:     Greg A. Woods
>Release:        NetBSD 1.6.2_STABLE
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD 1.6.2_STABLE
Architecture: alpha
Machine: alpha

isp(4) from -current:

ic/isp.c:
     $NetBSD: isp.c,v 1.106 2005/02/27 00:27:01 perry Exp $
     $NetBSD: isp.c,v 1.106 2005/02/27 00:27:01 perry Exp $

ic/isp_inline.h:
     $NetBSD: isp_inline.h,v 1.25 2005/02/27 00:27:01 perry Exp $

ic/isp_ioctl.h:
     $NetBSD: isp_ioctl.h,v 1.6 2005/02/27 00:27:01 perry Exp $

ic/isp_netbsd.c:
     $NetBSD: isp_netbsd.c,v 1.65 2005/02/27 00:27:01 perry Exp $
     $NetBSD: isp_netbsd.c,v 1.65 2005/02/27 00:27:01 perry Exp $

ic/isp_netbsd.h:
     $NetBSD: isp_netbsd.h,v 1.53 2005/02/27 00:27:01 perry Exp $

ic/isp_target.c:
     $NetBSD: isp_target.c,v 1.27 2005/02/27 00:27:01 perry Exp $
     $NetBSD: isp_target.c,v 1.27 2005/02/27 00:27:01 perry Exp $

ic/isp_target.h:
     $NetBSD: isp_target.h,v 1.21 2003/12/04 13:57:30 keihan Exp $

ic/isp_tpublic.h:
     $NetBSD: isp_tpublic.h,v 1.13 2005/02/27 00:27:01 perry Exp $

ic/ispmbox.h:
     $NetBSD: ispmbox.h,v 1.48 2005/02/27 00:27:01 perry Exp $

ic/ispreg.h:
     $NetBSD: ispreg.h,v 1.29 2003/12/04 13:57:30 keihan Exp $

ic/ispvar.h:
     $NetBSD: ispvar.h,v 1.62 2005/02/27 00:27:01 perry Exp $

pci/isp_pci.c:
     $NetBSD: isp_pci.c,v 1.92 2005/02/27 00:27:33 perry Exp $
     $NetBSD: isp_pci.c,v 1.92 2005/02/27 00:27:33 perry Exp $

sbus/isp_sbus.c:
     $NetBSD: isp_sbus.c,v 1.53 2002/05/18 00:48:11 mjacob Exp $
     $NetBSD: isp_sbus.c,v 1.53 2002/05/18 00:48:11 mjacob Exp $

microcode/isp/asm_1040.h:
     $NetBSD: asm_1040.h,v 1.3 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_1080.h:
     $NetBSD: asm_1080.h,v 1.3 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_12160.h:
     $NetBSD: asm_12160.h,v 1.5 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_2100.h:
     $NetBSD: asm_2100.h,v 1.6 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_2200.h:
     $NetBSD: asm_2200.h,v 1.6 2005/02/27 00:27:24 perry Exp $

microcode/isp/asm_2300.h:
     $NetBSD: asm_2300.h,v 1.6 2005/02/27 00:27:24 perry Exp $

microcode/isp/asm_sbus.h:
     $NetBSD: asm_sbus.h,v 1.18 2005/02/27 00:27:24 perry Exp $

microcode/isp/isp_1000.bin:
ident warning: no id keywords in microcode/isp/isp_1000.bin

microcode/isp/isp_1040.bin:
ident warning: no id keywords in microcode/isp/isp_1040.bin

microcode/isp/isp_1080.bin:
ident warning: no id keywords in microcode/isp/isp_1080.bin

microcode/isp/isp_12160.bin:
ident warning: no id keywords in microcode/isp/isp_12160.bin

microcode/isp/isp_2100.bin:
ident warning: no id keywords in microcode/isp/isp_2100.bin

microcode/isp/isp_2200.bin:
ident warning: no id keywords in microcode/isp/isp_2200.bin

microcode/isp/isp_2300.bin:
ident warning: no id keywords in microcode/isp/isp_2300.bin


>Description:

	What a way to spoil a beautiful bright sunny spring weekend.

	A filesystem hangs bringing the system into a state of very
	reduced functionality and the console says:

	    isp1: unable to load DMA (35)
	    sd6(isp1:0:1:0): adapter resource shortage

	This is from an alphaserver es40 with twin Qlogic ISP 2312 HBAs
	connected to an Apple Xserve RAID array.

	The isp(4) driver is from -current a few weeks ago, but the rest
	of the kernel is essentially from the head of the netbsd-1-6
	branch.

	This also seems to affect the adaptec card to which the root
	disks are attached as "root" cannot login on the console, though
	existing processes continue to run (e.g. shells and sshds), as
	does most of Cyrus IMAP thus allowing users to continue to fetch
	mail (through the other still running isp(4) controller).  Only
	/home and /var/log, on sd6, are known to be frozen, though this
	doesn't really explain why root can't login on the console
	(unless the attempt to write the login record to /var/log/wtmp
	is what's stopping that).

	A manual halt doesn't seem to get anywhere:

	    RMC>halt in
	    
	    Returning to COM port
	    
	    halted CPU 0
	    CPU 1 is not halted
	    CPU 2 is not halted
	    CPU 3 is not halted
	    
	    halt code = 1
	    operator initiated halt
	    PC = fffffc0000300750
	    P00>>>cont
	    
	    continuing CPU 0
	    CP - RESTORE_TERM routine to be called
	    panic: user requested console halt
	    Begin traceback...
	    alpha trace requires known PC =eject=
	    End traceback...
	    syncing disks... 
	    CPU 0: fatal kernel trap:
	    
	    CPU 0    trap entry = 0x2 (memory management fault)
	    CPU 0    a0         = 0x1a4
	    CPU 0    a1         = 0x1
	    CPU 0    a2         = 0x0
	    CPU 0    pc         = 0xfffffc000042ccac
	    CPU 0    ra         = 0xfffffc00003a5a18
	    CPU 0    pv         = 0xfffffc000042c9e0
	    CPU 0    curproc    = 0x0
	    
	    panic: trap
	    Begin traceback...
	    alpha trace requires known PC =eject=
	    End traceback...
	    cpu3: shutting down...
	    cpu2: shutting down...
	    cpu1: shutting down...
	    
	    CPU 0: fatal kernel trap:
	    
	    CPU 0    trap entry = 0x2 (memory management fault)
	    CPU 0    a0         = 0x1a4
	    CPU 0    a1         = 0x1
	    CPU 0    a2         = 0x0
	    CPU 0    pc         = 0xfffffc00004a4764
	    CPU 0    ra         = 0xfffffc00004a65b0
	    CPU 0    pv         = 0xfffffc00004a4700
	    CPU 0    curproc    = 0x0
	    
	    panic: trap
	    Begin traceback...
	    alpha trace requires known PC =eject=
	    End traceback...

	The machine is still hung at this point and another toggle (and
	a half) of the virtual halt button is necessary before I can
	force it to boot from the SRM firmware.

	The system has run for months in test with no problem like this,
	sometimes moving several gigabytes of data between the RAID
	partitions.  It then ran for almost five days in production as a
	mail and web server before suddenly hanging like this during a
	relatively quite period of activity late on Friday afternoon.
	It ran another day and a half and now sometime this (Sunday)
	morning has done it a second time; GRRR, and while I type this
	something similar seems to be happenning though this time
	there's no response from any running process, no ping response,
	no console response.  Time for a cold boot.  OK, back up and
	running but still re-sync'ing the root mirror....

	Could this be a driver bug, an Xserve RAID bug, a hardware
	problem, or what?

NetBSD 1.6.2_STABLE (TSUNAMI.MP) #21: Sat Apr  2 06:35:47 EST 2005
    woods@whats:/build/woods/whats/NetBSD-1.6.x-alpha-alpha-21164a-obj/building/work/woods/m-NetBSD-1.6/sys/arch/alpha/compile/TSUNAMI.MP
AlphaServer ES40, 666MHz, s/n NI94900217
8192 byte page size, 4 processors.
total memory = 16384 MB
(7080 KB reserved for PROM, 16377 MB used by NetBSD)
avail memory = 14236 MB
using 16384 buffers containing 1637 MB of memory
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21264A-14
cpu0: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu1 at mainbus0: ID 1, 21264A-14
cpu1: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu2 at mainbus0: ID 2, 21264A-14
cpu2: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu3 at mainbus0: ID 3, 21264A-14
cpu3: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
tsc0 at mainbus0: 21272 Core Logic Chipset, Cchip rev 0
tsc0: 8 Dchips, 2 memory buses of 32 bytes
tsc0: arrays present: 4096MB (split), 4096MB (split), 4096MB (split), 4096MB (split), Dchip 0 rev 1
tsp0 at tsc0
pci0 at tsp0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
vga0 at pci0 dev 1 function 0: ATI Technologies 3D Rage II+ (rev. 0x9a)
pci_mem_find: void region
pci_mem_find: void region
pci_mem_find: void region
wsdisplay0 at vga0 (kbdmux ignored)
ahc0 at pci0 dev 2 function 0
ahc0: interrupting at dec 6600 irq 12
ahc0: aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc0: 16 targets, 8 luns per target
ahc1 at pci0 dev 2 function 1
ahc1: interrupting at dec 6600 irq 13
ahc1: aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
scsibus1 at ahc1: 16 targets, 8 luns per target
isp0 at pci0 dev 3 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp0: interrupting at dec 6600 irq 16
isp0: bad execution throttle of 0- using 16
scsibus2 at isp0: 256 targets, 8 luns per target
tlp0 at pci0 dev 4 function 0: DECchip 21143 Ethernet, pass 3.0
tlp0: interrupting at dec 6600 irq 20
tlp0: DEC DE500-BA, Ethernet address 08:00:2b:c4:b5:26
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
sio0 at pci0 dev 7 function 0: Acer Labs M1543 PCI-ISA Bridge (rev. 0xc3)
Acer Labs M5229 UDMA IDE Controller (IDE mass storage, interface 0xfa, revision 0xc1) at pci0 dev 15 function 0 not configured
Acer Labs M5237 USB 1.1 Host Controller (USB serial bus, interface 0x10, revision 0x03) at pci0 dev 19 function 0 not configured
isa0 at sio0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
tsp1 at tsc0
pci1 at tsp1 bus 0
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
siop0 at pci1 dev 2 function 0: Symbios Logic 53c895 (ultra2-wide scsi)
siop0: using on-board RAM
siop0: interrupting at dec 6600 irq 28
scsibus3 at siop0: 16 targets, 8 luns per target
wm0 at pci1 dev 3 function 0: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm0: interrupting at dec 6600 irq 32
wm0: Ethernet address 00:04:23:a8:79:28
wm0: 1000baseSX, 1000baseSX-FDX, auto
wm1 at pci1 dev 3 function 1: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm1: interrupting at dec 6600 irq 33
wm1: Ethernet address 00:04:23:a8:79:28
wm1: 1000baseSX, 1000baseSX-FDX, auto
isp1 at pci1 dev 4 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp1: interrupting at dec 6600 irq 36
isp1: bad execution throttle of 0- using 16
scsibus4 at isp1: 256 targets, 8 luns per target
tlp1 at pci1 dev 5 function 0: DECchip 21143 Ethernet, pass 3.0
tlp1: interrupting at dec 6600 irq 40
tlp1: DEC DE500-BA, Ethernet address 08:00:2b:c4:7a:70
tlp1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
bge0 at pci1 dev 6 function 0: Broadcom BCM5703X Gigabit Ethernet
bge0: interrupting at dec 6600 irq 44
bge0: ASIC BCM5703 A2, Ethernet address 00:08:02:91:89:ae
brgphy0 at bge0 phy 1: BCM5703 1000BASE-T media interface, rev. 2
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
scsibus0: waiting 2 seconds for devices to settle...
cd0 at scsibus0 target 4 lun 0: <TOSHIBA, CD-ROM XM-5701TA, 0557> SCSI2 5/cdrom removable
cd0: sync (100.0ns offset 8), 8-bit (10.000MB/s) transfers
scsibus1: waiting 2 seconds for devices to settle...
sd0 at scsibus1 target 0 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd0: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd0: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd1 at scsibus1 target 1 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd1: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd1: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd2 at scsibus1 target 2 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd2: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd2: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd3 at scsibus1 target 4 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd3: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd3: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
sd4 at scsibus1 target 5 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd4: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd4: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
scsibus2: waiting 2 seconds for devices to settle...
sd5 at scsibus2 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd5: 1168 GB, 149605 cyl, 128 head, 128 sec, 512 bytes/sect x 2451128320 sectors
scsibus2 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus3: waiting 2 seconds for devices to settle...
scsibus4: waiting 2 seconds for devices to settle...
sd6 at scsibus4 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd6: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
sd7 at scsibus4 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd7: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
scsibus4 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
Kernelized RAIDframe activated
RAIDframe: Searching for RAID components...
RAIDframe: Component on: sd0a: 71132000
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 1412893 Mod Counter: 232
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 71131904
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
RAIDframe: Component on: sd1a: 71132000
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 1412893 Mod Counter: 232
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 71131904
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
RAIDframe: the sd device sd2a has invalid RAID component label
RAIDframe: attempting to find a root-enabled RAID set...
RAIDframe: Found: sd1a at 0
RAIDframe: Found: sd0a at 1
RAIDframe autoconfigure
RAIDframe: Configuring raid0:
raid0: RAID Level 1
raid0: Components: /dev/sd1a /dev/sd0a
raid0: Total Sectors: 71131904 (34732 MB)
root on raid0a dumps on raid0b

	Note that there was a very large mess to clean up in all the
	filesystems after the first hang as they were all mounted with
	softdep.

	Softdep is definitely _not_ safe for production if there's any
	risk whatsoever of any crash.  All recently created files were
	moved to lost+found directories and with something like a busy
	Cyrus IMAP server that's as good as losing them since their
	positions in mailbox indexes cannot be recovered.

	Softdep is no longer in use and will be removed from my
	production kernel for the forseable future.

	Luckily the Xserve RAID has its full 512MB of cache RAM and
	everything's on a good solid UPS with a generator behind it, so
	the performance reduction will not be so perceivable.

>How-To-Repeat:

>Fix:

	maybe I should at least turn that printf() into a panic()?

	(but then it will certainly just hang at syncing disks...)

-- 
My first fortune-of-the-day after the second reboot today:

Screw up your courage!  You've screwed up everything else.