port-sparc64: Re: SCSI reset hang on sparc64 1.5X and sunos binary

Subject: Re: SCSI reset hang on sparc64 1.5X and sunos binary
To: None <eeh@netbsd.org>
From: David Brownlee <abs@netbsd.org>
List: port-sparc64
Date: 09/23/2001 23:40:11
	Update on this - I believe the SCSI error to be a false alarm.
	The hang happened again with no reference to esp in dmesg, and
	the same netbsd32_read()...soreceive() in the syscall trace.
	Looks like there might be something suspect in the netbsd32
	socket code?

-- 
		David/absolute		-- www.netbsd.org: No hype required --


On 22 Sep 2001 eeh@netbsd.org wrote:

> | 	I've just had a couple of hangs on my Ultra1 (the first time
> | 	it's misbehaved under 1.5X).
> |
> | 	Both times were while running the SunOS 4.x netscape binary, and
> | 	both resulted in a complete hang short of L1+A.
> |
> | 	The last dmesg entries were:
> |
> | esp0: error:
> | csr=b2930a13<INT,ERR,DRAINING=0,IEN,ENDMA,DSBL_SCSI_DRN,BURST=0,TCI
> | esp0: DMA error; resetting
> | esp0: !TC on DATA XFER [intr 10, stat 87, step 4] prevphase 101, resid 1f0
> | esp0: waiting for SCSI Bus Reset to happen
> |
> | 	and trace reports (all via hamfisted c&p):
> |
> | zsc_intr_hard()
> | zshard()
> | intr_list_handler()
> | sparc_intr_retry(5a35ec8f, 0, 5a35ec8c, 2182620, 0, ffffffff) at sparc_intr_retry+0x48
> | soreceive(216de80, 1854040, eb85ac0, 2182620, 0, 18540b8) at soreceive+0x7b4
> | soo_read(893d70, e893da0, eb85ac0, 2170d00, 1, 104f420) at soo_read+0x20
> | dofileread(e8a16b0, b, 8903d70, fffffffd, fffffffd, e893da0) at dofileread+0x8c
> | sys_read(e8a16b0,  eb85c80, eb85dc0, fffffffd, 0, ffffb090) at sys_read+0x58
> | netbsd32_read(e8a16b0, eb85dd0, eb85dc0, 1152760, 800, 1) at netbsd32_read+0x24
> | syscall(eb85ed0, 3, 0, 40c515d4, 0, 775) at syscall+0x304
> | syscall_setup(b, ffffb68b, fffffffd, 23, f75800, 1) at syscall_setup+0x12c
> |
> | 	I'm running two wide SUN2.1G disks, and disk activity was light
> | 	(I've run both into the ground for sustained periods without incident
> | 	otherwise).
> |
> | 	Does anyone have any thoughts on what might be up, or any additional
> | 	information I could get which might help?
>
> Well, it appears you're getting some sort of DMA error.  This is usually
> caused by either DMA to a page that's not mapped by the IOMMU or an error
> from the memory controller.
>
> I'd suggest running diagnostics on your memory subsystem.  If that checks
> out, insert a breakpoint in lsi64854_scsi_intr() where it prints out the
> DMA error message and then enter the PROM and dump the iommu fault status
> and fault address registers.  Alternatively, you can add async fault interrupt
> handlers to the sysio driver similar to the ones in psycho.
>
> Eduardo
>