Subject: kern/32626: ehci + umass + stress = umass stall cycle
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Martin S. Weber <Ephaeton@gmx.net>
List: netbsd-bugs
Date: 01/25/2006 14:00:00
>Number:         32626
>Category:       kern
>Synopsis:       ehci + umass + stress = umass stall cycle
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 25 14:00:00 +0000 2006
>Originator:     Martin S. Weber
>Release:        NetBSD 3.99.15
>Organization:
	Entropie Regensburg!
	
>Environment:

The "Via" machine:

uhci0 at pci0 dev 2 function 0: VIA Technologies VT83C572 USB Controller (rev. 0x61)
uhci0: interrupting at ioapic1 pin 2 (irq 11)
usb0 at uhci0: USB revision 1.0
uhci1 at pci0 dev 2 function 1: VIA Technologies VT83C572 USB Controller (rev. 0x61)
uhci1: interrupting at ioapic1 pin 3 (irq 11)
usb1 at uhci1: USB revision 1.0
ehci0 at pci0 dev 2 function 2: VIA Technologies VT8237 EHCI USB Controller (rev. 0x63)
ehci0: interrupting at ioapic1 pin 2 (irq 11)
ehci0: BIOS has given up ownership
ehci0: EHCI version 1.0
ehci0: companion controllers, 2 ports each: uhci0 uhci1
usb2 at ehci0: USB revision 2.0
ohci0 at pci0 dev 15 function 2: ServerWorks OSB4/CSB5 USB Host Controller (rev. 0x04)
ohci0: interrupting at ioapic1 pin 1 (irq 9)
ohci0: OHCI version 1.0, legacy support
usb3 at ohci0: USB revision 1.0

The "Intel" machine:

ehci0 at pci0 dev 29 function 7: Intel 82801EB/ER USB EHCI Controller (rev. 0x02)
ehci0: interrupting at ioapic0 pin 23 (irq 5)
ehci0: BIOS has given up ownership
ehci0: EHCI version 1.0
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2 uhci3
usb4 at ehci0: USB revision 2.0

The "umass":

umass0 at uhub5 port 3 configuration 1 interface 0
umass0: Genesys Logic USB TO IDE, rev 2.00/0.33, addr 3
umass0: using SCSI over Bulk-Only
scsibus0 at umass0: 2 targets, 1 lun per target
sd0 at scsibus0 target 0 lun 0: <IC25N080, ATMR04-0, 0811> disk fixed
sd0: fabricating a geometry
sd0: 76319 MB, 76319 cyl, 64 head, 32 sec, 512 bytes/sect x 156301488 sectors
sd0: fabricating a geometry


	
	
System: NetBSD circe.entropie.net 3.99.15 NetBSD 3.99.15 (CIRCE.LEAN) #0: Sat Jan 21 14:53:01 CET 2006 root@circe.entropie.net:/src/obj/sys/arch/i386/compile/CIRCE.LEAN i386
Architecture: i386
Machine: i386
>Description:
	
When stressing an umass hanging on an ehci, problems can occur once the drive
is stressed. Adding the ..broken_intr_workaround makes the problems harder to
reproduce, yet it is still possible.

Note: I'm talking about the intel controller down there. The workaround as
imported for VIA controllers was/is also active for this Intel controller
(via patched ehci_pci as suggested by Juan RP)!

Symptoms: At some point the umass beginns stalling and doesn't ever recover.
The stalls make it impossible for the system to sync the drive. Thus it is
also impossible to umount the drive. It's furthermore impossible to kill anything
trying to write on the drive (They're stuck in biowait / ffsync) - not even with
kill -9.

Shutting down the system cleanly is impossible because the fs cannot be synced /
umounted. Hitting ctrl+alt+esc while it struggles and trying a 'sync' by hand
makes the machine hang completely (i.e. hard powerdown required).

This is reproducible by both intel and via controllers (see dmesgs in Environment)

Output from an EHCIDEBUG + UMASSDEBUG kernel with ehcidebug = umassdebug = 0
on the machine with the Intel Controller:

umass0: BBB reset failed, TIMEOUT
umass0: BBB bulk-in clear stall failed, TIMEOUT
umass0: BBB bulk-out clear stall failed, TIMEOUT
umass0: BBB reset failed, TIMEOUT
umass0: BBB bulk-in clear stall failed, TIMEOUT
umass0: BBB bulk-out clear stall failed, TIMEOUT
umass0: BBB reset failed, TIMEOUT
umass0: BBB bulk-in clear stall failed, TIMEOUT
umass0: BBB bulk-out clear stall failed, TIMEOUT
umass0: BBB reset failed, TIMEOUT
umass0: BBB bulk-in clear stall failed, TIMEOUT
umass0: BBB bulk-out clear stall failed, TIMEOUT
....(ad nauseatum)

Output from an EHCIDEBUG + UMASSDEBUG kernel with ehcidebug = umassdebug = 1
on the machine with the Intel Controller (WITHOUT WORKAROUND):

ehci0 at pci0 dev 29 function 7: Intel 82801EB/ER USB EHCI Controller (rev. 0x02)
ehci0: offs=32
ehci0: interrupting at ioapic0 pin 23 (irq 5)
ehci_pci_attach: companion uhci0
ehci_pci_attach: companion uhci1
ehci_pci_attach: companion uhci2
ehci_pci_attach: companion uhci3
ehci_dump_caps: legsup=0x00000001 legctlsts=0xc0000000
ehci0: BIOS has given up ownership
ehci_init: start
ehci0: EHCI version 1.0
ehci_init: sparams=0x104208
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2 uhci3
ehci_init: cparams=0x6871
ehci0: resetting
ehci0: flsize=1024
QH(0xcaeafe40) at 0x01914e40:
  link=0x01914e42<QH>
  endp=0x0000a000
    addr=0x00 inact=0 endpt=0 eps=2 dtc=0 hrecl=1
    mpl=0x0 ctl=0 nrl=0
  endphub=0x00000000
    smask=0x00 cmask=0x00 huba=0x00 port=0 mult=0
  curqtd=0x00000001<T>
Overlay qTD:
  next=0x00000001<T> altnext=0x00000001<T>
  status=0x00000040: toggle=0 bytes=0x0 ioc=0 c_page=0x0
    cerr=0 pid=0 stat=0x40<HALTED>
  buffer[0]=0x00000000
  buffer[1]=0x00000000
  buffer[2]=0x00000000
  buffer[3]=0x00000000
  buffer[4]=0x00000000
usb4 at ehci0: USB revision 2.0
(...)
(mount)
ehci_alloc_sqtd_chain: start len=65536
....
(start writing [rsync])
ehci_alloc_sqtd_chain: start len=65536
....
(start reading, too [*] * 1)
umass0: BBB reset failed, IOERROR
ehci_device_clear_toggle: epipe=0xc178da80 status=0x8c008148
umass0: BBB bulk-in clear stall failed, IOERROR
ehci_device_clear_toggle: epipe=0xc178d080 status=0x8c00
umass0: BBB bulk-out clear stall failed, IOERROR
umass0: BBB reset failed, IOERROR
ehci_device_clear_toggle: epipe=0xc178da80 status=0x8c008148
umass0: BBB bulk-in clear stall failed, IOERROR
ehci_device_clear_toggle: epipe=0xc178d080 status=0x1f8049
umass0: BBB bulk-out clear stall failed, IOERROR
umass0: BBB reset failed, IOERROR
ehci_device_clear_toggle: epipe=0xc178da80 status=0x8c008148
umass0: BBB bulk-in clear stall failed, IOERROR
ehci_device_clear_toggle: epipe=0xc178d080 status=0x1f8049
umass0: BBB bulk-out clear stall failed, IOERROR
umass0: BBB reset failed, IOERROR
...

Output from an EHCIDEBUG + UMASSDEBUG kernel with ehcidebug = umassdebug = 1
on the machine with the Intel Controller (*WITH* WORKAROUND):

(mount, start writing [rsync])
(also have reader [*] * 2, 3 at times run)
(all fine so far. Ok Seems stressable. Stop those [*]s.)
ehci_intrlist_timeout
ehci_alloc_sqtd_chain: start len=65536

(and here xmms tries to read the "next" mp3 from the umass)

ehci_alloc_sqtd_chain: start len=65536
ehci_intrlist_timeout
...(about 50 follow)...
ehci_intrlist_timeout
ehci_timeout: exfer=0xc1654900
ehci_timeout_task: xfer=0xc1654900
ehci_abort_xfer: xfer=0xc1654900 pipe=0xc1660000
ehci_intr1: door bell
ehci_intrlist_timeout
...(more follow)...
ehci_intrlist_timeout
ehci_timeout: exfer=0xc168a200
ehci_intrlist_timeout
ehci_timeout_task: xfer=0xc168a200
ehci_abort_xfer: xfer=0xc168a200 pipe=0xc1660180
ehci_intr1: door bell
ehci_idone: aborted xfer=0xc168a200
umass0: BBB reset failed, TIMEOUT
ehci_device_clear_toggle: epipe=0xc1660000 status=0x6008d80
...
ehci_intrlist_timeout
ehci_timeout: exfer=0xc168af00
ehci_intrlist_timeout
ehci_timeout_task: xfer=0xc168af00
ehci_abort_xfer: xfer=0xc168af00 pipe=0xc1660180
ehci_intr1: door bell
ehci_idone: aborted xfer=0xc168af00
umass0: BBB bulk-in clear stall failed, TIMEOUT
ehci_device_clear_toggle: epipe=0xc1660380 status=0x8c00
ehci_intrlist_timeout
....
ehci_intrlist_timeout
ehci_timeout: exfer=0xc1654300
ehci_timeout_task: xfer=0xc1654300
ehci_abort_xfer: xfer=0xc1654300 pipe=0xc1660180
ehci_intr1: door bell
umass0: BBB bulk-out clear stall failed, TIMEOUT
ehci_intrlist_timeout




Basically the same happens with the VIA controller (although due to lack of
syncing I lost the debug output when I had to kill the machine).

I *do* have a dump of a kernel which could not shutdown cleanly (don't ask
how I got along in creating that one) which can be made available to
interested parties of course.

See also:
	http://mail-index.netbsd.org/current-users/2006/01/23/0014.html
	http://mail-index.netbsd.org/current-users/2006/01/24/0007.html

And
	kern/26681 (which probably should be re-opened)

>How-To-Repeat:
I have an umass with a lot of files after two dir hierarchies, and thus using:
(ksh speak)
stress() {
	x="$1"
	while :; do
		for i in */*/*; do
			[ $(( $RANDOM % 5 )) -eq 3 ] || continue;
			cat "$i" > /tmp/junk.$x
		done
		echo "Run $x done.."
		sleep $(( $RANDOM % 30 ))
	done
}

Run a couple of these stress()ers on the drive. Additionally I run the
two:

rsync -avz /usr/src .

in some subdir on the drive as well as

while :; do
	dd if=/dev/zero of=junk.img bs=1m count=10240 progress=1
	rm -f junk.img
done

Just let these running. After "some seconds" up to "some hours" the drive
will stumble.

Note: I am not sure whether the drive just is "broken" and is experiencing
something like a "**ide: lost interrupt". No matter, the system shouldn't
be down like that.
	
>Fix:
Uh, dunno.
	

>Unformatted:
 	..including the ehci_pci + ehci fixes to include the
 	BROKEN_INTR_WORKAROUND.