Subject: kern/8474: UDMA problems may hang machine
To: None <gnats-bugs@gnats.netbsd.org>
From: None <bgrayson@ece.utexas.edu>
List: netbsd-bugs
Date: 09/22/1999 19:35:51
>Number:         8474
>Category:       kern
>Synopsis:       CRC checks/downgrading isn't bulletproof?
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Sep 22 19:35:00 1999
>Last-Modified:
>Originator:     Brian Grayson
>Organization:
	Parallel and Distributed Systems
	Electrical and Computer Engineering
	The University of Texas at Austin
>Release:        Sep 20, 1999
>Environment:
NetBSD k9.ece.utexas.edu 1.4K NetBSD 1.4K (K93) #31: Wed Sep 22 01:55:44 CDT 1999 bgrayson@k9.ece.utexas.edu:/home/src/sys/arch/i386/compile/K93 i386

>Description:
	I bought a 13GB UDMA drive over the weekend.  On my cheap
	Socket 7 motherboard, it gets a lot of interface CRC errors,
	after which the driver downgrades the drive to PIO mode.
	However, if the errors occur at certain times, the machine
	may hang or reboot spontaneously.

	Running bonnie was enough to hang the machine.  So was a
	tar xf on the UDMA drive.  Once, the machine panicked.
	A fast dd usually doesn't cause problems.

	I'm completely speculating here, but perhaps if the failure
	occurs when reading in data, the system recovers with the
	downgrade/retries, but if the failure occurs when reading
	in metadata, the system hangs?

	I've tried several different cables, including one that
	is around 8 inches long with only a single drive connector.

	Here's the full dmesg, with some of the CRC errors:

NetBSD 1.4K (K93) #31: Wed Sep 22 01:55:44 CDT 1999
    bgrayson@k9.ece.utexas.edu:/home/src/sys/arch/i386/compile/K93
cpu0: family 5 model 2 step 2
cpu0: Intel Pentium (P54C) (586-class)
total memory = 32384 KB
avail memory = 27584 KB
using 430 buffers containing 1720 KB of memory
mainbus0 (root)
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o enabled, memory enabled
pchb0 at pci0 dev 0 function 0
pchb0: Acer Labs M1531 Host-PCI Bridge (rev. 0xb2)
pcib0 at pci0 dev 2 function 0
pcib0: Acer Labs M1543 PCI-ISA Bridge (rev. 0xb4)
vr0 at pci0 dev 4 function 0: VIA VT3043 (Rhine) 10/100 Ethernet
vr0: interrupting at irq 11
vr0: Ethernet address: 00:80:c8:f9:03:8e
ukphy0 at vr0 phy 8: Generic IEEE 802.3u media interface
ukphy0: OUI 0x006040, model 0x0000, rev. 0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
S3 Trio64V2/DX (VGA display, revision 0x14) at pci0 dev 5 function 0 not configured
pciide0 at pci0 dev 11 function 0: Acer Labs M5229 UDMA IDE Controller
pciide0: bus-master DMA support present
pciide0: primary channel configured to compatibility mode
wd0 at pciide0 channel 0 drive 0: <Maxtor 91303D6>
wd0: drive supports 16-sector pio transfers, lba addressing
wd0: 12427MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 25450992 sectors
wd0: 32-bits data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2
pciide0: primary channel interrupting at irq 14
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfers)
pciide0: secondary channel configured to compatibility mode
wd1 at pciide0 channel 1 drive 0: <QUANTUM BIGFOOT_CY4320A>
wd1: drive supports 32-sector pio transfers, lba addressing
wd1: 4134MB, 8960 cyl, 15 head, 63 sec, 512 bytes/sect x 8467200 sectors
wd1: 32-bits data port
wd1: drive supports PIO mode 4, DMA mode 2
pciide0: secondary channel interrupting at irq 15
wd1(pciide0:1:0): using PIO mode 4, DMA mode 2 (using DMA data transfers)
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns8250 or ns16450, no fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
com2 at isa0 port 0x3e8-0x3ef irq 5: ns8250 or ns16450, no fifo
lpt0 at isa0 port 0x378-0x37b irq 7
lptprobe: mask ff data 55 failed
lptprobe: mask ff data 55 failed
ix0 at isa0 port 0x300-0x30f iomem 0xd8000-0xdffff irq 10 address 00:aa:00:58:62:28, type EtherExpress/16
pcppi0 at isa0 port 0x61
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
WARNING: Pentium FDIV bug detected!
vt0 at isa0 port 0x60-0x6f irq 1
vt0: unknown s3, 80 col, color, 8 scr, mf2-kbd, [R3.32]
vt0: console
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
isapnp0: no ISA Plug 'n Play devices found
biomask c040 netmask cc40 ttymask ccc2
boot device: wd0
root on wd0a dumps on wd0b
mountroot: trying coda...
mountroot: trying msdos...
mountroot: trying cd9660...
mountroot: trying nfs...
mountroot: trying lfs...
mountroot: trying ext2fs...
mountroot: trying ffs...
root file system type: ffs
init: copying out path `/sbin/init' 11
wd0a:  aborted command, interface CRC error reading fsbn 18048 of 18048-18175 (wd0 bn 18111; cn 17 tn 15 sn 30), retrying
wd0: soft error (corrected)


	Here are some CRC errors logged the previous time the machine
	was up, ending with the downgrade to PIO mode 4:
Sep 21 22:54:01 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 15872 of 15872-15999 (wd0 bn 15935; cn 15 tn 12 sn 59), retrying
Sep 21 22:54:02 k9 /netbsd: wd0: soft error (corrected)
Sep 21 22:54:02 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 21504 of 21504-21631 (wd0 bn 21567; cn 21 tn 6 sn 21), retrying
Sep 21 22:54:02 k9 /netbsd: wd0: soft error (corrected)
Sep 21 22:54:03 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 24064 of 24064-24191 (wd0 bn 24127; cn 23 tn 14 sn 61), retrying
Sep 21 22:54:03 k9 /netbsd: wd0: soft error (corrected)
Sep 21 22:54:04 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 39808 of 39808-39935 (wd0 bn 39871; cn 39 tn 8 sn 55), retrying
Sep 21 22:54:04 k9 /netbsd: wd0: soft error (corrected)
Sep 21 22:54:06 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 88960 of 88960-89087 (wd0 bn 89023; cn 88 tn 5 sn 4), retrying
Sep 21 22:54:06 k9 /netbsd: wd0: soft error (corrected)
Sep 21 22:54:07 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 98176 of 98176-98303 (wd0 bn 98239; cn 97 tn 7 sn 22), retrying
Sep 21 22:54:07 k9 /netbsd: wd0: soft error (corrected)
Sep 21 22:54:07 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 99456 of 99456-99583 (wd0 bn 99519; cn 98 tn 11 sn 42), retrying
Sep 21 22:54:08 k9 /netbsd: wd0: transfer error, downgrading to PIO mode 4
Sep 21 22:54:09 k9 /netbsd: wd0(pciide0:0:0): using PIO mode 4
Sep 21 22:54:09 k9 /netbsd: wd0a:  aborted command, interface CRC error reading fsbn 99456 of 99456-99583 (wd0 bn 99519; cn 98 tn 11 sn 42), retrying
Sep 21 22:54:09 k9 /netbsd: wd0: soft error (corrected)
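
	As a cross-check on the messages above, the cn/tn/sn values can
	be recomputed from the geometry wd0 reports at attach time
	(16 heads, 63 sectors per track).  The following is only a
	minimal Python sketch, not part of the kernel or of any NetBSD
	tool: the regular expression and the function names are my own,
	and it assumes the sn field is zero-based, which is what the
	logged values imply.

```python
import re

# Geometry reported by wd0 at attach time in the dmesg above.
HEADS = 16
SECTORS_PER_TRACK = 63

# Pattern written for the log lines in this report only (an assumption,
# not a documented grammar for kernel messages).
CRC_RE = re.compile(
    r"interface CRC error reading fsbn (\d+) of (\d+)-(\d+) "
    r"\(wd0 bn (\d+); cn (\d+) tn (\d+) sn (\d+)\)"
)


def chs_to_block(cn, tn, sn):
    """Convert cylinder/track/sector to a block number (sn assumed zero-based)."""
    return (cn * HEADS + tn) * SECTORS_PER_TRACK + sn


def parse_crc_line(line):
    """Return the numeric fields of a CRC error line, or None if no match."""
    m = CRC_RE.search(line)
    if m is None:
        return None
    fsbn, first, last, bn, cn, tn, sn = map(int, m.groups())
    return {"fsbn": fsbn, "first": first, "last": last,
            "bn": bn, "cn": cn, "tn": tn, "sn": sn}


def count_errors_before_downgrade(lines):
    """Count CRC error lines seen before the first downgrade message."""
    count = 0
    for line in lines:
        if "downgrading to PIO" in line:
            break
        if parse_crc_line(line) is not None:
            count += 1
    return count


# Four lines copied from the log excerpt above.
sample = [
    "wd0a:  aborted command, interface CRC error reading fsbn 15872 of "
    "15872-15999 (wd0 bn 15935; cn 15 tn 12 sn 59), retrying",
    "wd0: soft error (corrected)",
    "wd0a:  aborted command, interface CRC error reading fsbn 99456 of "
    "99456-99583 (wd0 bn 99519; cn 98 tn 11 sn 42), retrying",
    "wd0: transfer error, downgrading to PIO mode 4",
]

for line in sample:
    rec = parse_crc_line(line)
    if rec is not None:
        ok = chs_to_block(rec["cn"], rec["tn"], rec["sn"]) == rec["bn"]
        print(rec["bn"], "CHS consistent" if ok else "CHS mismatch")
```

	The CHS values in the sampled lines agree with the reported
	block numbers, so the messages themselves look self-consistent;
	the problem appears to be on the transfer path rather than in
	how the failing block is reported.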

	I will try to boot a kernel with a serial console, get a
	backtrace, and append it to this PR later tonight.
>How-To-Repeat:
	On a machine that gets interface CRC errors in Ultra-DMA
	mode, run bonnie or a tar xf on the UDMA drive (see
	Description).
	
>Fix:
	Workaround only: for now, my /etc/rc does a 100MB dd of
	the raw partition, to try to force the CRC failure and
	downgrade to occur before the system is highly active.
	Unfortunately, that isn't always sufficient, so I'll
	probably just disable UDMA in my kernel.
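
	The idea behind the dd workaround is just a large sequential
	read whose data is thrown away, done early in boot.  The sketch
	below shows the same idea in Python for illustration only; it
	is not what my /etc/rc runs.  The function name, the 64KB chunk
	size, and the device path in the final comment are assumptions.

```python
import os


def warm_up_read(path, total_bytes=100 * 1024 * 1024, chunk_bytes=64 * 1024):
    """Sequentially read up to total_bytes from path and discard the data.

    Returns the number of bytes actually read.  A read error is printed
    and ends the loop early instead of raising.
    """
    done = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while done < total_bytes:
            # Chunks are multiples of 512 bytes, as raw disk devices expect.
            want = min(chunk_bytes, total_bytes - done)
            try:
                buf = os.read(fd, want)
            except OSError as err:
                print(f"read failed after {done} bytes: {err}")
                break
            if not buf:  # end of device or file
                break
            done += len(buf)
    finally:
        os.close(fd)
    return done


# Hypothetical usage; the raw device name is an assumption:
# warm_up_read("/dev/rwd0d")
```

	As noted above, a warm-up read like this only makes the
	downgrade more likely to happen early; it does not guarantee it.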
	
>Audit-Trail:
>Unformatted: