Subject: kern/26873: wd driver error recovery broken
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <kardel@Orcus.project.Acrys.COM>
List: netbsd-bugs
Date: 09/07/2004 09:48:08
>Number:         26873
>Category:       kern
>Synopsis:       wd driver fails to correctly recover on "interface CRC error"
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Sep 07 08:28:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     kardel
>Release:        NetBSD 2.0G current 20040831
>Organization:
	
>Environment:
System: NetBSD Orcus 2.0G NetBSD 2.0G (ORCUS32) #2: Wed Jul 14 11:00:31 CEST 2004 kardel@Orcus:/usr/obj/sys/arch/i386/compile.i386/ORCUS32 i386
Last somewhat usable kernel. The problem is in kernels around 20040831.
Architecture: i386
Machine: i386
>Description:
Given a suboptimal IDE disk installation that undoubtedly needs to be fixed, the
IDE driver experiences interface CRC errors.
In -current kernels around 20040714 this led to the following error recovery
sequence:
wd4: transfer error, downgrading to Ultra-DMA mode 4
wd4(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA data transfers)
wd4p: error reading fsbn 18535488 of 18535488-18535519 (wd4 bn 417660879; cn 414346 tn 1 sn 48), retrying
wd4: (aborted command, interface CRC error)
wd4: transfer error, downgrading to Ultra-DMA mode 3
wd4(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 3 (using DMA data transfers)
wd4: soft error (corrected)
wd4: transfer error, downgrading to Ultra-DMA mode 2
wd4(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
wd4: transfer error, downgrading to Ultra-DMA mode 1
wd4(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 1 (using DMA data transfers)
wd4: transfer error, downgrading to PIO mode 4
wd4(viaide0:0:0): using PIO mode 4

Dropping down to PIO mode 4 was not really great, but the system remained usable.
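
For reference, the ladder seen in the messages above can be written down as a
small model. This is only a sketch inferred from the log (Ultra-DMA modes are
stepped down one at a time, and Ultra-DMA mode 1 is followed directly by PIO);
it is not the driver code, and the names next_mode and ladder are made up for
illustration.

```python
# Model of the downgrade ladder as it appears in the log above.
# Inferred from the console messages only; this is NOT the wd/ata driver code.

def next_mode(kind, mode, pio_cap=4):
    """Return the (kind, mode) that the log suggests follows a transfer error.

    pio_cap is an assumption: the best PIO mode the drive reports (4 here).
    """
    if kind == "udma" and mode >= 2:
        # Ultra-DMA modes 5..2 step down by one.
        return ("udma", mode - 1)
    if kind == "udma":
        # After Ultra-DMA mode 1 the log goes straight to PIO.
        return ("pio", pio_cap)
    # Already PIO: the log shows no further downgrade.
    return (kind, mode)


def ladder(kind, mode):
    """List every step from the starting mode down to the final one."""
    steps = [(kind, mode)]
    while True:
        nxt = next_mode(*steps[-1])
        if nxt == steps[-1]:
            return steps
        steps.append(nxt)


if __name__ == "__main__":
    for kind, mode in ladder("udma", 5):
        print(kind, mode)
```

Starting from Ultra-DMA mode 5, this model walks through modes 4, 3, 2 and 1
and ends at PIO mode 4, which is the sequence the 20040714 kernel printed.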

More recent kernels (20040831) exhibit really problematic behaviour. Error
recovery seems to stop after:
wd4: transfer error, downgrading to Ultra-DMA mode 4
wd4(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA data transfers)

After that there is no more disk activity, and flushing the disk cache fails
on reboot. Disk accesses (like fsck) hang. The file system eventually hangs
with vnlocks. Basically, this leads to a slowly locking up system.

The previous behavior was better. Best of all would be to get rid of those
interface CRC errors altogether and to fix error recovery.


dmesg from last somewhat usable kernel:
NetBSD 2.0G (ORCUS32) #2: Wed Jul 14 11:00:31 CEST 2004
	kardel@Orcus:/usr/obj/sys/arch/i386/compile.i386/ORCUS32
total memory = 2047 MB
avail memory = 1996 MB
BIOS32 rev. 0 found at 0xf0010
mainbus0 (root)
cpu0 at mainbus0: (uniprocessor)
cpu0: AMD Unknown K7 (Athlon) (686-class), 2004.68 MHz, id 0xf58
cpu0: features 78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 78bfbff<PGE,MCA,CMOV,PAT,PSE36,MPC,MMX>
cpu0: features 78bfbff<FXSR,SSE,SSE2>
cpu0: "AMD Opteron(tm) Processor 146"
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
ppb0 at pci0 dev 6 function 0: Advanced Micro Devices AMD8111 I/O Hub (rev. 0x07)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
ohci0 at pci1 dev 0 function 0: Advanced Micro Devices AMD8111 USB Host Controller (rev. 0x0b)
ohci0: interrupting at irq 9
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: Advanced Micro OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 3 ports with 3 removable, self powered
ohci1 at pci1 dev 0 function 1: Advanced Micro Devices AMD8111 USB Host Controller (rev. 0x0b)
ohci1: interrupting at irq 9
ohci1: OHCI version 1.0, legacy support
usb1 at ohci1: USB revision 1.0
uhub1 at usb1
uhub1: Advanced Micro OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 3 ports with 3 removable, self powered
ahc1 at pci1 dev 3 function 0: Adaptec 29160 Ultra160 SCSI adapter
ahc1: interrupting at irq 5
ahc1: aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc1: 16 targets, 8 luns per target
ex0 at pci1 dev 7 function 0: 3Com 3c905C-TX 10/100 Ethernet with mngmt (rev. 0x78)
ex0: interrupting at irq 11
ex0: MAC address 00:0a:5e:06:2c:62
exphy0 at ex0 phy 24: 3Com internal media interface
exphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pdcide0 at pci1 dev 9 function 0
pdcide0: Promise Ultra133/ATA Bus Master IDE Accelerator (rev. 0x02)
pdcide0: bus-master DMA support present
pdcide0: primary channel configured to native-PCI mode
pdcide0: using irq 5 for native-PCI interrupt
atabus0 at pdcide0 channel 0
pdcide0: secondary channel configured to native-PCI mode
atabus1 at pdcide0 channel 1
pdcide1 at pci1 dev 10 function 0
pdcide1: Promise Ultra133/ATA Bus Master IDE Accelerator (rev. 0x02)
pdcide1: bus-master DMA support present
pdcide1: primary channel configured to native-PCI mode
pdcide1: using irq 10 for native-PCI interrupt
atabus2 at pdcide1 channel 0
pdcide1: secondary channel configured to native-PCI mode
atabus3 at pdcide1 channel 1
vga1 at pci1 dev 11 function 0: ATI Technologies Rage XL (rev. 0x27)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
satalink0 at pci1 dev 12 function 0
satalink0: Silicon Image SATALink 3114 (rev. 0x02)
satalink0: 33MHz PCI bus
satalink0: bus-master DMA support present
satalink0: using irq 10 for native-PCI interrupt
atabus4 at satalink0 channel 0
atabus5 at satalink0 channel 1
atabus6 at satalink0 channel 2
atabus7 at satalink0 channel 3
bge0 at pci1 dev 13 function 0: Broadcom BCM5705 Gigabit Ethernet
bge0: interrupting at irq 9
bge0: ASIC BCM5705 A3 (0x3003), Ethernet address 00:e0:81:60:3b:11
brgphy0 at bge0 phy 1: BCM5705 1000BASE-T media interface, rev. 2
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge1 at pci1 dev 14 function 0: Broadcom BCM5705 Gigabit Ethernet
bge1: interrupting at irq 5
bge1: ASIC BCM5705 A3 (0x3003), Ethernet address 00:e0:81:60:3b:12
brgphy1 at bge1 phy 1: BCM5705 1000BASE-T media interface, rev. 2
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
pcib0 at pci0 dev 7 function 0
pcib0: Advanced Micro Devices AMD8111 LPC Controller (rev. 0x05)
viaide0 at pci0 dev 7 function 1
viaide0: Advanced Micro Devices AMD8111 IDE Controller (rev. 0x03)
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at irq 14
atabus8 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel interrupting at irq 15
atabus9 at viaide0 channel 1
Advanced Micro Devices AMD8111 SMBus Controller (SMBus serial bus, revision 0x02) at pci0 dev 7 function 2 not configured
Advanced Micro Devices AMD8111 ACPI Controller (miscellaneous bridge, revision 0x05) at pci0 dev 7 function 3 not configured
pchb0 at pci0 dev 24 function 0
pchb0: Advanced Micro Devices AMD64 HyperTransport configuration (rev. 0x00)
pchb1 at pci0 dev 24 function 1
pchb1: Advanced Micro Devices AMD64 Address Map configuration (rev. 0x00)
pchb2 at pci0 dev 24 function 2
pchb2: Advanced Micro Devices AMD64 DRAM configuration (rev. 0x00)
pchb3 at pci0 dev 24 function 3
pchb3: Advanced Micro Devices AMD64 Miscellaneous configuration (rev. 0x00)
isa0 at pcib0
lpt0 at isa0 port 0x378-0x37b irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 mux 0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
Kernelized RAIDframe activated
IPsec: Initialized Security Association Processing.
scsibus0: waiting 2 seconds for devices to settle...
st0 at scsibus0 target 5 lun 0: <HP, C5713A, H910> tape removable
st0: density code 37, variable blocks, write-enabled
ch0 at scsibus0 target 5 lun 1: <HP, C5713A, H910> changer removable
ch0: 6 slots, 1 drive, 0 pickers, 0 portals
st0: sync (50.00ns offset 32), 16-bit (40.000MB/s) transfers
ch0: sync (50.00ns offset 32), 16-bit (40.000MB/s) transfers
wd0 at atabus0 drive 0: <HDS722525VLAT80>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 232 GB, 484521 cyl, 16 head, 63 sec, 512 bytes/sect x 488397168 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(pdcide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd1 at atabus1 drive 0: <HDS722525VLAT80>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 232 GB, 484521 cyl, 16 head, 63 sec, 512 bytes/sect x 488397168 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(pdcide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd2 at atabus2 drive 0: <HDS722525VLAT80>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd3 at atabus2 drive 1: <HDS722525VLAT80>
wd3: drive supports 16-sector PIO transfers, LBA48 addressing
wd3: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
wd3: 32-bit data port
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd2(pdcide1:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd3(pdcide1:0:1): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd4 at atabus8 drive 0: <HDS722525VLAT80>
wd4: drive supports 16-sector PIO transfers, LBA48 addressing
wd4: 232 GB, 484521 cyl, 16 head, 63 sec, 512 bytes/sect x 488397168 sectors
wd4: 32-bit data port
wd4: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd4(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd5 at atabus9 drive 1: <HDS722525VLAT80>
wd5: drive supports 16-sector PIO transfers, LBA48 addressing
wd5: 232 GB, 484521 cyl, 16 head, 63 sec, 512 bytes/sect x 488397168 sectors
wd5: 32-bit data port
wd5: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd5(viaide0:1:1): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
warning: double match for boot device (wd0, wd1)
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
raid2: ...
>How-To-Repeat:
	Have a suboptimal IDE disk connection (interface CRC errors). (Any hints
	on remedying that problem? I have already been through several cable
	combinations.)
	Try to do some fast disk I/O, such as fsck on UFS2.
	Watch the driver downgrade and hang.
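
To compare kernels from saved console output, the downgrade messages can be
collected per drive. The following is a minimal sketch that assumes the
message format shown in the Description; the function name downgrades is
hypothetical, and other kernel versions may word the message differently.

```python
import re

# Matches console lines such as:
#   wd4: transfer error, downgrading to Ultra-DMA mode 4
#   wd4: transfer error, downgrading to PIO mode 4
# The pattern is based only on the messages quoted in this report.
DOWNGRADE = re.compile(
    r"^(wd\d+): transfer error, downgrading to (Ultra-DMA|PIO) mode (\d+)$"
)


def downgrades(log_text):
    """Map each drive to the ordered list of modes it was downgraded to."""
    seen = {}
    for line in log_text.splitlines():
        m = DOWNGRADE.match(line.strip())
        if m:
            drive, kind, mode = m.groups()
            seen.setdefault(drive, []).append((kind, int(mode)))
    return seen


if __name__ == "__main__":
    import sys
    # Example use: dmesg | python3 downgrades.py
    for drive, steps in sorted(downgrades(sys.stdin.read()).items()):
        print(drive, " -> ".join(f"{kind} {mode}" for kind, mode in steps))
```

With the 20040714 output this lists five steps for wd4 ending at PIO mode 4;
with the 20040831 output quoted above it lists only the first step,
Ultra-DMA mode 4.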
>Fix:
	Fix the connection so the interface CRC errors don't happen (I haven't been
	successful at that yet; hints? power/cable/connectors?).
	Take the disks out of service (not really feasible).
	Use a kernel from around 20040714.
	Find the bug in the driver's error recovery.
>Release-Note:
>Audit-Trail:
>Unformatted: