Subject: kern/31826: satalink(4) channel reset fails
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <tnn@netilium.org>
List: netbsd-bugs
Date: 10/15/2005 14:31:00
>Number:         31826
>Category:       kern
>Synopsis:       satalink(4) channel reset fails
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Oct 15 14:31:00 +0000 2005
>Originator:     Tobias Nygren
>Release:        HEAD
>Organization:
>Environment:
NetBSD chiyo.nygren.pp.se 3.99.9 NetBSD 3.99.9 (GENERIC.MP) #0: Mon Oct 10 21:55:52 CEST 2005  tnn@chiyo.nygren.pp.se:/scratch/src/work/obj-alpha/sys/arch/alpha/compile/GENERIC.MP alpha
>Description:
I recently hit the same error that wiz@ described in
http://mail-index.netbsd.org/current-users/2005/08/14/0005.html, so
I decided to file a PR.
As the log below shows, it begins with a lost interrupt, followed by a
second lost interrupt and a failed attempt to reset the channel. The
driver then loops endlessly, trying to reset the channel.
This drive is part of a RAIDframe mirror set that uses two separate
sil3114 controllers on separate PCI buses. Because the driver goes into
a reset loop, it never reports a hard error that would cause
RAIDframe to fail the drive. Instead, all disk I/O locks up.

This happened on a fairly exotic AlphaServer system, but I've also
seen it once on a regular Dell Pentium 4 desktop, so I
believe this is a problem with the satalink driver.

satalink0 at pci0 dev 2 function 0
satalink0: Silicon Image SATALink 3114 (rev. 0x02)
satalink0: 33MHz PCI bus
satalink0: bus-master DMA support present
satalink0: using kn300 irq 40 for native-PCI interrupt
atabus0 at satalink0 channel 0
atabus1 at satalink0 channel 1
atabus2 at satalink0 channel 2
atabus3 at satalink0 channel 3
satalink1 at pci1 dev 5 function 0
satalink1: Silicon Image SATALink 3114 (rev. 0x02)
satalink1: 33MHz PCI bus
satalink1: bus-master DMA support present
satalink1: using kn300 irq 20 for native-PCI interrupt
atabus4 at satalink1 channel 0
atabus5 at satalink1 channel 1
atabus6 at satalink1 channel 2
atabus7 at satalink1 channel 3
satalink0: port 3: device present, speed: 1.5Gb/s
satalink1: port 2: device present, speed: 1.5Gb/s
[...]
wd0 at atabus3 drive 0scsibus1: waiting 2 seconds for devices to settle...
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 279 GB, 581463 cyl, 16 head, 63 sec, 512 bytes/sect x 586114704 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4satalink1: port 3: device present, speed: 1.5Gb/s
wd0(satalink0:3:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd1 at atabus6 drive 0: <Maxtor 6L300S0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 279 GB, 581463 cyl, 16 head, 63 sec, 512 bytes/sect x 586114704 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(satalink1:2:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd2 at atabus7 drive 0: <Maxtor 7L300S0>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 279 GB, 581463 cyl, 16 head, 63 sec, 512 bytes/sect x 586114704 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(satalink1:3:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
[...]
raid0: RAID Level 1
raid0: Components: /dev/wd0e /dev/wd2e[**FAILED**]
raid0: Total Sectors: 586113600 (286188 MB)
root on raid0a dumps on raid0b
raid0: Error re-writing parity!
raid0: initiating in-place reconstruction on column 1
[...]
satalink1:3:0: lost interrupt
        type: ata tc_bcount: 20480 tc_skip: 0
satalink1:3:0: bus-master DMA error: missing interrupt, status=0x21
wd2e: DMA error writing fsbn 166095744 of 166095744-166095783 (wd2 bn 166096752g
wd2: soft error (corrected)
satalink1:3:0: lost interrupt
        type: ata tc_bcount: 32768 tc_skip: 0
satalink1:3:0: bus-master DMA error: missing interrupt, status=0x20
satalink1:3:0: device timeout, c_bcount=32768, c_skip0
wd2e: device timeout reading fsbn 256327616 of 256327616-256327679 (wd2 bn 2563g
satalink1 channel 3: reset failed for drive 0
satalink1:3:0: wait timed out
wd2e: device timeout reading fsbn 256327616 of 256327616-256327679 (wd2 bn 2563g
satalink1 channel 3: reset failed for drive 0
[...repeat timeout/reset failed forever]

Some lines are truncated because I copied them from the scrollback of a serial console.

How can I debug this problem further?
From what I can tell, ata_reset_channel() calls atastart(),
wipes the ch_queue, and issues a reset to the drive. Is that right?
Would it be possible to reset part of the controller, or all of it,
if the drive does not respond?
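
To illustrate the kind of escalation I have in mind, here is a rough
standalone sketch, not a patch. Every name in it (bounded_reset,
reset_ops, MAX_CHANNEL_RESETS, the fake_* helpers) is hypothetical and
does not come from the ata or satalink sources, and the retry limit
of 3 is an arbitrary assumption. The idea is only that channel resets
are bounded, a controller reset is tried next, and a hard error is
returned if that fails too, so that the upper layer could fail the drive
instead of looping.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes; not taken from the NetBSD sources. */
enum reset_result { RESET_OK, RESET_ESCALATED, RESET_HARD_ERROR };

/* Hypothetical hooks standing in for whatever the driver would provide. */
struct reset_ops {
	bool (*channel_reset)(void *cookie);    /* true if the drive answered */
	bool (*controller_reset)(void *cookie); /* true if the chip recovered */
};

#define MAX_CHANNEL_RESETS 3	/* assumed limit, chosen arbitrarily */

/*
 * Try a bounded number of channel resets, then one controller reset
 * followed by a final channel reset.  Give up with a hard error rather
 * than retrying forever.
 */
enum reset_result
bounded_reset(const struct reset_ops *ops, void *cookie)
{
	for (int i = 0; i < MAX_CHANNEL_RESETS; i++) {
		if (ops->channel_reset(cookie))
			return RESET_OK;
	}
	if (ops->controller_reset != NULL && ops->controller_reset(cookie)) {
		if (ops->channel_reset(cookie))
			return RESET_ESCALATED;
	}
	return RESET_HARD_ERROR;
}

/* Simulated device, for demonstration only. */
struct fake_dev {
	int  fails_left;  /* channel resets that will still fail */
	int  attempts;    /* channel resets attempted so far */
	bool ctrl_fixes;  /* whether a controller reset revives the drive */
};

static bool
fake_channel_reset(void *cookie)
{
	struct fake_dev *d = cookie;

	d->attempts++;
	if (d->fails_left > 0) {
		d->fails_left--;
		return false;
	}
	return true;
}

static bool
fake_controller_reset(void *cookie)
{
	struct fake_dev *d = cookie;

	if (!d->ctrl_fixes)
		return false;
	d->fails_left = 0;
	return true;
}
```

With a simulated drive that never answers and a controller reset that
does not help, bounded_reset() returns RESET_HARD_ERROR after three
channel resets, which is the point where I would expect RAIDframe to be
told about the failure.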
>How-To-Repeat:
Maybe wiggle the SATA connector to cause a transfer error? :)



>Fix: