netbsd-bugs: port-i386/36314: Panic with failing SATA disk

Subject: port-i386/36314: Panic with failing SATA disk
To: None <port-i386-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: None <ianh@orange.net>
List: netbsd-bugs
Date: 05/13/2007 14:15:01

>Number:         36314
>Category:       port-i386
>Synopsis:       Panic with failing SATA disk
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun May 13 14:15:00 +0000 2007
>Originator:     ianh@orange.net
>Release:        NetBSD 3.99.14
>Organization:
	
>Environment:
	
	
Architecture: i386
Machine: i386
>Description:
	Having been a NetBSD user since about 0.8 on dozens of machines, I'm devastated to have my first panic :-( I have a RAID 1 with two SATA disks, and one of them (wd1) has started to fail (logged in messages, and confirmed by SMART). I have seen "address mark missing" errors as well.

	The output below is hand copied; the panic is reproducible if you need more.

	Lots of these lost interrupt messages (with varying blocks, and soft error corrected) appear before the final panic:

piixide1:0:0 lost interrupt
	type: ata tc_bcount:16384 tc_skip:0
piixide1:0:0 bus-master DMA error : missing interrupt, status=0x21
piixide1:0:0 device timeout, c_bcount=16384, c_skip0
wd1a: device timeout writing fsbn 320556800 of 320556800-320556831 (wd1 bn 320556800; cn=318012 tn=11 sn=11), retrying
piixide1 channel 0 : reset failed for drive 0
kernel: Supervisor trap machine check fault, code=0
Stopped in pid 13.1 (atabus4) at netbsd:wdcreset+0xb4 : outb %al,%dx

db> tr
wdcreset(c139091c,0,cbb0cf3c,caa6b8c4,0) at netbsd:wdcreset+0xb4
wdc_reset_channel(c139091c,20008,caa6cb2c,c139091c,0) at netbsd:wdc_reset_channel+0x53
ata_reset_channel(c139091c,20008,0,296,0) at netbsd:ata_reset_channel+0xbf
atabus_thread(c137f300,956000,95f000,0,c0100321) at netbsd:atabus_thread+0x135
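
As a sanity check on the hand-copied numbers, the wd1a timeout line can be cross-checked arithmetically. The sketch below assumes 512-byte sectors and a translated geometry of 16 heads and 63 sectors per track; neither is stated in the report, and the function names are purely illustrative. Under those assumptions the block range matches the logged c_bcount, and the block number matches the logged cn/tn/sn values.

```python
# Sketch: cross-check the figures in the wd1a "device timeout" message.
# Assumptions (not stated in the report): 512-byte sectors and a
# translated geometry of 16 heads x 63 sectors per track.

SECTOR_SIZE = 512        # assumed bytes per sector
HEADS = 16               # assumed tracks (heads) per cylinder
SECTORS_PER_TRACK = 63   # assumed sectors per track


def transfer_bytes(first_block, last_block, sector_size=SECTOR_SIZE):
    """Return the number of bytes covered by an inclusive block range."""
    return (last_block - first_block + 1) * sector_size


def block_to_chs(block, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Split an absolute block number into zero-based (cn, tn, sn)."""
    sector = block % spt
    track = (block // spt) % heads
    cylinder = block // (spt * heads)
    return cylinder, track, sector


if __name__ == "__main__":
    # Block range taken from the log line: 320556800-320556831
    print(transfer_bytes(320556800, 320556831))  # compare with c_bcount
    print(block_to_chs(320556800))               # compare with cn/tn/sn
```

If the assumptions hold, the first value equals the logged c_bcount of 16384 and the second equals cn=318012 tn=11 sn=11, which suggests the block numbers were transcribed correctly.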

This looks like the bus_space_write_1 at the end of wdcreset, based on code inspection and the "reset failed..." message. I'm not sure what it's trying to achieve at that point, but panicking probably isn't the best option. I will be ordering a replacement disk of course; can we do anything useful with this in the meantime?
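
For reading the status=0x21 value in the bus-master DMA error line, a small decoder may help. This is a sketch only: it assumes the logged value is the bus-master IDE status register, and the bit names follow the usual SFF-8038i layout rather than anything stated in this report. The table and function names are hypothetical.

```python
# Sketch: decode a bus-master IDE status byte.
# Assumption: the "status=0x21" in the log is the bus-master IDE status
# register, with bits laid out as in the SFF-8038i convention.

BM_STATUS_BITS = {
    0x01: "bus master active",
    0x02: "DMA error",
    0x04: "interrupt",
    0x20: "drive 0 DMA capable",
    0x40: "drive 1 DMA capable",
    0x80: "simplex only",
}


def decode_bm_status(status):
    """Return the names of the bits set in a bus-master status byte."""
    return [name for bit, name in BM_STATUS_BITS.items() if status & bit]


if __name__ == "__main__":
    print(decode_bm_status(0x21))
```

Under that assumption, 0x21 has the active bit set and the interrupt bit clear, which would be consistent with the "missing interrupt" wording: the transfer was still marked as running when the timeout fired.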

>How-To-Repeat:
	
>Fix:
	

>Unformatted: