Subject: kern/35008: viaide.c v1.35 sometimes fails horribly
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
From: perry@piermont.com
List: netbsd-bugs
Date: 11/07/2006 14:40:00
>Number:         35008
>Category:       kern
>Synopsis:       viaide.c v1.35 sometimes fails horribly
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Nov 07 14:40:00 +0000 2006
>Originator:     Perry E. Metzger
>Release:        NetBSD 4.99.3
>Organization:
	
>Environment:
	
	
System: NetBSD ein.piermont.com 4.99.3 NetBSD 4.99.3 (ZWEI) #1: Mon Nov 6 21:29:02 EST 2006 perry@ein.piermont.com:/usr/src/sys/arch/amd64/compile/ZWEI amd64
Architecture: x86_64
Machine: amd64
>Description:

I'm running on an amd64 box with a viaide SATA controller. With ACPI
disabled in the kernel, both version 1.34 and version 1.35 of viaide.c
periodically fail to boot (perhaps one boot in every five), with the
driver spewing errors during boot and failing to read the disk.

However, this PR is about the behavior with ACPI turned on.

With ACPI enabled, version 1.35 fails about once in every five to ten
reboots. I get lots of messages, most of which scroll off the screen
before I can write them down. :(

This is what was left on the screen that I could type in by hand:

[...]
: <ST506>
wd0: drive supports 1-sector PIO transfers, chs addressing
[note: this is a modern drive and does fine most reboots.]
wd0: 69632 KB, 1024 cyl, 8 head, 17 sec, 512 bytes/sect x 139264 sectors
[that's totally wrong of course, and it works on most boots.]
[then we have a bunch of unimportant junk, and then...]
wd0(viaide1:0:0): using PIO mode 0
viaide1:0:0: wait timed out
wd0d: device timeout reading fsbn 0 (wd0 bn 0; cn 0 tn 0 sn 0), retrying
wd0: soft error (corrected)
wd0: mbr partition exceeds disk size
wd0: mbr partition exceeds disk size
wd0: mbr partition exceeds disk size
wd0: mbr partition exceeds disk size
boot device: <unknown>
root device:

[and then it wants me to type in a boot device.]
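
The bogus geometry in the wd0 line is at least internally consistent,
which can be checked with a little arithmetic. The sketch below only
multiplies out the numbers copied from the message above; the function
name is made up for illustration. The fact that the figures belong to a
tiny ST506-style disk suggests the driver printed placeholder values
instead of real identify data, but that reading is an assumption, not
something confirmed here.

```python
# Geometry values copied from the wd0 boot message quoted above.
CYLINDERS = 1024
HEADS = 8
SECTORS_PER_TRACK = 17
BYTES_PER_SECTOR = 512


def chs_capacity(cylinders, heads, sectors_per_track, bytes_per_sector):
    """Return (total_sectors, size_in_kb) for a CHS geometry."""
    total_sectors = cylinders * heads * sectors_per_track
    size_kb = total_sectors * bytes_per_sector // 1024
    return total_sectors, size_kb


total_sectors, size_kb = chs_capacity(
    CYLINDERS, HEADS, SECTORS_PER_TRACK, BYTES_PER_SECTOR
)
# Prints: 69632 KB, 139264 sectors
print(f"{size_kb} KB, {total_sectors} sectors")
```

Both results match the "69632 KB ... x 139264 sectors" line, so the
size is derived from the wrong geometry rather than being separately
corrupted. A disk that small would also explain the "mbr partition
exceeds disk size" complaints.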

With ACPI enabled, version 1.34 of viaide.c works fine -- I've tried
rebooting about 30 times without a single failure.
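
To gauge how much weight 30 clean reboots carry, here is a rough
estimate of how likely such a run would be if 1.34 actually failed as
often as 1.35 does. This is only a sketch: it assumes each boot fails
independently with a fixed probability, which may not hold for a
hardware initialization problem, and the function name is made up for
illustration.

```python
def prob_all_boots_succeed(failure_rate, boots):
    """Chance that `boots` consecutive boots all succeed, assuming each
    boot fails independently with probability `failure_rate`."""
    return (1.0 - failure_rate) ** boots


# Failure rates reported for v1.35: about one in five to one in ten.
for failure_rate in (1 / 5, 1 / 10):
    p = prob_all_boots_succeed(failure_rate, 30)
    print(f"failure rate {failure_rate:.2f}: "
          f"P(30 clean boots) = {p:.4f}")
```

Under those assumptions a clean run of 30 would happen roughly 4% of
the time at a one-in-ten failure rate and about 0.1% of the time at
one-in-five, so the difference between 1.34 and 1.35 is unlikely to be
luck.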

Note that the behavior without ACPI on is also disturbing -- it
appears that sometimes the chip doesn't get properly initialized by
the driver. This may be some variation on that theme. However, if we
can at least fix this regression, I can limp along...

>How-To-Repeat:

It might be difficult without my particular motherboard and chipset,
or it might happen with any hardware. With my boxes, it is pretty
straightforward, if labor intensive, to reproduce. Note that I get the
same symptoms on multiple boxes.

>Fix:

My guess is that something in v1.35 depends on the state the chip is
in on boot and that every once in a while it is not in that state. (I
also suspect this is the problem without ACPI turned on, but that ACPI
initializes things "better"). My guess is that the problem can be
fixed by cleanly initializing the chip during boot, but I have no real
knowledge of the chip in question.

>Unformatted: