Subject: hardware diagnostics?
To: None <port-alpha@netbsd.org>
From: Dieter <netbsd@sopwith.solgatos.com>
List: port-alpha
Date: 08/29/2006 09:51:02

I have a 164lx that sometimes runs fine for weeks or months,
and other times crashes at the drop of a hat.

If an Alpha gets a double-bit memory error, what would
NetBSD print?

Does the cache have ECC, or only main memory?

Is there an explanation somewhere of how to run the hardware
diagnostics that SRM provides that is understandable by mere humans?

The current problem provides a bit of a clue.  I'm running a simple
test on a couple of new disks before putting them into service:
write data to the entire disk, then read it back.

DISK=/dev/rwd0c
# Read-back step; COUNT and BS are set elsewhere (values not shown here).
dd if="$DISK" count="$COUNT" bs="$BS" | hexdump -C

This works fine if the disk is filled with 0x00.  But if the disk is
filled with 0xff, I get a variety of panics and traps.  The failure is
repeatable and happens with both disks.  The time before the panic or
trap varies, as does the panic or trap itself.  Writing to the disk
works fine.  The problem happens only when reading the disk, and is
data dependent.  Smells like a hardware problem.
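
A minimal sketch of this kind of pattern test, for anyone who wants to
try the same thing.  It is not the exact script used here: TARGET, BS,
and COUNT are illustrative assumptions, and it writes to a scratch file
rather than a raw disk so it is safe to run as-is.  For a real test,
TARGET would be the raw device and COUNT would cover the whole disk.

```shell
#!/bin/sh
# Sketch only: TARGET, BS and COUNT are hypothetical values chosen for
# illustration, not the ones used in the test described above.
LC_ALL=C; export LC_ALL                  # make tr operate on raw bytes
TARGET=${TARGET:-/tmp/pattern-test.img}  # scratch file; a raw device in real use
BS=65536
COUNT=16

# Write phase: turn the zeros from /dev/zero into 0xff bytes and write them out.
dd if=/dev/zero bs=$BS count=$COUNT 2>/dev/null \
    | tr '\000' '\377' \
    | dd of="$TARGET" bs=$BS 2>/dev/null

# Read phase: read the data back, delete every 0xff byte, and count what is left.
# Any remaining byte is one that did not come back as 0xff.
BAD=$(dd if="$TARGET" bs=$BS count=$COUNT 2>/dev/null \
    | tr -d '\377' | wc -c | tr -d ' ')
echo "non-0xff bytes read back: $BAD"
```

Counting mismatched bytes instead of eyeballing hexdump output gives a
single number to compare between runs, though on a machine that panics
during the read the count will never be printed at all.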

Disks are Seagate SATA 7200.10 320 GB, connected to Silicon Image
SATALink 3512 on a PCI card.  A similar Seagate disk (7200.8 250 GB)
on the other port of the same controller works fine for normal use.
The kernel is NetBSD 2.0.2, with the LBA48 quirk patched.

It isn't the SATA card.  I had problems before adding the card.
I think it is something on the mainboard.