Subject: My NCR problems
To: None <port-i386@NetBSD.ORG>
From: Thor Lancelot Simon <tls@cloud9.net>
List: port-i386
Date: 01/22/1995 13:58:54
Here's some more info on the problems I had with the NCR driver.  Charles
doubtless knows more than he wants to about this already, but I'm sending it
to the list in case anyone else is seeing similar or identical symptoms.

An initial note: this happens even if I rebuild the driver to force the whole
                 bus to run asynchronous.

Every several megabytes written to a device on the SCSI bus, the following
sequence of error messages appears, sometimes with slight variations:

sd0(ncr0:0:0): COMMAND FAILED (9 ff) @(address dependent upon kernel build)
ncr targ 0?: ERROR (80:110:a8) (8/13) @(3603a8:87030000)
ncr: reset by timeout
sd0(ncr0:0:0): COMMAND FAILED (9 ff) @ (same address as above)
sd0(ncr0:0:0): COMMAND FAILED (9 ff) @ (address + 0x200)
sd0(ncr0:0:0): COMMAND FAILED (9 ff) @ (address + 0x400)
sd0(ncr0:0:0): COMMAND FAILED (9 ff) @ (address + 0x500)

and there's usually an I/O error, or else nothing at all gets written to
disk, silently, even though the system call seems to return.  I've seen
filesystem and swap corruption result from this.  In severe cases, the
machine hangs.  If I tell the driver to run asynchronous, the
following assertion fails after the rest of the messages appear, and the
kernel hangs:

assert(cp == np->header.cp) which is around line 5800 in ncr.c.

Sometimes an "ncr: must clear fifos" appears in the sequence of errors
listed above.  Sometimes the ERROR message is first and the first
COMMAND FAILED is second, and sometimes the "ncr: reset by timeout"
happens at a different point in the sequence.  Sometimes the whole
thing repeats itself every few seconds, and sometimes it doesn't
happen for ten minutes at a stretch.
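
For anyone trying to tell whether they are seeing the same silent write
loss, here is a minimal write-and-verify sketch.  It is not taken from the
driver or from the tests described in this message; the function names,
the chunk size and the byte pattern are all made up for illustration.
Note that the read-back may be served from the buffer cache rather than
the disk, so a pass proves little unless the file is larger than the
cache or the filesystem is remounted before verifying.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 65536  /* bytes per write(); an arbitrary choice */

/* Fill buf with a pattern that depends on the chunk index, so that a
 * chunk landing at the wrong offset is usually caught as well as a
 * missing one.  (Hypothetical helper, not part of any driver.) */
static void fill_pattern(unsigned char *buf, size_t len, unsigned long chunk_no)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)((chunk_no * 31 + i) & 0xff);
}

/* Write nchunks chunks to path, fsync, then read them back and compare.
 * Returns 0 if everything matched, -1 on any error or mismatch. */
int write_and_verify(const char *path, unsigned long nchunks)
{
    static unsigned char out[CHUNK], in[CHUNK];
    unsigned long n;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return -1; }

    for (n = 0; n < nchunks; n++) {
        fill_pattern(out, CHUNK, n);
        if (write(fd, out, CHUNK) != CHUNK) {
            perror("write"); close(fd); return -1;
        }
    }
    /* Push the data toward the disk before checking it. */
    if (fsync(fd) < 0) { perror("fsync"); close(fd); return -1; }

    if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); close(fd); return -1; }
    for (n = 0; n < nchunks; n++) {
        fill_pattern(out, CHUNK, n);
        if (read(fd, in, CHUNK) != CHUNK || memcmp(in, out, CHUNK) != 0) {
            fprintf(stderr, "mismatch in chunk %lu\n", n);
            close(fd); return -1;
        }
    }
    return close(fd) < 0 ? -1 : 0;
}
```

Calling write_and_verify() with a path on the suspect disk and a few
hundred chunks (512 chunks is 32 MB) should be enough to cross the
"every several megabytes" threshold described above.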

This has happened to me consistently on two different ASUS SP3G
motherboards.  It's happened to me with different internal SCSI
cables.  It's happened with no external devices and with them.  It's
happened when writing to a different disk than the one I usually see
the problem with.  The only thing I haven't tried doing is removing my
usual boot disk (a Micropolis 4110) from the bus entirely, because I
have no other SCSI disk big enough to build a comfortable system on.
The 4110 is now working quite well with a BT946, and in the past has
worked just fine on a BT445 and an Adaptec 1742.

Any ideas, anyone?

Thor