netbsd-help: Re: SCSI error code 0, 64, 67, 70

Subject: Re: SCSI error code 0, 64, 67, 70
To: Mark Willey <willeyma@expert.cc.purdue.edu>
From: Brian Tao <taob@io.org>
List: netbsd-help
Date: 06/16/1996 15:56:56
On Sun, 16 Jun 1996, Mark Willey wrote:
>
> So, did you add a new drive or no?  I'm confused that you say that no
> hardware has changed, but you're putting in a new drive.

    Sorry, I'll try this again.  :)  Between January 21 and June 15,
there were no hardware or software changes to the server.  Then these
lines appeared in the syslog, each one corresponding to a kernel panic
("free: multiple frees"):

Jun 15 12:22:15 nfs /netbsd: sd4(ncr1:4:0): error code 70
Jun 15 12:58:35 nfs /netbsd: sd4(ncr1:4:0): error code 0
Jun 15 13:03:30 nfs /netbsd: sd4(ncr1:4:0): error code 0

    sd4 is one of 8 SCSI devices on the server (one boot drive and one
tape drive on the first controller, 6 drives on the second
controller).  It was installed, along with the others, back in
January.  Since the problem showed up suddenly after months of
flawless operation, it looks like a hardware fault, but as I said, I
couldn't see anything abnormal on a visual inspection.

    Fearing that the drive might become completely unusable in the near
future, I installed a replacement drive on the first controller.  This
is a new drive, on a different controller from sd4, but it still gave
the following errors at the end of a newfs and when attempting to
mount sd1a:

Jun 15 16:17:27 nfs /netbsd: sd1(ncr0:2:0): error code 70
Jun 15 16:17:59 nfs /netbsd: sd1(ncr0:2:0): error code 70
Jun 15 16:17:59 nfs /netbsd: sd1(ncr0:2:0): error code 70
Jun 15 16:48:42 nfs /netbsd: sd1(ncr0:2:0): error code 70
Jun 15 16:48:44 nfs /netbsd: sd1(ncr0:2:0): error code 67 at block no. 1826690 (decimal)
Jun 15 16:48:44 nfs /netbsd: sd1(ncr0:2:0): error code 67 at block no. 1826690 (decimal)
[etc...]
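One possible reading of these numbers, offered as an assumption about the kernel's sense-printing code rather than anything the log itself states: they may be SCSI-1 "non-extended" sense byte 0 with its Valid bit (0x80) masked off and printed in decimal. In that format, bits 6-4 are the error class and bits 3-0 the error code, with class 7 reserved for extended sense. Under that reading, 64, 67 and 70 are class 4, codes 0, 3 and 6; classes 0 through 6 are generally described as vendor-unique, so only the drive vendor's documentation could say what they mean. The "at block no." value would then be the 21-bit logical block address carried in bytes 1-3. A minimal Python sketch of that decoding (the function names are hypothetical):

```python
def split_nonextended_code(code):
    """Split a logged 'error code' into (error class, error code).

    Assumes `code` is SCSI-1 sense byte 0 with the Valid bit (0x80)
    already masked off, as the kernel is assumed to have printed it.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit value")
    # Bits 6-4 hold the error class, bits 3-0 the error code.
    return (code >> 4) & 0x7, code & 0xF


def nonextended_lba(sense):
    """Return the 21-bit logical block address in bytes 1-3 of
    SCSI-1 non-extended sense data (byte 1 bits 4-0 are the MSBs)."""
    return ((sense[1] & 0x1F) << 16) | (sense[2] << 8) | sense[3]


# The values seen in the syslog excerpts above.
for logged in (0, 64, 67, 70):
    err_class, err_code = split_nonextended_code(logged)
    print(f"error code {logged:2d} = 0x{logged:02x}: class {err_class}, code {err_code}")
```

For example, 70 is 0x46, i.e. class 4, code 6.  Hypothetical sense bytes C3 1B DF 82 would decode to class 4, code 3 at block 1826690, the same shape as the "error code 67 at block no. 1826690" lines; those bytes are illustrative, not read from the drive.  Note that 1826690 fits in 21 bits (the maximum is 2097151), which is at least consistent with this reading.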

    It just occurred to me that AWRE and ARRE (automatic write and read
reallocation) may not have been enabled at the factory on this drive.
How do I turn them on from NetBSD?  'man -k scsi' did not turn up any
commands I could use.
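For reference, AWRE and ARRE are bits 7 and 6 of byte 2 of the Read-Write Error Recovery mode page (page code 01h).  Turning them on means reading the page with MODE SENSE, setting the bits, clearing the PS (parameters savable) bit in byte 0, and writing the page back with MODE SELECT, with the SP bit set if the change should survive a power cycle.  A careful tool would also first check the "changeable values" form of the page to confirm the drive lets these bits be changed.  The Python sketch below shows only the byte manipulation; it does not answer which NetBSD utility, if any, can issue the commands, and the function name is hypothetical.

```python
RW_ERROR_RECOVERY_PAGE = 0x01  # Read-Write Error Recovery mode page code
AWRE = 0x80  # byte 2, bit 7: automatic write reallocation enabled
ARRE = 0x40  # byte 2, bit 6: automatic read reallocation enabled


def enable_auto_reallocation(page):
    """Return a copy of a Read-Write Error Recovery mode page with AWRE
    and ARRE set, suitable for sending back with MODE SELECT.

    Assumes `page` starts at the page header (byte 0 = PS bit + page
    code), i.e. the mode parameter header and any block descriptors
    have already been stripped off by the caller.
    """
    if len(page) < 3 or page[0] & 0x3F != RW_ERROR_RECOVERY_PAGE:
        raise ValueError("not a Read-Write Error Recovery page")
    out = bytearray(page)
    out[0] &= 0x7F          # PS is reserved (must be zero) in MODE SELECT
    out[2] |= AWRE | ARRE   # leave the other recovery bits untouched
    return bytes(out)
```

Given a hypothetical page as returned by MODE SENSE, `81 0A 04 08 00 00 00 00 08 00 00 00`, the result begins `01 0A C4`: PS cleared, and 0x04 (the PER bit) kept alongside the two new bits.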

    Anyhow, I was able to mount sd1a later that evening after
power-cycling the machine again.  I was able to fill the filesystem by
dd'ing /dev/zero to a file, without any reported errors.  I then
dump/restored the filesystem from the old drive to the new one.  There
was an error reported on the old drive, but it did not cause a panic:

Jun 15 22:22:00 nfs /netbsd: sd5(ncr1:4:0): medium error, info = 6451681 (decimal), data = 00 00 00 00 11 00 00 80 00 09
Jun 15 22:22:01 nfs /netbsd: sd5(ncr1:4:0): medium error, info = 6451681 (decimal), data = 00 00 00 00 11 00 00 80 00 09
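These lines are in the extended (fixed-format) sense layout, which is easier to read.  Assuming the bytes after "data =" are bytes 8 onward of the sense data (10 bytes, matching a typical additional-length value of 0x0A), the fifth and sixth bytes are the ASC/ASCQ pair 11h/00h, "unrecovered read error", and the last three are the sense-key-specific field; for a medium error with the SKSV bit set that field is, as far as the SCSI-2 layout goes, the actual retry count (9 here).  "info = 6451681" would then be the information field, i.e. the failing logical block.  A minimal Python sketch under those assumptions (function and table names are hypothetical, and the table is deliberately tiny):

```python
# A couple of additional sense code / qualifier pairs; not exhaustive.
ASC_NAMES = {
    (0x00, 0x00): "no additional sense information",
    (0x11, 0x00): "unrecovered read error",
}


def decode_extra_sense(data):
    """Decode the bytes logged after 'data =', assuming they are bytes
    8 onward of fixed-format (extended) SCSI sense data."""
    if len(data) < 6:
        raise ValueError("need at least the ASC and ASCQ bytes")
    asc, ascq = data[4], data[5]  # sense bytes 12 and 13
    result = {
        "cmd_specific_info": int.from_bytes(data[0:4], "big"),  # bytes 8-11
        "asc": asc,
        "ascq": ascq,
        "description": ASC_NAMES.get((asc, ascq), "see the ASC/ASCQ tables"),
    }
    # Byte 15 bit 7 is SKSV; for a medium error, bytes 16-17 are then
    # the actual retry count.
    if len(data) >= 10 and data[7] & 0x80:
        result["retry_count"] = int.from_bytes(data[8:10], "big")
    return result


logged = bytes.fromhex("00 00 00 00 11 00 00 80 00 09")
print(decode_extra_sense(logged))
```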

    The filesystem was migrated over cleanly, sd5 was taken out of
/etc/fstab, sd1 mounted in its place, and the machine rebooted.  So
far, the server has been up for 16.5 hours without incident (compared
to ~1 uptimes with the bad drive).

> My advice is to check for SCSI ID conflicts, termination, etc.

    There were no ID conflicts, termination was set up properly, the
total bus length is under a meter, etc.  It was the errors reported
on the new drive that threw me off, but those seem to have cleared
up, *shrug*.  I would still like to know what those error codes
mean, though.

> The disklabel may be different.  Possibly, the fact that the fs on
> NetBSD is 64-bit may cause a problem.  (random guess)

    I haven't had any interoperability problems between FreeBSD and
NetBSD filesystems yet.  The new disk was fdisk'd on a FreeBSD
machine, but disklabelled and newfs'd on the NetBSD server.
--
Brian Tao (BT300, taob@io.org, taob@ican.net)
Systems and Network Administrator, Internet Canada Corp.
"Though this be madness, yet there is method in't"