current-users: Odd SCSI Problems: Medium Errors

Subject: Odd SCSI Problems: Medium Errors
To: None <current-users@NetBSD.ORG>
From: Curt Sampson <curt@portal.ca>
List: current-users
Date: 07/23/1996 14:36:28
I need some advice from the SCSI gurus here. I've got a 1.1 system
with a Buslogic PCI SCSI controller (the BT-946C) and five drives
currently:

    bt0 targ 0 lun 0: <HP, C3724S, 4349> SCSI2 0/direct fixed
    sd0 at scsibus0: 1149MB, 3703 cyl, 5 head, 127 sec, 512 bytes/sec
    bt0 targ 2 lun 0: <HP, C3724S, 4349> SCSI2 0/direct fixed
    sd1 at scsibus0: 1149MB, 3703 cyl, 5 head, 127 sec, 512 bytes/sec
    bt0 targ 3 lun 0: <Quantum, XP32150, 81HB> SCSI2 0/direct fixed
    sd2 at scsibus0: 2050MB, 3907 cyl, 10 head, 107 sec, 512 bytes/sec
    bt0 targ 4 lun 0: <Quantum, XP32150, 81HB> SCSI2 0/direct fixed
    sd3 at scsibus0: 2050MB, 3907 cyl, 10 head, 107 sec, 512 bytes/sec
    bt0 targ 5 lun 0: <FUJITSU, M2952S-512, 0102> SCSI2 0/direct fixed
    sd4 at scsibus0: 2291MB, 5714 cyl, 5 head, 164 sec, 512 bytes/sec

Now the other day one of the Quantum drives, sd3, which just happens
to be part of my news spool, started producing errors of the form

    sd3(bt0:4:0): medium error, info = 3878640 (decimal),
	data = 00 00 00 00 11 00 00 00 00 00

when reading the drive. I'm getting these quite frequently (once
per minute or more), but there doesn't seem to be any correlation
in the `info' numbers (which are, I assume, the block where the
error occurred).
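
One rough way to check whether the failing blocks cluster is to map
each `info' value onto the geometry the kernel reported at boot. This
is only a sketch: it assumes `info' really is a logical block number,
and the head and sector counts are the notional ones from the probe
messages above, which need not match the physical layout of the drive.
The function name is made up for illustration.

```python
def block_to_chs(block, heads, sectors_per_track):
    """Map a logical block number onto notional (cylinder, head, sector)."""
    blocks_per_cylinder = heads * sectors_per_track
    cylinder, offset = divmod(block, blocks_per_cylinder)
    head, sector = divmod(offset, sectors_per_track)
    return cylinder, head, sector

# Geometry from the probe messages: sd3 reports 10 heads, 107 sectors.
print(block_to_chs(3878640, heads=10, sectors_per_track=107))
```

Under those assumptions the sd3 error above lands on cylinder 3624 of
3907. Collecting the cylinder numbers for a batch of errors would show
whether they sit in one region or are spread over the whole disk.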

I take it the `11' is the first byte returned by the mode sense
command. What's the rest of the data? It doesn't seem to match any
of the packet formats in my copy of _The SCSI Bus and IDE Interface_
(by Schmidt).
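
One possible reading, offered as an assumption and not as a
confirmed description of the driver: the bytes after `data =' may be
the additional sense bytes of SCSI-2 extended sense data (offset 8
onward of what REQUEST SENSE returns, not MODE SENSE output). If so,
the first four bytes are command-specific information and the next two
are the additional sense code and its qualifier. The sketch below
decodes a line on that basis; the function name and the two-entry
lookup table are illustrative only.

```python
# Assumption: the bytes after "data =" start at offset 8 of SCSI-2
# fixed-format (extended) sense data. This is not confirmed by the post.
ASC_NAMES = {
    (0x11, 0x00): "unrecovered read error",
    (0x47, 0x00): "SCSI parity error",
}

def decode_extra_sense(hex_bytes):
    """Split additional sense bytes into command info, ASC and ASCQ."""
    data = [int(b, 16) for b in hex_bytes.split()]
    if len(data) < 6:
        raise ValueError("need at least 6 additional sense bytes")
    command_info = int.from_bytes(bytes(data[0:4]), "big")
    asc, ascq = data[4], data[5]
    return {
        "command_specific_info": command_info,
        "asc": asc,
        "ascq": ascq,
        "meaning": ASC_NAMES.get((asc, ascq), "not in this small table"),
    }

print(decode_extra_sense("00 00 00 00 11 00 00 00 00 00"))
```

Read this way, the `11' would be an additional sense code of 0x11 with
qualifier 0x00, which SCSI-2 lists as an unrecovered read error.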

So what does this likely mean? Is the drive really failing badly,
or is something else possibly wrong?

I thought the former until, a day later, an HP drive (sd1) started
doing something similar:

    Jul 23 12:25:11 thoth /netbsd: sd1(bt0:2:0): timed out
    Jul 23 12:30:30 thoth /netbsd: sd1(bt0:2:0): medium error,
	info = 313094 (decimal), data =
	00 00 00 00 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 60

At any rate, I installed the new Fujitsu drive to replace the Quantum
Atlas (sd3) that appeared to be dying, and I've found, oddly enough,
that it's very slow.  Both bonnie and iozone report about 1.5 MB/sec
on writes (8K blocks) and 4 MB/sec on reads. Even the 5400 RPM HP drives
manage 4 MB/sec on both, and the Quantums do about 5-6 MB/sec. The
Fujitsu claims to be a 7200 RPM, 8 ms. average seek time drive.
Is this certain to be the drive, or could it be a mismatch of
some sort between the controller and the drive that's causing these
terribly slow results? Has the driver changed significantly between
1.1 and 1.2_BETA? Should I try upgrading?
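
For a quick cross-check of the write figure on another controller, a
crude sequential-write timer along the following lines might do. It is
a sketch only, not the method bonnie or iozone use: it writes 8K
blocks to a scratch file, forces them to disk, and reports MB/sec. The
function name, file name, and block count are arbitrary choices.

```python
import os
import tempfile
import time

def write_throughput_mb_s(path, block_size=8192, blocks=1024):
    """Time sequential writes of fixed-size blocks; return MB/sec."""
    buf = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # count the time taken to reach the disk
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast or cached writes.
    return (block_size * blocks) / (1024 * 1024) / max(elapsed, 1e-9)

with tempfile.TemporaryDirectory() as d:
    rate = write_throughput_mb_s(os.path.join(d, "scratch.bin"))
    print(f"{rate:.1f} MB/sec")
```

To say anything about a particular drive, the scratch file would have
to live on a filesystem on that drive and be much larger than the
buffer cache; the 8 MB default here is only big enough to show the idea.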

I'm going to get ahold of an NCR controller to see what sort of
difference that makes, if any.

I'm open to any thoughts on all of this. It's hard to see the
simultaneous failure of two drives as just a coincidence, but
from what knowledge of SCSI I have, I can't see any way the failure
could be anywhere but the drive.

cjs

Curt Sampson    curt@portal.ca		Info at http://www.portal.ca/
Internet Portal Services, Inc.	
Vancouver, BC   (604) 257-9400		De gustibus, aut bene aut nihil.