tech-userlevel: Re: SCSI device tuning tool

Subject: Re: SCSI device tuning tool
To: NetBSD Userlevel Technical Discussion List <tech-userlevel@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 01/20/2001 21:33:20
[ On Sunday, January 14, 2001 at 18:22:15 (-0500), der Mouse wrote: ]
> Subject: Re: SCSI device tuning tool
>
> > A surprising number of disks come from the manufacturer without
> > bad-sector reallocation enabled by default [...] and I still think
> > the NetBSD SCSI driver should always force AWRE and ARRE to be on all
> > of the time (except maybe for AWRE if the device is only open
> > read-only),
> 
> I don't.  If for any reason I have a disk set to no auto reallocation,
> I do not want the driver to take it upon itself to assume that's a
> mistake on my part.

Sorry, but if the driver's not explicitly going to handle the errors in
the best possible way then it should endeavour to make sure that they
are handled at the lower level!  I think the default behavour MUST be
the safe behaviour.  (See below for non-default alternatives!)

>  In particular, that would make it impossible to do
> diagnostic surface scans with the likes of sdd (or at least it would be
> much less useful to try to).

As I've already hinted I think the driver should only turn these bits on
when a filesystem is mounted (I should have used that more correct word
rather than "open").

Given that logic there won't be anything to prevent you from doing a
"passive" analysis.  Note that if you're doing surface scans with a
filesystem mounted then something's wrong!

> Printing a warning, fine.  Changing it, not fine.

I may be mistaken but I was under the distinct impression that all SCSI
disks returned a warning when they "fixed" a block anyway, so this is
presumably already done.  Certainly if there's a hard error on read then
even if ARRE is enabled the data is lost and the driver's got to fail
the read request.  However, even when the data has been reconstructed
from parity/ECC, safely reallocated, and returned, I've still seen the
errors reported.  For example (judging by the 0x2a opcode in the CDB,
this one is from a write):

Oct 25 22:11:06 proven /netbsd: sd0(ahc0:0:0):  Check Condition on CDB: 0x2a 00 00 87 7c 46 00 00 02 00
Oct 25 22:11:07 proven /netbsd:     SENSE KEY:  Recovered Error
Oct 25 22:11:07 proven /netbsd:    INFO FIELD:  8879174
Oct 25 22:11:08 proven /netbsd:  COMMAND INFO:  400753985 (0x17e30541)
Oct 25 22:11:08 proven /netbsd:      ASC/ASCQ:  ASC 0x03 ASCQ 0xa4
Oct 25 22:11:08 proven /netbsd:          SKSV:  Actual Retry Count: 30

or this one, from a read (opcode 0x28):

Dec 16 14:13:11 proven /netbsd: sd0(ahc0:0:0):  Check Condition on CDB: 0x28 00 00 85 10 2c 00 00 02 00
Dec 16 14:13:11 proven /netbsd:     SENSE KEY:  Recovered Error
Dec 16 14:13:11 proven /netbsd:    INFO FIELD:  8720428
Dec 16 14:13:11 proven /netbsd:  COMMAND INFO:  390463496 (0x17460008)
Dec 16 14:13:11 proven /netbsd:      ASC/ASCQ:  ASC 0x17 ASCQ 0xc1
Dec 16 14:13:11 proven /netbsd:          SKSV:  Actual Retry Count: 25


(I've been carefully watching that disk and hoping it doesn't go real
bad all at once!)
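
As a cross-check on the two log excerpts above, here is a minimal C
sketch that decodes a 10-byte CDB as the kernel prints it.  It assumes
the standard READ(10)/WRITE(10) layout (opcode in byte 0, big-endian
LBA in bytes 2-5, transfer length in bytes 7-8).  The struct and
function names are made up for illustration and are not taken from the
NetBSD sources.

```c
#include <stdint.h>

/* Fields of a 10-byte READ(10)/WRITE(10) command descriptor block. */
struct cdb10 {
	uint8_t  opcode;   /* 0x28 = READ(10), 0x2a = WRITE(10) */
	uint32_t lba;      /* logical block address, bytes 2..5 */
	uint16_t nblocks;  /* transfer length in blocks, bytes 7..8 */
};

/* Hypothetical helper: unpack the big-endian fields of a 10-byte CDB. */
static struct cdb10
decode_cdb10(const uint8_t cdb[10])
{
	struct cdb10 d;

	d.opcode = cdb[0];
	d.lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	    ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	d.nblocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
	return d;
}

/* Hypothetical helper: name the two opcodes seen in the logs. */
static const char *
cdb10_name(uint8_t opcode)
{
	switch (opcode) {
	case 0x28:
		return "READ(10)";
	case 0x2a:
		return "WRITE(10)";
	default:
		return "other";
	}
}
```

Fed the first CDB (2a 00 00 87 7c 46 00 00 02 00) this gives WRITE(10),
LBA 8879174, 2 blocks, and the LBA matches the INFO FIELD in that log
entry.  The second CDB (28 00 00 85 10 2c 00 00 02 00) decodes to
READ(10) at LBA 8720428, again matching its INFO FIELD.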

If you mean to print a warning when the ARRE and AWRE bits are not set
the way one would want them for a production system, then yes, I think
that should be the minimum requirement, but on its own it's not safe
enough.

Alternatively there could be a driver flag set at compile time or with
gdb (and perhaps at run-time through a sysctl) that could disable the
automatic enabling of the ARRE and AWRE bits -- this way everyone would
have the default "safe" conditions and yet hardware hackers could
disable it as necessary.
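
To make the proposal concrete, here is a rough C sketch of the policy
described above.  It is not existing driver code: the function name
and its parameters are hypothetical, and the only thing taken from the
SCSI spec is that AWRE and ARRE are bits 7 and 6 of byte 2 of the
Read-Write Error Recovery mode page (page code 0x01).  The sketch
leaves the bits alone when nothing is mounted, asks for ARRE on any
mount and AWRE only on a read-write mount, and honours an opt-out flag
of the kind a compile-time option or sysctl might provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte 2 of the Read-Write Error Recovery mode page (page 0x01). */
#define ERP_AWRE 0x80	/* automatic write reallocation enabled */
#define ERP_ARRE 0x40	/* automatic read reallocation enabled */

/*
 * Hypothetical policy function.  Given the current value of byte 2,
 * return the value the driver would want to write back.
 *
 *   mounted       a filesystem is mounted from this disk
 *   read_only     that mount is read-only
 *   force_enable  false when the administrator has opted out
 *   warn          set to true if a wanted bit was found clear
 */
static uint8_t
erp_flags_for_mount(uint8_t current, bool mounted, bool read_only,
    bool force_enable, bool *warn)
{
	uint8_t wanted;

	*warn = false;

	/* Raw access (surface scans and the like): hands off. */
	if (!mounted)
		return current;

	wanted = ERP_ARRE;
	if (!read_only)
		wanted |= ERP_AWRE;

	/* Always report a disk that is missing a bit we would want. */
	if ((current & wanted) != wanted)
		*warn = true;

	/* Only change the bits when the default policy is in force. */
	if (force_enable)
		return current | wanted;
	return current;
}
```

With this, a disk shipped with both bits clear would get 0xc0 on a
read-write mount and 0x40 on a read-only mount, while the opt-out case
keeps the warning but changes nothing.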

In some ways this is sort of the same kind of issue as the one of
whether or not to force the system to wait for RAID parity/mirror
rebuilds before allowing writes to occur.  Just as with any other
"security" issue, I think integrity must take a first-row "always on by
default" position and those who don't like it that way can be free to
take the risk of putting it at the back of the bus instead.


Now that I've thought about this more I believe I've only seen these
bits disabled by default on modern drives that are labeled as "AV"
(i.e. multi-media) or RAID drives where spindle sync and head sync are
critical -- reallocated blocks would break the sync and could blow
performance to shreds.  However in true NetBSD tradition there should be
clean and safe ways of using such drives for other purposes!  :-)

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>