Subject: rrioerror() and RQDX3 forced error handling
To: None <port-vax@NetBSD.org>
From: Kirk Russell <kirk@ba23.org>
List: port-vax
Date: 07/03/2005 13:29:59
Hello,

I am trying to back up an RD54 drive, with
	- an RQDX3 controller in a BA123 enclosure -- KA640 CPU
	- NetBSD 1.6.2
	- write protect on during the backup
	- disklabel says the drive has 311200 blocks
	- some blocks have "data error (uncorrectable ecc) (code 8, subcode 7)"
	  errors

The "uncorrectable ecc" error always happens when I try to read the
same block.  I am going to assume this is a forced error flag that
will not be cleared until I write back to that block.  For more info, see:
	http://groups-beta.google.com/group/comp.unix.ultrix/msg/012c14d5ff0f01ec
I am not planning to write anything to this drive until I have finished
my backup.

To finish the backup, I decided to use dd's conv=noerror option to continue
processing the rest of the disk after any IO errors:
	dd if=/dev/ra0c of=rd54 bs=16b conv=noerror,sync
But read() doesn't get an IO error with these uncorrectable ecc errors --
read() appears to return zero instead, which causes dd to stop processing
the drive.  The backup is too short -- maybe dd cannot distinguish between
EOF and an uncorrectable ecc error.
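
To make the ambiguity concrete, here is a minimal sketch of a copy loop in
the spirit of conv=noerror,sync.  It is not dd's actual source; the function
name copy_noerror_sync() and its parameters are made up for illustration.
The loop can only skip a bad block if the read call fails with -1; a return
of 0 looks exactly like EOF, so the copy simply ends early:

```c
#define _XOPEN_SOURCE 700	/* request POSIX declarations such as pread() */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from dd: copy `total` bytes from in_fd
 * to out_fd in blocks of at most `bs` bytes.  A failed read is replaced
 * by a zero-filled block and the copy moves on to the next block.
 * Returns the number of zero-filled blocks, or -1 on a fatal error
 * (out of memory or a failed/short write).
 */
long
copy_noerror_sync(int in_fd, int out_fd, size_t bs, off_t total)
{
	char *buf;
	off_t off = 0;
	long bad = 0;

	assert(bs > 0);
	buf = malloc(bs);
	if (buf == NULL)
		return -1;

	while (off < total) {
		size_t want = bs;
		ssize_t n;

		if ((off_t)want > total - off)
			want = (size_t)(total - off);

		/* Explicit offset, so a failed read cannot leave us stuck. */
		n = pread(in_fd, buf, want, off);
		if (n == 0)
			break;	/* 0 means EOF; nothing more can be inferred */
		if (n < 0) {
			/* -1 is a real error: report it, pad, and carry on. */
			fprintf(stderr, "read error at offset %lld: %s\n",
			    (long long)off, strerror(errno));
			memset(buf, 0, want);
			n = (ssize_t)want;
			bad++;
		}
		if (write(out_fd, buf, (size_t)n) != n) {
			free(buf);
			return -1;
		}
		off += n;
	}
	free(buf);
	return bad;
}
```

For the RD54 above, total would be 311200 * 512 bytes (taken from the
disklabel) and bs a multiple of the 512-byte sector size.  Because the loop
tracks off against total, a caller could also notice a premature zero return
by comparing the output size with total afterwards.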

Here is an example of read() returning zero when it tries to read a
bad block:
	tinvax# dd if=/dev/ra0c of=/dev/null skip=272316
	0+0 records in
	0+0 records out
	0 bytes transferred in 8.120 secs (0 bytes/sec)
	Jul  3 12:57:57 tinvax /netbsd: ra0: drive 0 hard error datagram: unit 0: small disk error, cyl 1068: data error (uncorrectable ecc) (code 8, subcode 7)
	Jul  3 12:57:57 tinvax /netbsd: ra0: data error (uncorrectable ecc) (code 8, subcode 7)

Here is an example of an IO error that does reach dd, with the drive's
ready button off.  So it should be possible for the kernel to pass these
MSCP errors back to userland:
	tinvax# dd if=/dev/ra0c of=/dev/null skip=272316
	dd: /dev/ra0c: Input/output error
	0+0 records in
	0+0 records out
	0 bytes transferred in 0.360 secs (0 bytes/sec)
	Jul  3 12:59:39 tinvax /netbsd: ra0: not mounted/spun down

I made these changes to rrioerror() to get this IO error to userland:
Index: mscp_disk.c
===================================================================
RCS file: /cvsroot/src/sys/dev/mscp/mscp_disk.c,v
retrieving revision 1.30.10.1
diff -r1.30.10.1 mscp_disk.c
1000a1001,1005
> 	case M_ST_DATAERR:
> 		bp->b_flags |= B_ERROR;
> 		bp->b_error = EIO;
> 		return MSCP_DONE;
>

With this kernel, dd gets an IO error on the bad block, so the backup can
now continue past such blocks with conv=noerror,sync:
	tinvax# dd if=/dev/ra0c of=/dev/null skip=272316
	dd: /dev/ra0c: Input/output error
	0+0 records in
	0+0 records out
	0 bytes transferred in 8.260 secs (0 bytes/sec)
	Jul  3 13:14:55 tinvax /netbsd: ra0: drive 0 hard error datagram: unit 0: small disk error, cyl 1068: data error (uncorrectable ecc) (code 8, subcode 7)

Is there another way to clear this RQDX3 forced error?  Does it make sense
for rrioerror() to pass more IO errors back to userland?

-- 
Kirk Russell            <kirk@ba23.org>            http://www.ba23.org/
Bridlewood Software Testers Guild                  Ottawa Ontario Canada