Subject: kern/3021: removable disk UNIT ATTENTION requires reboot!
To: None <gnats-bugs@gnats.netbsd.org>
From: John F. Woods <jfw@jfwhome.funhouse.com>
List: netbsd-bugs
Date: 12/11/1996 23:07:34
>Number:         3021
>Category:       kern
>Synopsis:       removable disk UNIT ATTENTION requires reboot!
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Dec 12 17:20:01 1996
>Last-Modified:
>Originator:     John F. Woods
>Organization:
Misanthropes-R-Us
>Release:        1.2_BETA
>Environment:
	
System: NetBSD jfwhome.funhouse.com 1.2 NetBSD 1.2 (JFW) #5: Sun Nov 3 10:33:41 EST 1996 jfw@jfwhome.funhouse.com:/usr/src/sys/arch/i386/compile/JFW i386


>Description:
I have a SyQuest EZ-135 drive.  It appears to be prone to occasional UNIT
ATTENTION events.  If one occurs while the drive is mounted, the SCSI driver
conspires to require a reboot:

Any UNIT ATTENTION on a removable drive is assumed to be a media change.
This is quite incorrect; the driver should check the Additional Sense Code
and Qualifier fields to determine if the media has changed.  (The description
of UNIT ATTENTION in the SCSI-2 document is "Indicates that the removable
medium may have been changed or the target has been reset.  See 7.9 for more
detailed information about the unit attention condition."  If you check 7.9,
you see a laundry list of events that result in UNIT ATTENTION, the last of
which is the ever-popular "Any other event occurs that requires the attention
of the initiator.")  In my particular case, the drive is reporting an ASC/Q of
29 00, or "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED".  Presumably this
was a power glitch that the SyQuest's wall-wart couldn't handle (I live in
semi-rural Massachusetts, where a few days ago a massive winter storm wiped
out power lines all over the place; they're STILL fixing them).  I suppose
that if someone turns off the power on a removable device, there's no
particular guarantee that they won't remove the cartridge (though with the
SyQuest you have to use a paper clip to engage the release mechanism...), but
it's certainly not a direct indication that the medium has changed.

I would argue that only ASC/Q 28 00 directly implies that the medium has
changed; maybe 04 00, 04 01, and 00 00 entitle you to be suspicious that it
may have, since they are explicit declarations of ignorance by the drive.
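
To make the distinction above concrete, here is a minimal sketch of the
classification I have in mind.  It is standalone C, not driver code: the enum
and the function name ua_classify() are made up for illustration, and the
only values taken from the report are the ASC/Q pairs 28 00, 04 00, 04 01,
00 00 and 29 00.

```c
#include <stdint.h>

/* Hypothetical verdicts for a UNIT ATTENTION on a removable drive. */
enum ua_media_verdict {
	UA_MEDIA_UNCHANGED,	/* no evidence of a change, e.g. 29 00 */
	UA_MEDIA_SUSPECT,	/* drive declares ignorance: 00 00, 04 00, 04 01 */
	UA_MEDIA_CHANGED	/* 28 00: medium may have changed */
};

/* Classify by Additional Sense Code (asc) and Qualifier (ascq). */
static enum ua_media_verdict
ua_classify(uint8_t asc, uint8_t ascq)
{
	if (asc == 0x28 && ascq == 0x00)
		return UA_MEDIA_CHANGED;
	if ((asc == 0x04 && (ascq == 0x00 || ascq == 0x01)) ||
	    (asc == 0x00 && ascq == 0x00))
		return UA_MEDIA_SUSPECT;
	return UA_MEDIA_UNCHANGED;
}
```

Whether "suspect" should invalidate the medium is a policy choice; the point
is only that 29 00 by itself would not.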

But that's only half the bug -- that the driver inappropriately invalidates
the medium.  The other half is an annoying interaction between the driver
and mount.

Read through sd.c and note that once the medium is marked as not loaded,
further read and write requests are refused until the open count of the device
goes to zero.  A mounted filesystem is, of course, represented by an open of
the device; and if you check the unmount code, you find that if it can't write
metadata, it won't unmount the disk.  No unmount, no close, no writes;
no writes, no unmount, no close:  nice vicious loop there.
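
The loop can be shown with a toy model.  This is not the real sd.c or unmount
code; every name below (toy_disk, toy_write, toy_close, toy_unmount) is
invented, and it assumes, as a simplification, that the last close is what
lets the medium be re-validated.

```c
#include <stdbool.h>

/* Toy model only: one flag and one counter stand in for the driver state. */
struct toy_disk {
	bool media_loaded;	/* cleared when a UNIT ATTENTION arrives */
	int  open_count;	/* a mounted filesystem holds one open */
};

/* Writes are refused while the medium is marked not loaded. */
static bool
toy_write(const struct toy_disk *d)
{
	return d->media_loaded;
}

/* The last close is the only thing that clears the "not loaded" state. */
static void
toy_close(struct toy_disk *d)
{
	if (d->open_count > 0 && --d->open_count == 0)
		d->media_loaded = true;
}

/* Unmount must write metadata first; if that fails it never closes. */
static bool
toy_unmount(struct toy_disk *d)
{
	if (!toy_write(d))
		return false;
	toy_close(d);
	return true;
}
```

Starting from media_loaded == false and open_count == 1, toy_unmount() fails
every time and open_count never drops, which is the reboot-only state
described above.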

The umount force flag (-f) is not enough to break this loop.

>How-To-Repeat:
Borrow my SyQuest drive, mount it, let it sit idle with mildly flaky power.
Or just read lots of code knowing what the problem is supposed to be.

>Fix:
I don't have any concrete fixes.  I would urge that the test in
scsi_interpret_sense be changed to something like

	if ((sc_link->flags & SDEV_REMOVABLE) != 0 &&
	    sense->extra_len >= 6 &&
	    (sense->add_sense_code == 0x28 ||
	     sense->add_sense_code == 0x04 ||
	     sense->add_sense_code == 0x00) &&
	    sense->add_sense_code_qual == 0x00)
		sc_link->flags &= ~SDEV_MEDIA_LOADED;
 
(though I think in my source I'll just test for != 0x29)

As to the interaction between unmount and media changes, that's a lot harder.
What would be really cool would be for the partition map to contain a random
number that would be almost certain to be different for different media, which
the driver could attempt to use to determine if the media was changed or if
the drive just cycled (come on, MacOS can hack it, even (I think) recent
Microsoft operating systems can hack it, why can't we?).  Failing that, the
force-umount flag should be enhanced to abandon a mount even in the face of
write errors (or a second umount-DAMMIT flag added, since I assume the force-
umount flag is aimed at umounting in the face of recalcitrant users, not
recalcitrant hardware).
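
A rough sketch of the random-number idea, for illustration only.  No such
field exists in the partition map today as far as I know; the struct
toy_label, its media_id member, and recheck_medium() are all hypothetical.
The assumption is that the driver caches the ID at first open and re-reads it
after a UNIT ATTENTION.

```c
#include <stdint.h>

/* Hypothetical: a random ID written into the on-disk label at creation. */
struct toy_label {
	uint32_t media_id;	/* 0 means "no ID recorded" in this sketch */
};

enum recheck_result {
	SAME_MEDIUM,		/* the drive just cycled; keep the mount */
	DIFFERENT_MEDIUM	/* treat as a real media change */
};

/* Compare the ID cached at open time with the one re-read from the disk. */
static enum recheck_result
recheck_medium(const struct toy_label *cached, const struct toy_label *reread)
{
	/* Be conservative: without an ID on both sides, assume a change. */
	if (cached->media_id == 0 || reread->media_id == 0)
		return DIFFERENT_MEDIUM;
	return cached->media_id == reread->media_id ?
	    SAME_MEDIUM : DIFFERENT_MEDIUM;
}
```

A 32-bit ID is an arbitrary choice here; a wider one would make accidental
collisions between cartridges less likely.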
>Audit-Trail:
>Unformatted: