Subject: Re: bufcache cancer in -current?
To: None <frank@wins.uva.nl, smd@ebone.net>
From: Sean Doran <smd@ebone.net>
List: current-users
Date: 05/23/2000 03:15:16
Incidentally, the suffering partition is on

sd1 at scsibus2 target 6 lun 0: <QUANTUM, QM318000TD-SW, N491> SCSI2 0/direct fixed
sd1: 17366 MB, 8057 cyl, 20 head, 220 sec, 512 bytes/sect x 35566500 sectors

and this drive is one of those wonderful Quantums that have tag
queueing woes.  The symptom under the old driver: exhaust tags on
disk, disk gives QUEUE FULL condition, and drive never reconnects.
Result: hung disk, no unhang until reboot.

It strikes me that the sync loop and the changing block sizes
would affect the time-of-hang in the old driver in much the same
way as the time-of-corruption.

Is it possible that the new driver recovers from QUEUE FULL + non-
reconnect in a way that triggers lossage?

	Sean.

P.S.: Oh, I found some ancient stuff from Justin Gibbs (27 May 1998):

>The Atlas II has a tendency to return QUEUE FULL status when its write 
>cache fills.  Julian E. derived SCSI subsystems simply don't have a clean
>way to stop the queue of transactions to the device and requeue a transaction
>that returns BUSY or QUEUE FULL status for a retry.  Instead, the code,
>while in an interrupt context, sends the transaction again immediately.  This
>simply won't work well with the Atlas II regardless of the firmware level
>(LKY8 is better than LXY4, but it doesn't make the problem go completely
>away) and is less than ideal for other drives too.  The correct algorithm
>is to stop the flow of transactions to the device when a QUEUE FULL condition
>occurs, lower the tag count to the number of currently active transactions,
>and release the queue of transactions once a command completes.  For drives
>like the Atlas II that return queue full for temporary resource shortages,
>you need a quirk entry that prevents the count from going too low.
>
>Now, you can kludge up the controller driver to do the re-queue, but rather
>than fix it for only one driver when so many controllers have the potential
>to support tagged queuing (bt, ahb, aic, uha, etc.), I added support in
>the CAM SCSI layer for properly dealing with these kinds of issues.  So,
>even the FreeBSD-current version of the aic7xxx driver, although it has
>many bug fixes the NetBSD driver does not, will still suffer from this
>problem.  The only fix I know of for these drives that is available today,
>is to run CAM under either FreeBSD-current or FreeBSD-stable.
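
For reference, the throttling algorithm described in the quoted text can be sketched roughly as follows. This is a toy model in Python, not kernel code, and it is not taken from CAM or from any NetBSD driver: the names TaggedDevice, openings, min_tags, submit, and complete are all hypothetical, and min_tags merely stands in for the quirk entry that keeps the tag count from going too low.

```python
from collections import deque

GOOD = "GOOD"
QUEUE_FULL = "QUEUE FULL"


class TaggedDevice:
    """Toy model of per-device tagged-queue state (all names are hypothetical)."""

    def __init__(self, max_tags, min_tags=2):
        self.openings = max_tags  # current limit on outstanding tagged commands
        self.min_tags = min_tags  # quirk floor: never throttle below this
        self.active = 0           # commands currently issued to the device
        self.frozen = False       # True while the flow of commands is stopped
        self.pending = deque()    # commands waiting to be issued

    def submit(self, cmd):
        """Queue a command; return the list of commands actually issued."""
        self.pending.append(cmd)
        return self.run_queue()

    def run_queue(self):
        """Issue pending commands while unfrozen and below the tag limit."""
        issued = []
        while (not self.frozen and self.pending
               and self.active < self.openings):
            issued.append(self.pending.popleft())
            self.active += 1
        return issued

    def complete(self, cmd, status):
        """Handle a completion; return any commands issued as a result."""
        self.active -= 1
        if status == QUEUE_FULL:
            # The device rejected cmd. Lower the tag limit to what the device
            # is actually holding (subject to the quirk floor) and put the
            # rejected command back at the head of the queue.
            self.openings = max(self.active, self.min_tags)
            self.pending.appendleft(cmd)
            # Stop the flow until some outstanding command completes.
            # Simplification: if nothing is outstanding we retry at once;
            # a real driver would presumably wait on a timer instead.
            self.frozen = self.active > 0
        else:
            # A command finished normally, so release the queue.
            self.frozen = False
        return self.run_queue()
```

The point of the sketch is that the retry is deferred rather than issued immediately from the completion path: the rejected command waits at the head of the queue until another command completes, and the lowered limit keeps the driver from overrunning the drive again.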