Subject: Re: bufcache cancer in -current?
To: None <email@example.com, firstname.lastname@example.org>
From: Sean Doran <email@example.com>
Date: 05/23/2000 03:15:16
Incidentally, the suffering partition is on
sd1 at scsibus2 target 6 lun 0: <QUANTUM, QM318000TD-SW, N491> SCSI2 0/direct fixed
sd1: 17366 MB, 8057 cyl, 20 head, 220 sec, 512 bytes/sect x 35566500 sectors
and this drive is one of those wonderful Quantums that have tag
queueing woes. The symptom under the old driver: exhaust tags on
disk, disk gives QUEUE FULL condition, and drive never reconnects.
Result: hung disk, no unhang until reboot.
It strikes me that the sync loop and the changing block sizes
would affect the time-of-hang in the old driver in much the same
way as the time-of-corruption.
Is it possible that the new driver recovers from QUEUE FULL + non-
reconnect in a way that triggers lossage?
P.S.: Oh, I found some ancient stuff from Justin Gibbs (27 May 1998):
>The Atlas II has a tendency to return QUEUE FULL status when its write
>cache fills. Julian E. derived SCSI subsystems simply don't have a clean
>way to stop the queue of transactions to the device and requeue a transaction
>that returns BUSY or QUEUE FULL status for a retry. Instead, the code,
>while in an interrupt context, sends the transaction again immediately. This
>simply won't work well with the Atlas II regardless of the firmware level
>(LKY8 is better than LXY4, but it doesn't make the problem go completely
>away) and is less than ideal for other drives too. The correct algorithm
>is to stop the flow of transactions to the device when a QUEUE FULL condition
>occurs, lower the tag count to the number of currently active transactions,
>and release the queue of transactions once a command completes. For drives
>like the Atlas II that return queue full for temporary resource shortages,
>you need a quirk entry that prevents the count from going too low.
>Now, you can kludge up the controller driver to do the re-queue, but rather
>than fix it for only one driver when so many controllers have the potential
>to support tagged queuing (bt, ahb, aic, uha, etc.), I added support in
>the CAM SCSI layer for properly dealing with these kinds of issues. So,
>even the FreeBSD-current version of the aic7xxx driver, although it has
>many bug fixes the NetBSD driver does not, will still suffer from this
>problem. The only fix I know of for these drives that is available today,
>is to run CAM under either FreeBSD-current or FreeBSD-stable.