Subject: Trouble with disk reads (writes OK)
To: None <current-users@NetBSD.ORG>
From: Sverre Froyen <sfroyen@cyberneering.com>
List: current-users
Date: 08/19/1995 21:32:33

PROBLEM:
I have a pc532 running NetBSD current (as of August 13).  On its
SCSI bus, I have a Seagate 1GB drive on ID 0, a Teac floppy drive
on ID 2, and a Syquest removable hard drive on ID 3. ID 2 is
empty.  The Seagate and Teac appear to work OK, but the Syquest
occasionally returns bad data.  Writes to the Syquest appear to
work fine.

FACTS (sort of):
My initial investigation shows that the read errors are not driver
related.  My tests, which involve reading the same file again and
again, show that the block containing the bad data is never read.
By this I mean that the SCSI driver receives no command to read
that block.  Also, on good transfers, the preceding read is always
one block longer than on the reads that fail.  Thus, it looks like
the preceding read should have been one block longer.
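
The repeated-read test can be approximated in user space.  The
following is a minimal sketch, not the procedure used for the tests
above: it compares a known-good copy of a file with the copy on the
suspect disk and reports which 4096-byte blocks differ (4096 is the
block size given in the trace legend).  The function name
mismatched_blocks is hypothetical; the paths csh and /mnt/csh are the
ones from the cmp command in the trace header.

```python
import sys


def mismatched_blocks(ref_path, test_path, block_size=4096):
    """Return indices of block_size-byte blocks that differ between two files."""
    bad = []
    with open(ref_path, "rb") as ref, open(test_path, "rb") as test:
        index = 0
        while True:
            a = ref.read(block_size)
            b = test.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)  # also catches a length mismatch at EOF
            index += 1
    return bad


if __name__ == "__main__":
    # Example: python blockcmp.py csh /mnt/csh
    ref_path, test_path = sys.argv[1], sys.argv[2]
    for blk in mismatched_blocks(ref_path, test_path):
        print("block %d differs (byte offset %d)" % (blk, blk * 4096))
```

Repeated passes may be satisfied from the buffer cache rather than
the disk, so unmounting and remounting the file system between passes
may be needed to force real reads.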

SPECULATION:
This leads me to suspect the code in vfs_cluster.c.  It looks like the
cluster code concatenates sequential blocks into a single read operation
and somehow miscalculates the length.  Note that the error is intermittent,
so it is probably related to some piece of code that should be protected
from interrupts but is not.
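
To make the suspected failure mode concrete, here is a toy model of
read clustering.  It is only an illustration and is not taken from
vfs_cluster.c: the names coalesce and covered, and the limit of four
blocks per request (matching the largest 16384-byte transfer in the
trace), are assumptions.  It merges consecutive block numbers into
(start, count) requests and shows that a request whose count is one
too small leaves exactly one block unread.

```python
def coalesce(blocks, max_blocks=4):
    """Merge sorted block numbers into (start, count) read requests."""
    requests = []
    for blk in blocks:
        if requests:
            start, count = requests[-1]
            # Extend the last request if blk follows it directly
            # and the request has not reached the size limit.
            if blk == start + count and count < max_blocks:
                requests[-1] = (start, count + 1)
                continue
        requests.append((blk, 1))
    return requests


def covered(requests):
    """Return the set of block numbers fetched by the given requests."""
    return {start + i for start, count in requests for i in range(count)}


if __name__ == "__main__":
    wanted = list(range(8))
    good = coalesce(wanted)
    # Hypothetical faulty result: the first request is one block short.
    short = [(0, 3), (4, 4)]
    print("good requests: ", good)
    print("missing blocks:", sorted(set(wanted) - covered(short)))
```

In this model the block after a short request is never asked for,
which is the same pattern as in the trace: no command is issued for
the missing block.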

MORE FACTS:
A further piece of evidence is that sometimes (but not always) the
read errors cause (?) a "panic: cluster_rbuild: too much memory".

Below I have attached a disk activity trail that illustrates the
error.  I'll keep looking into this problem but I would appreciate
any bug fixes, ideas, or other information that might help me.

Thanks,

Sverre Froyen

##################################################################
(abbor.nrel.gov): cmp -l csh /mnt/csh | head
##################################################################
# SCSI transfer from Syquest that fails.
# Legend:
#	command -- command phase entered
#	msgin   -- message in phase entered, with result
#	data    -- data-in phase entered, with three bytes of data
#	           at start of every 4096 byte block, length before
#	           and after call to transfer_pdma, scsi id, and
#	           phase after transfer.
##################################################################
command-msgin = 02
        msgin = 04
        data = 00 0211 01 ... len = 4096/0, id = 3, phase = 8
status-msgin = 00
command-data = 0337 0330 0300 ... len = 4096/0, id = 3, phase = 8
status-msgin = 00
command-msgin = 02
        msgin = 04
        data = 0212 0144 032 ... len = 8192/0, id = 3, phase = 8
        data = 0351 0134 00 ...
status-msgin = 00
command-data = 0266 0374 0137 ... len = 12288/0, id = 3, phase = 8
        data = 0146 0145 0162 ...
        data = 02 0300 02 ...
status-msgin = 00
command-data = 012 054 0347 ... len = 16384/0, id = 3, phase = 8
        data = 035 030 012 ...
        data = 0147 0327 0245 ...
        data = 014 0177 0245 ...
status-msgin = 00
command-data = 037 0300 0263 ... len = 4096/0, id = 3, phase = 8
status-msgin = 00
command-data = 030 030 0127 ... len = 16384/0, id = 3, phase = 8
        data = 00 07 05 ...
        data = 02 0312 024 ...
        data = 035 030 0243 ...
status-msgin = 00
command-data = 0141 0143 0153 ... len = 12288/0, id = 3, phase = 8
        data = 07 035 00 ...
        data = 010 025 0120 ...
status-msgin = 00
##################################################################
# One 4096-byte block is missing here.  In transfers that are OK, the
# preceding request is 4096 bytes longer, that is, 16384 bytes.
# In several tests this (a shorter preceding request) seems to be the case.
##################################################################
command-data = 0227 0300 0140 ... len = 16384/0, id = 3, phase = 8
        data = 0143 030 07 ...
        data = 00 00 054 ...
        data = 00 012 0203 ...
status-msgin = 00
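
For comparing many such trails, the request lengths can be pulled out
mechanically.  The sketch below is an assumption about the format,
based only on the lines shown above: it treats every line containing
"len = N/M, id = ..., phase = ..." as the start of a transfer and
counts the "data =" sample lines that follow it, which per the legend
should number one per 4096-byte block.  The name summarize is
hypothetical.

```python
import re

# Matches e.g. "len = 12288/0, id = 3, phase = 8"
LEN_RE = re.compile(r"len = (\d+)/(\d+), id = (\d+), phase = (\d+)")
BLOCK_SIZE = 4096  # bytes per block, from the trace legend


def summarize(trace_text):
    """Return (length, blocks, samples) for each transfer in a trace."""
    transfers = []
    for line in trace_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment and separator lines
        match = LEN_RE.search(line)
        if match:
            # The line carrying "len =" also holds the first data sample.
            transfers.append([int(match.group(1)), 1])
        elif transfers and line.strip().startswith("data ="):
            transfers[-1][1] += 1  # one more per-block sample
    return [(n, n // BLOCK_SIZE, samples) for n, samples in transfers]


if __name__ == "__main__":
    import sys
    for length, blocks, samples in summarize(sys.stdin.read()):
        flag = "" if blocks == samples else "  <-- sample count mismatch"
        print("%6d bytes  %d blocks  %d samples%s"
              % (length, blocks, samples, flag))
```

Running this over a good trail and a failing trail and comparing the
two lists of lengths should show where a request is one block short.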