port-mac68k: Re: SCSI question

Subject: Re: SCSI question
To: Allen Briggs <briggs@ninthwonder.com>
From: Donald Lee <donlee_68k@icompute.com>
List: port-mac68k
Date: 01/10/2000 23:31:38
>> Can anyone tell me what the following messages mean?
>> Jan  9 23:47:58 charm /netbsd: dmaintr: discarded 13 b (last transfer was 1534 b).
>> Jan  9 23:47:58 charm /netbsd: esp0: !TC on DATA XFER [intr 10, stat 83, step 4] prevphase 0, resid 5fe
>
>This means, basically, that the esp (Quadra SCSI) driver detected a data
>underrun.  For some reason, the drive ended the transaction early--either
>it got some sort of error or it thought it was finished or something.  I
>don't know what causes this.  A SCSI bus analyzer would make it somewhat
>easier to track down, but I don't have access to any such beast.
>
>> When I stress it, it seems to corrupt data, and then eventually the kernel
>> panics.
>
>Does it seem to corrupt data when you don't get the messages?

Well, it's hard to say.

I ignored the first few of these errors, because they didn't correspond
to any user-visible errors.  Kernels sometimes whine about stuff that has
little to do with the real world, if you know what I mean.. ;->

Eventually, though, I had other problems, so I ran a bunch of experiments.

The most reliable way to reproduce these seemed to be to move a lot of
data.  "cp -r /usr <drive-with-problems>" would give me lots of them, and
after doing this (and interrupting it partway through, I'm impatient)
I could do an fsck and find lots of errors.

Later I settled on simply doing a "newfs", and I found that unless I
cranked up the "-i" parameter, I could reproduce the problem with

	newfs /dev/sd1g
	fsck -f /dev/sd1g

The newfs would give me the kernel messages, and the fsck would give me
at least one error (near the end, where it was updating the bitmaps.)
When I said "y", it would give me the error.  I could run the
fsck as many times as I liked, and I'd get the same error from fsck, and
the same message on the console at the same time.
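
A minimal sketch of that repro as a script, for anyone who wants to try
it on their own hardware.  DEVICE matches the partition above; RUN and
PASSES are hypothetical knobs added for illustration, and by default the
script only prints the commands instead of running them (newfs destroys
whatever is on the partition).

```shell
#!/bin/sh
# Sketch of the newfs + fsck repro described above.
# RUN and PASSES are illustrative assumptions, not part of the original test.
DEVICE="${DEVICE:-/dev/sd1g}"
RUN="${RUN-echo}"        # dry run by default; RUN= really executes (wipes DEVICE)
PASSES="${PASSES:-3}"    # how many times to repeat the fsck

repro() {
    # Recreate the filesystem; this is where the kernel messages showed up.
    $RUN newfs "$DEVICE" || return 1
    # Repeat the forced check; the same error came back on every pass.
    i=1
    while [ "$i" -le "$PASSES" ]; do
        $RUN fsck -f "$DEVICE"
        i=$((i + 1))
    done
}

repro
```

Watch the console (or /var/log/messages) while it runs to see whether
the esp0 messages line up with the fsck errors.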

I experimented with the "-b" and "-i" params on newfs to get it to run
faster (LOTS faster) and to try to avoid errors.  I found that as I cranked
these up to "newfs -b 8192 -i 524288 /dev/sd1g" I could get through an fsck
cleanly, but I'd still get the odd kernel message.
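
A rough sketch of why a bigger "-i" probably speeds newfs up so much:
"-i" is the number of bytes of data space per inode, so raising it cuts
the number of inodes newfs has to write out.  The 1 GiB partition size
and the 4096 baseline are assumptions for illustration only, and the
real newfs rounds per cylinder group, so treat the numbers as estimates.

```shell
#!/bin/sh
# Estimate inode count and inode-table size for a given newfs -i value.
PART_BYTES=$((1024 * 1024 * 1024))   # hypothetical 1 GiB partition (assumption)
DINODE_BYTES=128                      # size of one on-disk FFS inode

inode_estimate() {
    # $1 = bytes of data space per inode (the newfs -i value)
    inodes=$((PART_BYTES / $1))
    table_kib=$((inodes * DINODE_BYTES / 1024))
    printf '%s\n' "-i $1: ~$inodes inodes, ~$table_kib KiB of inode tables"
}

inode_estimate 4096      # assumed baseline density, for comparison
inode_estimate 524288    # the value used above
```

Under those assumptions the inode tables shrink from about 32 MiB to
about 256 KiB, which means far fewer writes for newfs and much smaller
structures for fsck to walk.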

I was theorizing that the bitmaps were too large, and possibly
overwriting something in the kernel, but that appears to be
wrong.  It's quite possible that the Quantum firmware is simply sick.
It's been known to happen.

I may be able to drum up a SCSI analyzer and look at it.
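
One thing that may help when lining up analyzer traces with the log: the
"resid 5fe" in the esp0 message appears to be hexadecimal, and it
matches the "last transfer was 1534 b" in the dmaintr message.  A quick
check (the variable names are just for illustration):

```shell
#!/bin/sh
# Convert the hex "resid" value from the esp0 message to decimal.
resid_hex=5fe
resid_dec=$((0x$resid_hex))
echo "resid 0x$resid_hex = $resid_dec bytes"
```

That prints "resid 0x5fe = 1534 bytes", so the two messages seem to be
describing the same transfer.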

Can you tell me how to get the kernel to stop when it hits this
condition?  Unless I can do this, it will be hard to analyze.

I'll let you know what I find.

-dgl-