Port-mips archive
Re: scsi timeouts on sgimips indy + zuluscsi
On Tue, 18 Nov 2025 at 18:14, Warner Losh <imp%bsdimp.com@localhost> wrote:
>
>
> On Tue, Nov 18, 2025, 5:51 PM Adrian Chadd <adrian%freebsd.org@localhost> wrote:
>
>> hi!
>>
>> I'm working on the error handling in the wdsc driver when it's talking to
>> this ZuluSCSI SD-card SCSI emulator. (In parallel I'm trying to figure out
>> if I can configure the ZuluSCSI to hang less.)
>>
>> Anyway, I'm looking for someone who has any idea about ye olde scsi
>> chipsets and the right way to handle this stuff.
>>
>> Here's an example of it hanging and not resetting:
>>
>> ...
>>
>> Status: Command failed
>> Command: /sbin/mount -rt cd9660 /dev/cd0a /mnt2
>> Hit enter to continue[ 177.9141033] sd0(wdsc0:0:1:0): wdsc0: timed out; asr=0x20 [acb 0x97ec4fa8 (flags 0x1, dleft 20)], <state 5, nexus 0x97ec4fa8, resid 20, msg(q 0,o 0)>sd0(wdsc0:0:1:0): ABORT in timeout: csr=0xff, asr=0x20
>> [ 178.1254838] sd0(wdsc0:0:1:0): sending ABORT command
>> [ 178.1840412] sd0(wdsc0:0:1:0): Resetting bus
>> [ 180.1885505] sd0(wdsc0:0:1:0): wdsc0: timed out; asr=0x00 [acb 0x97ec4fa8 (flags 0x41, dleft 20)], <state 8, nexus 0x97ec4fa8, resid 20, msg(q 0,o 0)>sd0(wdsc0:0:1:0): ABORT in timeout: csr=0x01, asr=0x00
>> [ 180.4009330] sd0(wdsc0:0:1:0): sending ABORT command
>> [ 180.4594968] sd0(wdsc0:0:1:0): sending DISCONNECT to target
>> [ 183.4353722] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 186.4119961] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 189.3885164] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 192.3650359] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 195.3415581] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 198.3180772] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 201.2945987] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 204.2711138] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 207.2476389] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 210.2241586] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 213.2006735] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 216.1770920] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 219.1536130] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 222.1302349] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>> [ 225.1068634] wd33c93_wait: TIMEO @959 with asr=x0 csr=x1
>>
>> ...
>>
>> Does anyone remember how ye olde controller works well enough to go through
>> the driver and figure out what could be missing from the error handling /
>> recovery?
>>
>> eg - I have one diff already to handle a NULL pointer inside the timeout
>> routine:
>>
>> @@ -2298,7 +2304,7 @@ wd33c93_timeout(void *arg)
>>  		/* We need to service a missed IRQ */
>>  		wd33c93_intr(sc);
>>  	} else {
>> -		(void) wd33c93_abort(sc, sc->sc_nexus, "timeout");
>> +		(void) wd33c93_abort(sc, acb, "timeout");
>>  	}
>>  	splx(s);
>>  }
>>
>> sc->sc_nexus is NULL after a disconnect, before the timeout fires, so that
>> would panic. Is using acb there instead "correct"?
>>
>> Thanks!
>>
>
> What's the CDB of the failing command? Is it the first one sent, or does
> this happen at random? Go ahead with the eye roll on this one: are both
> ends of the bus properly terminated? And is it the right kind (I remember
> hassles from active vs passive)? Timeout on mount always has me going
> through all the basics, since they aren't as top of mind as they were 30
> years ago.
>
It gets through a fair amount of I/O before it craps the bed. But I've had
to limit the I/O size to 4k, because if I leave it at maxphys it dies
after the first couple of 64k writes.
(Yes, I'll go diagnose that separately. :-) )

I'm still trying to build a mental map of this SCSI driver and its state in my
head, since I think some of the error handling has rotted a bit.

E.g., the crash happens because when the device disconnects mid-transfer,
the code sets the current transfer (sc->sc_nexus) to NULL and doesn't
complete the I/O.
Then the callout fires, wd33c93_timeout() is called, sc->sc_nexus is NULL,
so when it calls wd33c93_abort(sc, sc->sc_nexus, "timeout") the routine
dereferences a NULL acb pointer and things panic.
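
To make the failure mode concrete, here is a minimal standalone sketch in C. It is not the driver code: model_acb, model_softc, model_abort() and model_timeout() are hypothetical stand-ins, and the only things taken from the real driver are the sc_nexus field name and the fact that the timeout routine gets the timed-out command through its arg. It assumes the callout argument stays valid until the timeout has run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, stripped-down stand-in for a per-command control block. */
struct model_acb {
	int flags;
	int aborted;	/* set once an abort has been requested */
};

/* Hypothetical stand-in for the controller softc. */
struct model_softc {
	struct model_acb *sc_nexus;	/* connected command, NULL after a disconnect */
};

/*
 * Request an abort of acb. Returns 0 on success, -1 if acb is NULL.
 * The NULL check is an addition in this model; per the description above,
 * the real routine dereferences acb unconditionally.
 */
static int
model_abort(struct model_softc *sc, struct model_acb *acb, const char *where)
{
	(void)sc;
	if (acb == NULL) {
		fprintf(stderr, "abort in %s: no acb\n", where);
		return -1;
	}
	acb->aborted = 1;
	return 0;
}

/*
 * Timeout handler. arg is the command the callout was armed for, so it is
 * still usable even when sc->sc_nexus has been cleared by a disconnect.
 */
static int
model_timeout(struct model_softc *sc, void *arg)
{
	struct model_acb *acb = arg;

	assert(sc != NULL);
	return model_abort(sc, acb, "timeout");
}
```

With sc_nexus set to NULL, model_timeout(&sc, &acb) still aborts the right command, whereas model_abort(&sc, sc.sc_nexus, "timeout") has nothing to work with. Whether the real driver should also complete the I/O at the point of the disconnect is a separate question this sketch does not answer.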
-adrian