Port-mips archive


sgimips, wdsc / scsi timeouts and kernel panics



hi!

I may have slipped and bought a couple of SGI Indys and I wanted to
install NetBSD on one.

However! It looks like the wdsc/wd33c93 driver does not handle the device
going away on the bus during a transfer.

(The driver also doesn't compile with DEBUG defined, so I'll at least
need to send out a diff to fix that.)

E.g.:

# newfs -O 1 -V 4 /dev/rsd0a
/dev/rsd0a: 3996.0MB (8183744 sectors) block size 16384, fragment size 2048
        using 22 cylinder groups of 181.64MB, 11625 blks, 22912 inodes.
super-block backups (for fsck_ffs -b #) at:
[ 154.0069991] sd0(wdsc0:0:1:0): wdsc0: timed out; asr=0x00 [acb 0x97ec4fa8
(flags 0x1, dleft 10000)], <state 1, nexus 0x0, resid 8000, msg(q 0,o
0)>pid 0(system): trap: cpu0, TLB miss (load or instr. fetch) in kernel mode
[ 154.1577295] status=0xf003, cause=0x108, epc=0x880c1614, vaddr=0
[ 154.1577295] tf=0x88041dc0 ksp=0x88041e60 ra=0x880c1610 ppl=0xf003

That's happening because sc->sc_nexus is NULL, and wd33c93_timeout() calls
wd33c93_abort() on sc->sc_nexus.
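
A minimal sketch of the kind of guard that would avoid this particular
NULL dereference. It is not the actual driver code: struct acb, struct
softc, abort_acb() and timeout_handler() are simplified, hypothetical
stand-ins, and only the sc_nexus name is taken from the driver.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct acb {
    int timed_out;          /* set once the command has been failed */
};

struct softc {
    struct acb *sc_nexus;   /* command currently on the bus, or NULL */
};

/* Hypothetical abort helper: it dereferences acb, so acb must not be NULL. */
static void abort_acb(struct softc *sc, struct acb *acb)
{
    acb->timed_out = 1;
    if (sc->sc_nexus == acb)
        sc->sc_nexus = NULL;
}

/*
 * Timeout handler sketch. Returns 1 if it aborted the active nexus,
 * 0 if there was no nexus and it only failed the timed-out acb.
 */
static int timeout_handler(struct softc *sc, struct acb *acb)
{
    if (sc->sc_nexus == NULL) {
        /* The target already left the bus: nothing to abort on the chip. */
        acb->timed_out = 1;
        return 0;
    }
    abort_acb(sc, sc->sc_nexus);
    return 1;
}
```

This only stops the panic; whether the timed-out command should be
retried or failed upward is a separate question.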

sc->sc_nexus is NULL because the device went off the bus. After compiling
with #define DEBUG in the driver and enabling the config bits, I see this:

[ 133.1887976] go[0x1a] next[a=00,c=1a]: wd33c93_xfout {6} 0a 00 20 e0 80
00 00 00 01 00
[ 133.2829149] wd33c93_xfout done: 0 bytes remaining (wait:50000)
[ 133.3529827] go[0x18] next[a=80,c=18]: DMA xfer: 1(0xc5017000:10000)
[ 133.4282821] > done i=1 stat=ff
[ 133.4648820] intr[csr=0x4b]next[a=80,c=4b]: {16:}60=STS:00=
[ 133.5307481] dma_stop
[ 133.5569098] scsidone: (1,0)->(1,0)00
[ 133.5997782] intr done. state=1, asr=0x80
[ 133.6468485] wd33c93_scsi_request: req 0x0
[ 133.6949635] wd33c93_sched(1,0)
[ 133.7315436] wd33c93_go(1:0)
[ 133.7650042] wd33c93_go dmago:1(tcnt=0) dmaok=1x
[ 133.8193878] wd33c93_selectbus 1: Selection Complete
[ 133.8779606] wd33c93_setsync: sync reg = 0xac
[ 133.9291898] go[0x1a] next[a=00,c=1a]: wd33c93_xfout {6} 0a 00 21 60 80
00 00 00 01 00
[ 134.0233171] wd33c93_xfout done: 0 bytes remaining (wait:50000)
[ 134.0933725] go[0x18] next[a=80,c=18]: DMA xfer: 1(0xc5027000:10000)
[ 134.1686688] > done i=1 stat=ff
[ 134.6952622] intr[csr=0x41]next[a=80,c=41]: wd33c93next target 1
disconnected
[ 134.7779182] dma_stop
[ 134.8040586] wd33c93sched: no work
[ 134.8437959] intr done. state=1, asr=0x80
[ 194.2408528] sd0(wdsc0:0:1:0): wdsc0: timed out; asr=0x00 [acb 0x97ec4fa8
(flags 0x1, dleft 10000)], <state 1, nexus 0x0, resid d000, msg(q 0,o
0)>pid 0(system): trap: cpu0, TLB miss (load or instr. fetch) in kernel mode
[ 194.3915387] status=0xf003, cause=0x108, epc=0x880c16e8, vaddr=0
[ 194.3915387] tf=0x88041dc0 ksp=0x88041e60 ra=0x880c16e4 ppl=0xf003
[ 194.3915387] kernel: TLB miss (load or instr. fetch) trap
Stopped in pid 0.5 (system) at  880c16e8:       lw      v0,8(s1)

Everything's fine until it does a couple of 64k transfers, and then the
drive goes offline.
The code around "target %d disconnected" explicitly sets sc->sc_nexus =
NULL and then calls sched, but there's no further work. Eventually the
original transfer times out, still holding its old transfer block, and
with sc->sc_nexus NULL you end up with a panic.
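
One possible way to avoid leaving a stale block behind is to fail the
in-flight command at the point the nexus is cleared. The sketch below
models that, assuming the disconnect is one the target will never follow
with a reselect (a legitimate SCSI disconnect would need different
handling). All names except sc_nexus are hypothetical and this is not the
driver's actual code.

```c
#include <stddef.h>

/* Hypothetical completion states for a transfer block. */
enum acb_status { ACB_PENDING, ACB_DONE_OK, ACB_DONE_ERROR };

struct acb {
    enum acb_status status;
};

struct softc {
    struct acb *sc_nexus;   /* command currently on the bus, or NULL */
};

/*
 * Hypothetical handler for a disconnect the target will not recover
 * from: clear the nexus and fail the command immediately instead of
 * leaving it pending for the timeout to find later.
 */
static void handle_unexpected_disconnect(struct softc *sc)
{
    struct acb *acb = sc->sc_nexus;

    sc->sc_nexus = NULL;
    if (acb != NULL && acb->status == ACB_PENDING)
        acb->status = ACB_DONE_ERROR;
}

/* The timeout path only has work to do for commands still pending. */
static int timeout_should_act(const struct acb *acb)
{
    return acb->status == ACB_PENDING;
}
```

With the command already completed in error, a later timeout has nothing
pending to abort.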

Yes, the drive is a ZuluSCSI, and it's very plausible it is unhappy with
either the large transfer sizes or back-to-back transfers, and I'm going to
dig into that, but I'd at least like to see about getting the SCSI driver
handling errors right.

So, where's a good place / who's a good person to start digging into the
SCSI state machine and driver error handling here?

Thanks!


-adrian

