current-users: still problems in scsi "error" handling?

Subject: still problems in scsi "error" handling?
To: NetBSD-current Discussion List <current-users@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 12/17/2001 16:40:32
Last night my machine panic'ed (I really need to take a week off and
update to -current again!), and during the fsck one of the disks
"suffered" a recovered error (I really need to replace that disk! ;-)

(this output copied from conserver's logs)

[Mon Dec 17 02:18:36 2001]sd0(ahc0:0:0:0):  Check Condition on CDB: 0x28 00 00 83 6e 50 00 00 10 00
[Mon Dec 17 02:18:36 2001]^M    SENSE KEY:  Recovered Error
[Mon Dec 17 02:18:36 2001]^M   INFO FIELD:  8613456
[Mon Dec 17 02:18:36 2001]^M COMMAND INFO:  383452517 (0x16db0565)
[Mon Dec 17 02:18:36 2001]^M     ASC/ASCQ:  ASC 0x17 ASCQ 0xc1
[Mon Dec 17 02:18:36 2001]^M         SKSV:  Actual Retry Count: 25
[Mon Dec 17 02:18:36 2001]


Unfortunately that caused fsck to barf unexpectedly (as I understand it
a recovered error should not cause a read error):

[Mon Dec 17 02:18:36 2001]^MCANNOT READ: BLK 5572656
[Mon Dec 17 02:18:36 2001]/dev/rsd0f: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.


I have a vague memory suggesting this has been fixed since 2001/06/24 in
-current, but has it really, or should I send-pr?


FYI here's the first part of the panic message:

[Mon Dec 17 02:13:20 2001]^Mpanic: trap
[Mon Dec 17 02:13:20 2001]^MBegin traceback...
[Mon Dec 17 02:13:20 2001]^Mtrap() at trap+0x23c
[Mon Dec 17 02:13:20 2001]^M--- trap (number 6) ---
[Mon Dec 17 02:13:20 2001]^Mflush_inodedep_deps(c09ca000,1cc458,c04297c0,0,ffffffff) at flush_inodedep_deps+0xec
[Mon Dec 17 02:13:20 2001]^Msoftdep_sync_metadata(cbb9ac6c) at softdep_sync_metadata+0x6b
[Mon Dec 17 02:13:20 2001]^Mffs_full_fsync(cbb9ac6c,cbb961d4,0,c08b2300,c02aee1e) at ffs_full_fsync+0x1d5
[Mon Dec 17 02:13:20 2001]^Mffs_fsync(cbb9ac6c,0,c02754cc,cbb9ae08,cbb9ac6c) at ffs_fsync+0x3a
[Mon Dec 17 02:13:20 2001]^Mnfsrv_commit(c0ce8a00,c08b2b00,cbb961d4,cbb9ae08,0) at nfsrv_commit+0x519
[Mon Dec 17 02:13:20 2001]^Mnfssvc_nfsd(cbb9ae68,804b380,cbb961d4,cbb9af78,cbb961d4) at nfssvc_nfsd+0x52f
[Mon Dec 17 02:13:20 2001]^Msys_nfssvc(cbb961d4,cbb9af80,cbb9af78) at sys_nfssvc+0x6f7
[Mon Dec 17 02:13:20 2001]^Msyscall_plain(bfbf001f,1f,bfbf001f,1f,bfbfdce4) at syscall_plain+0x98
[Mon Dec 17 02:13:20 2001]^MEnd traceback...
[Mon Dec 17 02:13:20 2001]^Msyncing disks... panic: softdep_disk_write_complete: lock is held
[Mon Dec 17 02:13:20 2001]^MBegin traceback...
[Mon Dec 17 02:13:20 2001]^Msoftdep_disk_write_complete(c4a39110) at softdep_disk_write_complete+0x1c
[Mon Dec 17 02:13:20 2001]^Mbiodone(c4a39110) at biodone+0x4b
[Mon Dec 17 02:13:20 2001]^Mscsipi_complete(c0cf53a8,0,c0904208,0,11009) at scsipi_complete+0x376
[Mon Dec 17 02:13:20 2001]^Mscsipi_done(c0cf53a8,c0904208,c08b9c80,c0902e00,23) at scsipi_done+0xfa
[Mon Dec 17 02:13:20 2001]^Mahc_done(c0902e00,c0904208) at ahc_done+0x239
[Mon Dec 17 02:13:21 2001]^Mahc_run_qoutfifo(c0902e00) at ahc_run_qoutfifo+0xad
[Mon Dec 17 02:13:21 2001]^Mahc_intr(c0902e00) at ahc_intr+0x7a
[Mon Dec 17 02:13:21 2001]^MXintr11() at Xintr11+0x7c
[Mon Dec 17 02:13:21 2001]^M--- interrupt ---
[Mon Dec 17 02:13:21 2001]^MXspllower(cbb9aa18,cbb9aa0c,c01ae115,100,0) at Xspllower+0xe
[Mon Dec 17 02:13:21 2001]^Mcpu_reboot(100,0,6,cbb961d4,0) at cpu_reboot+0x3b
[Mon Dec 17 02:13:21 2001]^Mpanic(c03bd17b,c000ff66,deadbeef,2,0) at panic+0xf5
[Mon Dec 17 02:13:21 2001]^Mtrap() at trap+0x23c
[Mon Dec 17 02:13:21 2001]^M--- trap (number 6) ---
[Mon Dec 17 02:13:21 2001]^Mflush_inodedep_deps(c09ca000,1cc458,c04297c0,0,ffffffff) at flush_inodedep_deps+0xec
[Mon Dec 17 02:13:21 2001]^Msoftdep_sync_metadata(cbb9ac6c) at softdep_sync_metadata+0x6b
[Mon Dec 17 02:13:21 2001]^Mffs_full_fsync(cbb9ac6c,cbb961d4,0,c08b2300,c02aee1e) at ffs_full_fsync+0x1d5
[Mon Dec 17 02:13:21 2001]^Mffs_fsync(cbb9ac6c,0,c02754cc,cbb9ae08,cbb9ac6c) at ffs_fsync+0x3a
[Mon Dec 17 02:13:21 2001]^Mnfsrv_commit(c0ce8a00,c08b2b00,cbb961d4,cbb9ae08,0) at nfsrv_commit+0x519
[Mon Dec 17 02:13:21 2001]^Mnfssvc_nfsd(cbb9ae68,804b380,cbb961d4,cbb9af78,cbb961d4) at nfssvc_nfsd+0x52f
[Mon Dec 17 02:13:21 2001]^Msys_nfssvc(cbb961d4,cbb9af80,cbb9af78) at sys_nfssvc+0x6f7
[Mon Dec 17 02:13:21 2001]^Msyscall_plain(bfbf001f,1f,bfbf001f,1f,bfbfdce4) at syscall_plain+0x98
[Mon Dec 17 02:13:22 2001]^MEnd traceback...
[Mon Dec 17 02:13:22 2001]
[Mon Dec 17 02:13:22 2001]^Mdumping to dev 4,1 offset 621096
[Mon Dec 17 02:13:22 2001]^Mdump panic: softdep_disk_write_complete: lock is held
[Mon Dec 17 02:13:22 2001]^MBegin traceback...

and again it tried and paniced twice more before finally getting a "dump
device not ready" and rebooting....

Unfortuanetly the client that did the NFS operation which apparently
caused the panic was also runing a "console" client in spy mode, which
made all attempts to use conserver unviable (that machine was
effectively hung in NFS never-never land) and I didn't dare re-patch the
console to find out what happened until after the RAIDframe parity
rewrite completed (about 45 minutes later....)  :-(

-- 
								Greg A. Woods

+1 416 218-0098;  <gwoods@acm.org>;  <g.a.woods@ieee.org>;  <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>