NetBSD-Bugs archive


kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)



>Number:         54790
>Category:       kern
>Synopsis:       9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 20 21:55:00 +0000 2019
>Originator:     Izumi Tsutsui
>Release:        NetBSD 9.0_RC1
>Organization:
>Environment:
System: NetBSD 9.0_RC1 (GENERIC) #0: Wed Nov 27 16:14:52 UTC 2019
    mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/i386/compile/GENERIC
Architecture: i386
Machine: i386
>Description:
I'm getting a reproducible kernel fault in ata_recovery_resume()
on my 9.0_RC1 i386 machines.  It appears to be triggered by SSD errors,
but I'm not sure whether the errors are a real hardware failure.
(Not seen with the 8.1 kernel.)

ddb says (typed from screen pic):
---
kernel: supervisor trap page fault, code=0
Stopped in pid 0.41 (system) at netbsd:ata_recovery_resume+0xe3:       movzwl  8(%eax),%edx
db{0}> bt
ata_recovery_resume(c51abb88,0,8441,8,c08a608e,0,8441,8000,c51abb88,8441) at netbsd:ata_recovery_resume+0xe3
ahci_channel_recover(c51abb88,8,8441,c0fc4238,1277b90,c51ab000,8,c51abb88,c4d488c0,0) at netbsd:ahci_channel_recover+0x82
ata_thread_run(c51abb88,8,8000,8441,c51abb90,6,c51abc98,c5197080,c01813fc,c509fc00) at netbsd:ata_thread_run+0x1f3
atabus_thread(c5197080,1540000,154a000,0,c01003fd,0,0,0,0,0) at netbsd:atabus_thread+0x228
>db{0}>
---

dmesg at the ddb prompt says (timestamps omitted to save typing):
---
 :
ahcisata0 at pci0 dev 18 function 0: vendor 1002 product 4380 (rev. 0x00)
ahcisata0: ignoring broken port multiplier support
ahcisata0: AHCI revision 1.10, 4 ports, 32 slots, CAP 0xf3209f83<CCCS,PMD,ISS=0x2=Gen2,SCLO,SAL,SMPS,SSNTF,SNCQ,S64A>
ahcisata0: interrupting at ioapic0 pin 22
atabus0 at ahcisata0 channel 0
atabus1 at ahcisata0 channel 1
atabus2 at ahcisata0 channel 2
atabus3 at ahcisata0 channel 3
 :
ixpide0 at pci0 dev 20 function 1: ATI Technologies IXP IDE Controller (rev. 0x00)
ixpide0: bus-master DMA support present
ixpide0: primary channel configured to compatibility mode
ixpide0: primary channel interrupting at ioapic0 pin 14
atabus4 at ixpide0 channel 0
ixpide0: secondary channel configured to compatibility mode
ixpide0: secondary channel interrupting at ioapic0 pin 15
atabus5 at ixpide0 channel 1
 :
ahcisata0 port 0: device present, speed: 3.0Gb/s
ahcisata0 port 1: device present, speed: 3.0Gb/s
ahcisata0 port 2: device present, speed: 3.0Gb/s
ahcisata0 port 3: device present, speed: 1.5Gb/s
 :
wd0 at atabus0 drive 0
wd0: <Hitachi HDS5C3020ALA632>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
wd1 at atabus1 drive 0
wd1: <Hitachi HDS5C3020ALA632>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags) w/PRIO
wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
wd2 at atabus2 drive 0
wd2: <Samsung SSD 860 EVO 500GB>
wd2: drive supports 1-sector PIO transfers, LBA48 addressing
wd2: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
atapibus0 at atabus3: 1 targets
cd0 at atapibus0 drive 0: <HL-DT-ST DVDRAM GH24NSD5, KLUIBRA1411, LJ00> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
cd0(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
 :
wsmux1: connecting to wsdisplay0
cd0(ahcisata0:3:0):  DEFERRED ERROR, key = 0x2
wsdisplay0: screen 1 added (default, vt100 emulation)
wsdisplay0: screen 2 added (default, vt100 emulation)
wsdisplay0: screen 3 added (default, vt100 emulation)
wsdisplay0: screen 4 added (default, vt100 emulation)
cd0(ahcisata0:3:0):  DEFERRED ERROR, key = 0x2
wd2a: device timeout reading fsbn 343200640 of 343200640-343200647 (wd2 bn 343200640; cn 167578 tn 14 sn 0), xfer dcc, retry 0
wd2a: device timeout writing fsbn 479102685 of 479102605-479102719 (wd2 bn 479102685; cn 233936 tn 27 sn 29), xfer 7f0, retry 0
 :
[many similar errors]
 :
uvm_fault(0xc13737e0, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip 0xc018305f cs 0x8 eflags 0x10286 cr2 0x8 ilevel 0 esp 0xc51abb88
curlwp 0xc509fc00 pid 0 lid 41 lowest kstack 0xdc7da2c0
db{0}> 
---

"0xc018305f" is here:
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_recovery.c?r=1.2#240
---
    234 	/* Requeue all unfinished commands for same drive as failed command */
    235 	for (slot = 0; slot < ch_openings; slot++) {
    236 		if ((ata_queue_active(chp) & (1U << slot)) == 0)
    237 			continue;
    238 
    239 		xfer = ata_queue_hwslot_to_xfer(chp, slot);
->  240 		if (drive != xfer->c_drive) 
    241 			continue;
    242 
    243 		xfer->ops->c_kill_xfer(chp, xfer,
    244 		    (error == 0) ? KILL_REQUEUE : KILL_RESET);
    245 	}
---
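Simple printf debugging shows that "xfer" is actually NULL at the time of the fault,
which is consistent with cr2 0x8 and the faulting "movzwl 8(%eax),%edx" shown by ddb.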
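
For illustration only, here is a minimal user-space sketch of the same loop
shape with a NULL guard added.  Everything named sketch_* is a hypothetical
stand-in, not the real NetBSD structures or functions, and skipping a NULL
entry is an assumption rather than a confirmed fix: it would avoid the
dereference but might only hide whatever leaves a slot marked active with
no xfer attached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSLOTS 32

/* Hypothetical stand-in for the kernel's per-command transfer structure. */
struct sketch_xfer {
	int c_drive;	/* drive the command was issued to */
	int requeued;	/* set when this sketch "requeues" the command */
};

/* Hypothetical stand-in for the channel: active mask plus slot table. */
struct sketch_channel {
	uint32_t active;				/* one bit per active slot */
	struct sketch_xfer *slot_xfer[SKETCH_NSLOTS];	/* may hold NULL */
};

/*
 * Walk the active slots and mark every command for `drive` as requeued.
 * An active slot with no xfer attached is counted in *skipped instead of
 * being dereferenced.  Returns the number of commands requeued.
 */
static int
sketch_requeue(struct sketch_channel *chp, int drive, int nslots, int *skipped)
{
	int slot, requeued = 0;

	assert(nslots >= 0 && nslots <= SKETCH_NSLOTS);
	*skipped = 0;

	for (slot = 0; slot < nslots; slot++) {
		struct sketch_xfer *xfer;

		if ((chp->active & (UINT32_C(1) << slot)) == 0)
			continue;

		xfer = chp->slot_xfer[slot];
		if (xfer == NULL) {
			/* Active bit set but no xfer: the case seen in the crash. */
			(*skipped)++;
			continue;
		}
		if (drive != xfer->c_drive)
			continue;

		xfer->requeued = 1;
		requeued++;
	}
	return requeued;
}
```

With slots 0 and 1 holding commands for drives 0 and 1 and slot 2 marked
active but empty, calling sketch_requeue() for drive 0 requeues one command
and reports one skipped slot instead of faulting.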

>How-To-Repeat:
Nearly 100% reproducible under load on the Samsung SSD in my main machine
(ASRock M3A UCC http://www.asrock.com/mb/AMD/M3A%20UCC/index.jp.asp ),
but I'm not sure whether it can happen on other machines.

>Fix:
No idea.
Would it be worth having a kernel config option to disable NCQ,
if the crash is triggered by that feature?

---
Izumi Tsutsui


