NetBSD-Bugs archive
Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
The following reply was made to PR kern/54790; it has been noted by GNATS.
From: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
To: "gnats-bugs%NetBSD.org@localhost" <gnats-bugs%netbsd.org@localhost>
Cc:
Subject: Re: kern/54790: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
Date: Sun, 22 Dec 2019 22:48:41 +0100
Can you try a kernel built with DEBUG+DIAGNOSTIC?
There is a KASSERTMSG() which should trigger if the xfer is no longer active.
Once ata_queue_active() reports the slot as active, ata_queue_hwslot_to_xfer()
should never return NULL for that slot.
Jaromir
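To make the invariant above concrete, here is a minimal user-space sketch in C.
It is not the NetBSD implementation: demo_queue, demo_xfer and the helper names
are hypothetical stand-ins for the channel queue, and a plain assert() stands in
for KASSERTMSG(). The assumed model is that an active-slot bitmask and a
slot-to-transfer table are always updated together, so a set bit with an empty
table entry means the two have gone out of sync.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NSLOTS 32	/* hypothetical; mirrors the "32 slots" in the dmesg */

struct demo_xfer {
	uint16_t drive;		/* drive the transfer targets */
};

struct demo_queue {
	uint32_t active;			/* bit N set: slot N is in flight */
	struct demo_xfer *slot[DEMO_NSLOTS];	/* slot N -> transfer, NULL if idle */
};

/* Mark a slot active and record its transfer in one step. */
static void
demo_activate(struct demo_queue *q, unsigned int s, struct demo_xfer *x)
{
	assert(s < DEMO_NSLOTS && x != NULL);
	q->slot[s] = x;
	q->active |= UINT32_C(1) << s;
}

/* Clear both the bit and the table entry when a transfer completes. */
static void
demo_deactivate(struct demo_queue *q, unsigned int s)
{
	assert(s < DEMO_NSLOTS);
	q->active &= ~(UINT32_C(1) << s);
	q->slot[s] = NULL;
}

/* Return 1 if every active bit has a transfer and every idle slot has none. */
static int
demo_consistent(const struct demo_queue *q)
{
	for (unsigned int s = 0; s < DEMO_NSLOTS; s++) {
		int bit = (int)((q->active >> s) & 1U);
		int have = q->slot[s] != NULL;

		if (bit != have)
			return 0;
	}
	return 1;
}
```

In this model, a crash like the one reported would correspond to
demo_consistent() returning 0: a bit is still set in the mask while the
matching table entry has already been cleared.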
On Fri, 20 Dec 2019 at 22:55, Izumi Tsutsui <tsutsui%ceres.dti.ne.jp@localhost> wrote:
> >Number: 54790
> >Category: kern
> >Synopsis: 9.0_RC1 kernel crash in ata_recovery_resume() (in NCQ support?)
> >Confidential: no
> >Severity: critical
> >Priority: high
> >Responsible: kern-bug-people
> >State: open
> >Class: sw-bug
> >Submitter-Id: net
> >Arrival-Date: Fri Dec 20 21:55:00 +0000 2019
> >Originator: Izumi Tsutsui
> >Release: NetBSD 9.0_RC1
> >Organization:
> >Environment:
> System: NetBSD 9.0_RC1 (GENERIC) #0: Wed Nov 27 16:14:52 UTC 2019
> mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/i386/compile/GENERIC
> Architecture: i386
> Machine: i386
> >Description:
> I'm getting a reproducible kernel fault in ata_recovery_resume()
> on my 9.0_RC1 i386 machines. It looks like it is triggered by SSD errors,
> but I am not sure whether the errors are a real hardware failure or not.
> (Not seen on the 8.1 kernel.)
>
> ddb says (typed from screen pic):
> ---
> kernel: supervisor trap page fault, code=0
> Stopped in pid 0.41 (system) at netbsd:ata_recovery_resume+0xe3:
> movzwl 8(%eax),%edx
> db{0}> bt
> ata_recovery_resume(c51abb88,0,8441,8,c08a608e,0,8441,8000,c51abb88,8441)
> at netbsd:ata_recovery_resume+0xe3
> ahci_channel_recover(c51abb88,8,8441,c0fc4238,1277b90,c51ab000,8,c51abb88,c4d488c0,0)
> at netbsd:ahci_channel_recover+0x82
> ata_thread_run(c51abb88,8,8000,8441,c51abb90,6,c51abc98,c5197080,c01813fc,c509fc00)
> at netbsd:ata_thread_run+0x1f3
> atabus_thread(c5197080,1540000,154a000,0,c01003fd,0,0,0,0,0) at
> netbsd:atabus_thread+0x228
> >db{0}>
> ---
>
> dmesg at the ddb prompt says (timestamps are omitted to save typing):
> ---
> :
> ahcisata0 at pci0 dev 18 function 0: vendor 1002 product 4380 (rev. 0x00)
> ahcisata0: ignoring broken port multiplier support
> ahcisata0: AHCI revision 1.10, 4 ports, 32 slots, CAP
> 0xf3209f83<CCCS,PMD,ISS=0x2=Gen2,SCLO,SAL,SMPS,SSNTF,SNCQ,S64A>
> ahcisata0: interrupting at ioapic0 pin 22
> atabus0 at ahcisata0 channel 0
> atabus1 at ahcisata0 channel 1
> atabus2 at ahcisata0 channel 2
> atabus3 at ahcisata0 channel 3
> :
> ixpide0 at pci0 dev 20 function 1: ATI Technologies IXP IDE Controller
> (rev. 0x00)
> ixpide0: bus-master DMA support present
> ixpide0: primary channel configured to compatibility mode
> ixpide0: primary channel interrupting at ioapic0 pin 14
> atabus4 at ixpide0 channel 0
> ixpide0: secondary channel configured to compatibility mode
> ixpide0: secondary channel interrupting at ioapic0 pin 15
> atabus5 at ixpide0 channel 1
> :
> ahcisata0 port 0: device present, speed: 3.0Gb/s
> ahcisata0 port 1: device present, speed: 3.0Gb/s
> ahcisata0 port 2: device present, speed: 3.0Gb/s
> ahcisata0 port 3: device present, speed: 1.5Gb/s
> :
> wd0 at atabus0 drive 0
> wd0: <Hitachi HDS5C3020ALA632>
> wd0: drive supports 16-sector PIO transfers, LBA48 addressing
> wd0: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168
> sectors
> wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133),
> WRITE DMA FUA, NCQ (32 tags) w/PRIO
> wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6
> (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
> wd1 at atabus1 drive 0
> wd1: <Hitachi HDS5C3020ALA632>
> wd1: drive supports 16-sector PIO transfers, LBA48 addressing
> wd1: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168
> sectors
> wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133),
> WRITE DMA FUA, NCQ (32 tags) w/PRIO
> wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6
> (Ultra/133) (using DMA), NCQ (31 tags) w/PRIO
> wd2 at atabus2 drive 0
> wd2: <Samsung SSD 860 EVO 500GB>
> wd2: drive supports 1-sector PIO transfers, LBA48 addressing
> wd2: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168
> sectors
> wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133),
> WRITE DMA FUA, NCQ (32 tags)
> wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6
> (Ultra/133) (using DMA), NCQ (31 tags)
> atapibus0 at atabus3: 1 targets
> cd0 at atapibus0 drive 0: <HL-DT-ST DVDRAM GH24NSD5, KLUIBRA1411, LJ00>
> cdrom removable
> cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
> cd0(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6
> (Ultra/133) (using DMA)
> :
> wsmux1: connecting to wsdisplay0
> cd0(ahcisata0:3:0): DEFERRED ERROR, key = 0x2
> wsdisplay0: screen 1 added (default, vt100 emulation)
> wsdisplay0: screen 2 added (default, vt100 emulation)
> wsdisplay0: screen 3 added (default, vt100 emulation)
> wsdisplay0: screen 4 added (default, vt100 emulation)
> cd0(ahcisata0:3:0): DEFERRED ERROR, key = 0x2
> wd2a: device timeout reading fsbn 343200640 of 343200640-343200647 (wd2 bn
> 343200640; cn 167578 tn 14 sn 0), xfer dcc, retry 0
> wd2a: device timeout writing fsbn 479102685 of 479102605-479102719 (wd2 bn
> 479102685; cn 233936 tn 27 sn 29), xfer 7f0, retry 0
> :
> [many similar errors]
> :
> uvm_fault(0xc13737e0, 0, 1) -> 0xe
> fatal page fault in supervisor mode
> trap type 6 code 0 eip 0xc018305f cs 0x8 eflags 0x10286 cr2 0x8 ilevel 0
> esp 0xc51abb88
> curlwp 0xc509fc00 pid 0 lid 41 lowest kstack 0xdc7da2c0
> db{0}>
> ---
>
> "0xc018305f" is here:
> https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_recovery.c?r=1.2#240
> ---
>    234         /* Requeue all unfinished commands for same drive as failed command */
>    235         for (slot = 0; slot < ch_openings; slot++) {
>    236                 if ((ata_queue_active(chp) & (1U << slot)) == 0)
>    237                         continue;
>    238
>    239                 xfer = ata_queue_hwslot_to_xfer(chp, slot);
> -> 240                 if (drive != xfer->c_drive)
>    241                         continue;
>    242
>    243                 xfer->ops->c_kill_xfer(chp, xfer,
>    244                     (error == 0) ? KILL_REQUEUE : KILL_RESET);
>    245         }
> ---
> According to simple printf debugging, "xfer" is actually NULL at the fault.
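The register dump is consistent with a NULL xfer: movzwl 8(%eax),%edx loads
16 bits from %eax + 8, and cr2 (the faulting address) is 0x8, so %eax must have
been 0. A minimal sketch of that arithmetic in C follows. The struct is a
hypothetical 32-bit layout chosen so that a 16-bit field lands at offset 8; it
is an assumption for illustration, not the real struct ata_xfer definition, and
the demo_* names do not exist in NetBSD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical i386-style layout; NOT copied from the NetBSD headers. */
struct demo_xfer32 {
	int8_t   slot;	/* offset 0, followed by 3 bytes of padding */
	uint32_t chp;	/* offset 4, stands in for a 32-bit pointer */
	uint16_t drive;	/* offset 8, the kind of field a movzwl would read */
};

/* Address touched when a field at `off` is read through pointer value `base`. */
static uint32_t
demo_fault_addr(uint32_t base, size_t off)
{
	return base + (uint32_t)off;
}

/* Recover the base pointer from a faulting address and a field offset. */
static uint32_t
demo_base_from_fault(uint32_t cr2, size_t off)
{
	return cr2 - (uint32_t)off;
}
```

With the values from the trap message, demo_base_from_fault(0x8, 8) gives 0,
which agrees with the printf result that the pointer was NULL.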
>
> >How-To-Repeat:
> ~100% reproducible on my Samsung SSD under load on my main machine
> (ASRock M3A UCC http://www.asrock.com/mb/AMD/M3A%20UCC/index.jp.asp ),
> but I'm not sure whether it can happen on other machines.
>
> >Fix:
> No idea.
> Would it be worth having a kernel config option to disable NCQ,
> if the crash is triggered by that feature?
>
> ---
> Izumi Tsutsui
>
>