Please try the attached patch.

On Sat, Aug 31, 2019 at 6:55 PM Rin Okuyama <rokuyama.rk%gmail.com@localhost> wrote:
>
> The following reply was made to PR kern/54503; it has been noted by GNATS.
>
> From: Rin Okuyama <rokuyama.rk%gmail.com@localhost>
> To: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
> Cc: "gnats-bugs%NetBSD.org@localhost" <gnats-bugs%netbsd.org@localhost>
> Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
> Date: Sat, 31 Aug 2019 18:52:18 +0900
>
> On 2019/08/30 23:28, Jaromír Doleček wrote:
> > Can you please try a kernel with nvme_q_complete() marked __noinline,
> > to see where exactly inside that function the code panics? I've
> > reviewed the code and I don't see any particular reason why it would
> > fail while setting up 32nd queue.
>
> nvme_q_complete() is not expanded inline even without __noinline.
>
> Instruction at the fault address,
>
>     allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 136
>     nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
>     prevented execution of 0x0 (SMEP)
>     fatal page fault in supervisor mode
>     ...
>     db{0}> bt
>     db_disasm() at netbsd:db_disasm+0xcd
>     db_trap() at netbsd:db_trap+0x16b
>     kdb_trap() at netbsd:kdb_trap+0x12a
>     trap() at netbsd:trap+0x49d
>     --- trap (number 6) ---
>     ?() at 0
>     nvme_poll() at netbsd:nvme_poll+0x154
>     nvme_attach() at netbsd:nvme_attach+0x12d2
>     nvme_pci_attach() at netbsd:nvme_pci_attach+0x6fe
>     ...
>
> nvme_poll+0x154 is "test %eax,%eax" just after returning from
> nvme_q_complete() in nvme.c:1261.
>
> https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1261
>
>     1240 static int
>     1241 nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb,
>     1242     void (*fill)(struct nvme_queue *, struct nvme_ccb *, void *), int timo_sec)
>     1243 {
>     ....
>     1261         if (nvme_q_complete(sc, q) == 0)
>
>     0000000000004903 <nvme_poll>:
>     ....
>         4a52: e8 ee c4 ff ff    callq f45 <nvme_q_complete>
>         4a57: 85 c0             test %eax,%eax
>     ....
>
> I don't understand why "execution of NULL" occurs by such a
> instruction. Maybe this is an incorrect alert. By using kernel
> with option KUBSAN, I found
>
>     nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
>     prevented execution of 0x0 (SMEP)
>     fatal page fault in supervisor mode
>     ...
>     db{0}> reboot
>     UBSan: Undefined Behavior in ../../../../dev/ic/nvme.c:1306:11, member access within null pointer of type 'struct nvme_poll_state'
>     fatal page fault in supervisor mode
>     trap type 6 code 0x2 rip 0xffffffff8102487c cs 0x8 rflags 0x10246 cr2 0x40 ilevel 0x3 rsp 0xffffd0b57e9b5fc0
>     curlwp 0xffffc02d225038c0 pid 0.4 lowest kstack 0xffffd0b57e9b22c0
>     kernel: page fault trap, code=0
>     Stopped in pid 0.4 (system) at netbsd:nvme_poll_done+0x7a: movq %rsi,40(%rbx)
>
> This should be the real cause of panic. Then, backtrace reads,
>
>     db{0}> bt
>     nvme_poll_done() at netbsd:nvme_poll_done+0x7a <--- nvme.c:1306
>     nvme_q_complete() at netbsd:nvme_q_complete+0x259 <--- nvme.c:1374
>     softint_dispatch() at netbsd:softint_dispatch+0x20b
>     DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffffd0b57e9b60f0
>     Xsoftintr() at netbsd:Xsoftintr+0x4f
>     --- interrupt ---
>     0:
>     db{0}>
>
> where
>
> https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1306
>
>     1299 static void
>     1300 nvme_poll_done(struct nvme_queue *q, struct nvme_ccb *ccb,
>     1301     struct nvme_cqe *cqe)
>     1302 {
>     1303         struct nvme_poll_state *state = ccb->ccb_cookie;
>     ....
>     1306         state->c = *cqe;
>
> https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1374
>
>     1327 static int
>     1328 nvme_q_complete(struct nvme_softc *sc, struct nvme_queue *q)
>     1329 {
>     ....
>     1372                 mutex_exit(&q->q_cq_mtx);
>     1373                 ccb->ccb_done(q, ccb, cqe);
>     1374                 mutex_enter(&q->q_cq_mtx);
>
> Do you think why ccb->ccb_cookie becomes NULL? I uploaded full
> dmesg and log of DDB at:
>
> http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg.20190831
>
> Thanks,
> rin
>
Attachment: nvme.diff