NetBSD-Bugs archive
Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs >= 32 ?
The following reply was made to PR kern/54503; it has been noted by GNATS.
From: Rin Okuyama <rokuyama.rk%gmail.com@localhost>
To: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
Cc: "gnats-bugs%NetBSD.org@localhost" <gnats-bugs%netbsd.org@localhost>
Subject: Re: kern/54503: Panic during attaching nvme(4) when # of logical CPUs
>= 32 ?
Date: Sat, 31 Aug 2019 18:52:18 +0900
On 2019/08/30 23:28, Jaromír Doleček wrote:
> Can you please try a kernel with nvme_q_complete() marked __noinline,
> to see where exactly inside that function the code panics? I've
> reviewed the code and I don't see any particular reason why it would
> fail while setting up 32nd queue.
nvme_q_complete() is not expanded inline even without __noinline.
The panic message and the instruction at the fault address are as follows:
allocated pic msix4 type edge pin 31 level 6 to cpu0 slot 22 idt entry 136
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
...
db{0}> bt
db_disasm() at netbsd:db_disasm+0xcd
db_trap() at netbsd:db_trap+0x16b
kdb_trap() at netbsd:kdb_trap+0x12a
trap() at netbsd:trap+0x49d
--- trap (number 6) ---
?() at 0
nvme_poll() at netbsd:nvme_poll+0x154
nvme_attach() at netbsd:nvme_attach+0x12d2
nvme_pci_attach() at netbsd:nvme_pci_attach+0x6fe
...
nvme_poll+0x154 is "test %eax,%eax" just after returning from
nvme_q_complete() in nvme.c:1261.
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1261
1240 static int
1241 nvme_poll(struct nvme_softc *sc, struct nvme_queue *q, struct nvme_ccb *ccb,
1242 void (*fill)(struct nvme_queue *, struct nvme_ccb *, void *), int timo_sec)
1243 {
....
1261 if (nvme_q_complete(sc, q) == 0)
0000000000004903 <nvme_poll>:
....
4a52: e8 ee c4 ff ff callq f45 <nvme_q_complete>
4a57: 85 c0 test %eax,%eax
....
I don't understand why "execution of NULL" is reported for such an
instruction; maybe this alert is spurious. Using a kernel built with
option KUBSAN, I found:
nvme0: for io queue 31 interrupting at msix4 vec 31 affinity to cpu30
prevented execution of 0x0 (SMEP)
fatal page fault in supervisor mode
...
db{0}> reboot
UBSan: Undefined Behavior in ../../../../dev/ic/nvme.c:1306:11, member access within null pointer of type 'struct nvme_poll_state'
fatal page fault in supervisor mode
trap type 6 code 0x2 rip 0xffffffff8102487c cs 0x8 rflags 0x10246 cr2 0x40 ilevel 0x3 rsp 0xffffd0b57e9b5fc0
curlwp 0xffffc02d225038c0 pid 0.4 lowest kstack 0xffffd0b57e9b22c0
kernel: page fault trap, code=0
Stopped in pid 0.4 (system) at netbsd:nvme_poll_done+0x7a: movq %rsi,40(%rbx)
This should be the real cause of the panic. The backtrace then reads:
db{0}> bt
nvme_poll_done() at netbsd:nvme_poll_done+0x7a <--- nvme.c:1306
nvme_q_complete() at netbsd:nvme_q_complete+0x259 <--- nvme.c:1374
softint_dispatch() at netbsd:softint_dispatch+0x20b
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffffd0b57e9b60f0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
db{0}>
where
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1306
1299 static void
1300 nvme_poll_done(struct nvme_queue *q, struct nvme_ccb *ccb,
1301 struct nvme_cqe *cqe)
1302 {
1303 struct nvme_poll_state *state = ccb->ccb_cookie;
....
1306 state->c = *cqe;
https://nxr.netbsd.org/xref/src/sys/dev/ic/nvme.c#1374
1327 static int
1328 nvme_q_complete(struct nvme_softc *sc, struct nvme_queue *q)
1329 {
....
1372 mutex_exit(&q->q_cq_mtx);
1373 ccb->ccb_done(q, ccb, cqe);
1374 mutex_enter(&q->q_cq_mtx);
Do you have any idea why ccb->ccb_cookie becomes NULL? I uploaded the
full dmesg and DDB log at:
http://www.netbsd.org/~rin/nvme_panic_20190829/dmesg.20190831
Thanks,
rin