NetBSD-Bugs archive


Re: kern/52769: hang with an ffs stored in an nvme device



The following reply was made to PR kern/52769; it has been noted by GNATS.

From: Chuck Silvers <chuq%chuq.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
	netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Fri, 1 Dec 2017 10:45:09 -0800

 I got a copy of the dump from martin and here are some things I found.
 the two buffers that he pointed out are for different parts of the same device:
 
 (gdb) p *(struct buf *)0xffffe4041d8f7240
 $4 = {
   b_u = {
     u_actq = {
       tqe_next = 0xffffe4040e6b6c68, 
       tqe_prev = 0xffffe4029c033ea0
     }, 
     u_rbnode = {
       rb_nodes = {0xffffe4040e6b6c68, 0xffffe4029c033ea0}, 
       rb_info = 0x0
     }, 
     u_work = {
       wk_dummy = 0xffffe4040e6b6c68
     }
   }, 
   b_iodone = 0x0, 
   b_error = 0x0, 
   b_resid = 0x388e, 
   b_flags = 0x404, 
 
 #define B_ASYNC         0x00000004      /* Start I/O, do not wait. */
 #define B_COWDONE       0x00000400      /* Copy-on-write already done. */
 
   b_prio = 0x1, 
   b_bufsize = 0x8000, 
   b_bcount = 0x8000, 
   b_dev = 0x1300, 
   b_data = 0xffff800148420000, 
   b_blkno = 0xddd06c0, 
   b_rawblkno = 0xddd0ec0, 
   b_proc = 0x0, 
   b_saveaddr = 0x0, 
   b_private = 0x0, 
   b_dcookie = 0x0, 
   b_busy = {
     cv_opaque = {0xffffe4040092db80, 0xffffe4040092dc90, 0xffffffff8116c394}
   }, 
   b_refcnt = 0x2, 
   b_unused = 0x0, 
   b_hash = {
     le_next = 0x0, 
     le_prev = 0xffffe403f8913f40
   }, 
   b_vnbufs = {
     le_next = 0xffffe4029c033ea0, 
     le_prev = 0xffffe4040e6b6d18
   }, 
   b_freelist = {
     tqe_next = 0xffffe4041739a6c0, 
     tqe_prev = 0xffffe403b883c308
   }, 
   b_wapbllist = {
     tqe_next = 0x0, 
     tqe_prev = 0x0
   }, 
   b_lblkno = 0xddd06c0, 
   b_freelistindex = 0x1, 
   b_cflags = 0x4800010, 
 
 #define BC_BUSY         0x00000010      /* I/O in progress. */
 #define BC_WANTED       0x00800000      /* Process wants this buffer. */
 #define BC_VFLUSH       0x04000000      /* Buffer is being synced. */
 
   b_vp = 0xffffe4042b5b2bd0, 
   b_done = {
     cv_opaque = {0x0, 0xffffe4041d8f7338, 0xffffffff8116c39c}
   }, 
   b_oflags = 0x0, 
   b_objlock = 0xffffe4042b619640
 }
 
 
 (gdb) p *(struct buf *)0xffffe40294774480
 $5 = {
   b_u = {
     u_actq = {
       tqe_next = 0xffffe403b48ac368, 
       tqe_prev = 0xffffe403b48ac488
     }, 
     u_rbnode = {
       rb_nodes = {0xffffe403b48ac368, 0xffffe403b48ac488}, 
       rb_info = 0x0
     }, 
     u_work = {
       wk_dummy = 0xffffe403b48ac368
     }
   }, 
   b_iodone = 0x0, 
   b_error = 0x0, 
   b_resid = 0x49ec, 
   b_flags = 0x404, 
 
 #define B_ASYNC         0x00000004      /* Start I/O, do not wait. */
 #define B_COWDONE       0x00000400      /* Copy-on-write already done. */
 
   b_prio = 0x1, 
   b_bufsize = 0x8000, 
   b_bcount = 0x8000, 
   b_dev = 0x1300, 
   b_data = 0xffff80014cfb0000, 
   b_blkno = 0x121f1100, 
   b_rawblkno = 0x121f1900, 
   b_proc = 0x0, 
   b_saveaddr = 0x0, 
   b_private = 0x0, 
   b_dcookie = 0x0, 
   b_busy = {
     cv_opaque = {0xffffe403ccf032c0, 0xffffe403ccf033d0, 0xffffffff8116c394}
   }, 
   b_refcnt = 0x2, 
   b_unused = 0x0, 
   b_hash = {
     le_next = 0xffffe403e8c7b480, 
     le_prev = 0xffff80001d75c158
   }, 
   b_vnbufs = {
     le_next = 0xffffe403b48ac488, 
     le_prev = 0xffffe403b48ac418
   }, 
   b_freelist = {
     tqe_next = 0xffffe40412c51120, 
     tqe_prev = 0xffffe4041739a300
   }, 
   b_wapbllist = {
     tqe_next = 0x0, 
     tqe_prev = 0x0
   }, 
   b_lblkno = 0x121f1100, 
   b_freelistindex = 0x1, 
   b_cflags = 0x4800010, 
 
 #define BC_BUSY         0x00000010      /* I/O in progress. */
 #define BC_WANTED       0x00800000      /* Process wants this buffer. */
 #define BC_VFLUSH       0x04000000      /* Buffer is being synced. */
 
   b_vp = 0xffffe4042b5b2bd0, 
   b_done = {
     cv_opaque = {0x0, 0xffffe40294774578, 0xffffffff8116c39c}
   }, 
   b_oflags = 0x0, 
   b_objlock = 0xffffe4042b619640
 }
 
 
 
 b_dev = 0x1300 is "ld0a", which is associated with nvme0 as martin mentioned.
 the buffers are B_ASYNC, so they will be unlocked automatically by biodone().
 however the nvme driver does not have any I/O pending:
 
 (gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_admin_q
 $9 = {
   q_sc = 0xffff80001da34000, 
   q_sq_mtx = {
     u = {
       mtxa_owner = 0x601
     }
   }, 
   q_cq_mtx = {
     u = {
       mtxa_owner = 0x601
     }
   }, 
   q_sq_dmamem = 0xffffe4042f7ceac8, 
   q_cq_dmamem = 0xffffe4042f7ceaf0, 
   q_sqtdbl = 0x1000, 
   q_cqhdbl = 0x1004, 
   q_id = 0x0, 
   q_entries = 0x20, 
   q_sq_tail = 0x4, 
   q_cq_head = 0x4, 
   q_cq_phase = 0x1, 
   q_ccb_mtx = {
     u = {
       mtxa_owner = 0x601
     }
   }, 
   q_nccbs = 0x20, 
   q_nccbs_avail = 0x21, 
   q_ccbs = 0xffffe4011d6e6000, 
   q_ccb_list = {
     sqh_first = 0xffffe4011d6e6000, 
     sqh_last = 0xffffe4011d6e6e88
   }, 
   q_ccb_prpls = 0xffffe4042f7ceb18
 }
 
 (gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_q[0]
 $12 = {
   q_sc = 0xffff80001da34000, 
   q_sq_mtx = {
     u = {
       mtxa_owner = 0x601
     }
   }, 
   q_cq_mtx = {
     u = {
       mtxa_owner = 0x601
     }
   }, 
   q_sq_dmamem = 0xffffe4042f7ceb40, 
   q_cq_dmamem = 0xffffe4042f7ceb68, 
   q_sqtdbl = 0x1008, 
   q_cqhdbl = 0x100c, 
   q_id = 0x1, 
   q_entries = 0x400, 
   q_sq_tail = 0x3b7, 
   q_cq_head = 0x3b7, 
   q_cq_phase = 0x0, 
   q_ccb_mtx = {
     u = {
       mtxa_owner = 0x601
     }
   }, 
   q_nccbs = 0x3ff, 
   q_nccbs_avail = 0x3ff, 
   q_ccbs = 0xffff800020d5b000, 
   q_ccb_list = {
     sqh_first = 0xffff800020d78cb8, 
     sqh_last = 0xffff800020d78f10
   }, 
   q_ccb_prpls = 0xffffe4042f7ceb90
 }
 
 
 the one normal queue is empty (q_nccbs == q_nccbs_avail, and q_sq_tail == q_cq_head).
 the admin queue is a bit odd in that q_nccbs_avail is one greater than q_nccbs,
 but it does appear to also be empty since q_sq_tail == q_cq_head.
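 
 to make the idle test explicit, this is roughly the check being applied to
 each queue above (a sketch only, not code from the driver; the helper name
 is made up for illustration):
 
 static bool
 nvme_queue_idle(const struct nvme_queue *q)
 {
 	/* every ccb is back on the free list ... */
 	if (q->q_nccbs_avail < q->q_nccbs)
 		return false;
 	/* ... and the submission tail has caught up with the completion head */
 	return q->q_sq_tail == q->q_cq_head;
 }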
 
 
 the two buffers above, along with 554 others, are still in the bufq of ld0:
 
 (gdb) p ((struct ld_softc *)ld_cd.cd_devs[0]->dv_private)->sc_dksc.sc_bufq
 $1 = (struct bufq_state *) 0xffffe4042f686ce8
 (gdb) xbufq_fcfs 0xffffe4042f686ce8
 bufq 0xffffe4042f686ce8
 0xffffe4025ff23b48
 0xffffe4025ff23a28
 0xffffe4025ff23908
 0xffffe4025ff237e8
 0xffffe4025ff236c8
 ...
 0xffffe403b23b4480
 0xffffe403cfef36c0
 total buffers 556
 
 
 ld0's dk_softc also indicates that no buffers are currently in the hands of
 the underlying driver:
 
 (gdb) p *(struct ld_softc *)ld_cd.cd_devs[0]->dv_private
 $14 = {
   sc_dksc = {
 ...
     sc_bufq = 0xffffe4042f686ce8, 
     sc_dtype = 0xf, 
     sc_deferred = 0xffffe4025ff23c68, 
     sc_busy = 0x0, 
 ...
   sc_queuecnt = 0x0, 
 ...
   sc_maxqueuecnt = 0x3ff, 
 }
 
 
 so the bug appears to be that the dk layer is sitting on a bunch of buffers
 even though the underlying nvme driver is idle.
 
 one problem I see is in ld_diskstart():
 
 static int
 ld_diskstart(device_t dev, struct buf *bp)
 {
 	struct ld_softc *sc = device_private(dev);
 	int error;
 
 	if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
 		return EAGAIN;
 
 	if ((sc->sc_flags & LDF_MPSAFE) == 0)
 		KERNEL_LOCK(1, curlwp);
 
 	mutex_enter(&sc->sc_mutex);
 
 	if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
 		error = EAGAIN;
 	else {
 		error = (*sc->sc_start)(sc, bp);
 		if (error == 0)
 			sc->sc_queuecnt++;
 	}
 
 	mutex_exit(&sc->sc_mutex);
 
 	if ((sc->sc_flags & LDF_MPSAFE) == 0)
 		KERNEL_UNLOCK_ONE(curlwp);
 
 	return error;
 }
 
 
 
 the first sc_queuecnt check here is done without holding any locks,
 so it can race with lddone() decrementing sc_queuecnt: ld_diskstart()
 returns EAGAIN even though the driver queue is no longer full, and
 dk_start() then stops sending new I/O to the underlying nvme driver.
 lddone() does call dk_start(), but the buffer that ld_diskstart() is
 rejecting is not on the bufq at that point, so the lddone() thread
 wouldn't know about it.  just removing the unlocked check of
 sc_queuecnt here would fix this race.
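 
 concretely, the fix would look something like this (the function as quoted
 above with only the unlocked check dropped; an untested sketch):
 
 static int
 ld_diskstart(device_t dev, struct buf *bp)
 {
 	struct ld_softc *sc = device_private(dev);
 	int error;
 
 	/* no unlocked peek at sc_queuecnt; only test it under sc_mutex */
 
 	if ((sc->sc_flags & LDF_MPSAFE) == 0)
 		KERNEL_LOCK(1, curlwp);
 
 	mutex_enter(&sc->sc_mutex);
 
 	if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
 		error = EAGAIN;
 	else {
 		error = (*sc->sc_start)(sc, bp);
 		if (error == 0)
 			sc->sc_queuecnt++;
 	}
 
 	mutex_exit(&sc->sc_mutex);
 
 	if ((sc->sc_flags & LDF_MPSAFE) == 0)
 		KERNEL_UNLOCK_ONE(curlwp);
 
 	return error;
 }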
 
 however, it seems unlikely that this race would result in more than 1 I/O
 becoming stuck in the bufq, and in this dump we have a bunch of them,
 so it seems very likely that there are more problems that I haven't found yet.
 
 the logic in dk_start() that allows two threads to be processing the bufq
 in parallel looks like it could result in losing track of a buffer entirely.
 if both threads get EAGAIN from d_diskstart then they will both try to
 save the buffer pointer in dksc->sc_deferred, and the first one will be lost.
 but that isn't the problem with the two buffers above since they are both
 still on the bufq.  the comment says this special hidey hole for one buffer
 is intended to keep the order of I/Os the same, but that isn't needed for
 correctness, and doing that in the dk layer rather than in the bufq layer
 seems poor.  if we want a way to undo a bufq_get() rather than just putting
 the buffer back on the bufq with bufq_put(), then we should add a new API
 for that rather than making callers deal with it.
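 
 for illustration only, such an API might look like this (bufq_unput() is a
 made-up name and does not exist today; the loop is a simplified sketch of
 what dk_start() could do with it, not its actual body, and the field and
 parameter names are only illustrative):
 
 /* hypothetical: return a buffer to the head of the queue, keeping I/O order */
 void	bufq_unput(struct bufq_state *bufq, struct buf *bp);
 
 	while ((bp = bufq_get(dksc->sc_bufq)) != NULL) {
 		error = d_diskstart(dksc->sc_dev, bp);
 		if (error == EAGAIN) {
 			/*
 			 * driver can't take it right now: undo the get
 			 * instead of stashing the buffer in sc_deferred.
 			 */
 			bufq_unput(dksc->sc_bufq, bp);
 			break;
 		}
 	}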
 
 (for nvme especially it would be better to not require putting all buffers
 for a device on a single bufq at all, since that rather reduces the benefit
 of the hardware having multiple hardware queues.  but that kind of design
 change is beyond the scope of the issue at hand.)
 
 I'll keep looking for more bugs that could have caused this.
 
 -Chuck
 

