NetBSD-Bugs archive


Re: kern/52769: hang with an ffs stored in an nvme device



The following reply was made to PR kern/52769; it has been noted by GNATS.

From: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
To: Chuck Silvers <chuq%chuq.com@localhost>
Cc: gnats-bugs%netbsd.org@localhost, gnats-admin%netbsd.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Mon, 4 Dec 2017 23:20:14 +0100

 Hello Chuck,
 
 The race WRT sc_deferred shouldn't be possible, as Michael noted, but I've
 now added a KASSERT() there. Can you retry with rev. 1.101 of
 sys/dev/dksubr.c to rule that out as the issue?
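 
 For illustration only, this is a sketch of the general shape of such an
 assertion (not the committed diff); it guards the deferred slot before a
 rejected buffer is stashed there:
 
         /* sketch; see sys/dev/dksubr.c rev. 1.101 for the actual change */
         if (error == EAGAIN) {
                 KASSERT(dksc->sc_deferred == NULL);
                 dksc->sc_deferred = bp;
         }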
 
 Jaromir
 
 2017-12-01 19:45 GMT+01:00 Chuck Silvers <chuq%chuq.com@localhost>:
 
 > I got a copy of the dump from martin and here are some things I found.
 > the two buffers that he pointed out are for different parts of the same
 > device:
 >
 > (gdb) p *(struct buf *)0xffffe4041d8f7240
 > $4 = {
 >   b_u = {
 >     u_actq = {
 >       tqe_next = 0xffffe4040e6b6c68,
 >       tqe_prev = 0xffffe4029c033ea0
 >     },
 >     u_rbnode = {
 >       rb_nodes = {0xffffe4040e6b6c68, 0xffffe4029c033ea0},
 >       rb_info = 0x0
 >     },
 >     u_work = {
 >       wk_dummy = 0xffffe4040e6b6c68
 >     }
 >   },
 >   b_iodone = 0x0,
 >   b_error = 0x0,
 >   b_resid = 0x388e,
 >   b_flags = 0x404,
 >
 > #define B_ASYNC         0x00000004      /* Start I/O, do not wait. */
 > #define B_COWDONE       0x00000400      /* Copy-on-write already done. */
 >
 >   b_prio = 0x1,
 >   b_bufsize = 0x8000,
 >   b_bcount = 0x8000,
 >   b_dev = 0x1300,
 >   b_data = 0xffff800148420000,
 >   b_blkno = 0xddd06c0,
 >   b_rawblkno = 0xddd0ec0,
 >   b_proc = 0x0,
 >   b_saveaddr = 0x0,
 >   b_private = 0x0,
 >   b_dcookie = 0x0,
 >   b_busy = {
 >     cv_opaque = {0xffffe4040092db80, 0xffffe4040092dc90,
 > 0xffffffff8116c394}
 >   },
 >   b_refcnt = 0x2,
 >   b_unused = 0x0,
 >   b_hash = {
 >     le_next = 0x0,
 >     le_prev = 0xffffe403f8913f40
 >   },
 >   b_vnbufs = {
 >     le_next = 0xffffe4029c033ea0,
 >     le_prev = 0xffffe4040e6b6d18
 >   },
 >   b_freelist = {
 >     tqe_next = 0xffffe4041739a6c0,
 >     tqe_prev = 0xffffe403b883c308
 >   },
 >   b_wapbllist = {
 >     tqe_next = 0x0,
 >     tqe_prev = 0x0
 >   },
 >   b_lblkno = 0xddd06c0,
 >   b_freelistindex = 0x1,
 >   b_cflags = 0x4800010,
 >
 > #define BC_BUSY         0x00000010      /* I/O in progress. */
 > #define BC_WANTED       0x00800000      /* Process wants this buffer. */
 > #define BC_VFLUSH       0x04000000      /* Buffer is being synced. */
 >
 >   b_vp = 0xffffe4042b5b2bd0,
 >   b_done = {
 >     cv_opaque = {0x0, 0xffffe4041d8f7338, 0xffffffff8116c39c}
 >   },
 >   b_oflags = 0x0,
 >   b_objlock = 0xffffe4042b619640
 > }
 >
 >
 > (gdb) p *(struct buf *)0xffffe40294774480
 > $5 = {
 >   b_u = {
 >     u_actq = {
 >       tqe_next = 0xffffe403b48ac368,
 >       tqe_prev = 0xffffe403b48ac488
 >     },
 >     u_rbnode = {
 >       rb_nodes = {0xffffe403b48ac368, 0xffffe403b48ac488},
 >       rb_info = 0x0
 >     },
 >     u_work = {
 >       wk_dummy = 0xffffe403b48ac368
 >     }
 >   },
 >   b_iodone = 0x0,
 >   b_error = 0x0,
 >   b_resid = 0x49ec,
 >   b_flags = 0x404,
 >
 > #define B_ASYNC         0x00000004      /* Start I/O, do not wait. */
 > #define B_COWDONE       0x00000400      /* Copy-on-write already done. */
 >
 >   b_prio = 0x1,
 >   b_bufsize = 0x8000,
 >   b_bcount = 0x8000,
 >   b_dev = 0x1300,
 >   b_data = 0xffff80014cfb0000,
 >   b_blkno = 0x121f1100,
 >   b_rawblkno = 0x121f1900,
 >   b_proc = 0x0,
 >   b_saveaddr = 0x0,
 >   b_private = 0x0,
 >   b_dcookie = 0x0,
 >   b_busy = {
 >     cv_opaque = {0xffffe403ccf032c0, 0xffffe403ccf033d0,
 > 0xffffffff8116c394}
 >   },
 >   b_refcnt = 0x2,
 >   b_unused = 0x0,
 >   b_hash = {
 >     le_next = 0xffffe403e8c7b480,
 >     le_prev = 0xffff80001d75c158
 >   },
 >   b_vnbufs = {
 >     le_next = 0xffffe403b48ac488,
 >     le_prev = 0xffffe403b48ac418
 >   },
 >   b_freelist = {
 >     tqe_next = 0xffffe40412c51120,
 >     tqe_prev = 0xffffe4041739a300
 >   },
 >   b_wapbllist = {
 >     tqe_next = 0x0,
 >     tqe_prev = 0x0
 >   },
 >   b_lblkno = 0x121f1100,
 >   b_freelistindex = 0x1,
 >   b_cflags = 0x4800010,
 >
 > #define BC_BUSY         0x00000010      /* I/O in progress. */
 > #define BC_WANTED       0x00800000      /* Process wants this buffer. */
 > #define BC_VFLUSH       0x04000000      /* Buffer is being synced. */
 >
 >   b_vp = 0xffffe4042b5b2bd0,
 >   b_done = {
 >     cv_opaque = {0x0, 0xffffe40294774578, 0xffffffff8116c39c}
 >   },
 >   b_oflags = 0x0,
 >   b_objlock = 0xffffe4042b619640
 > }
 >
 >
 >
 > b_dev = 0x1300 is "ld0a", which is associated with nvme0 as martin
 > mentioned.
 > the buffers are B_ASYNC, so they will be unlocked automatically by
 > biodone().
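 >
 > as a quick cross-check of the flag and device decoding above, here is a
 > standalone user-level sketch (the flag values are copied from the
 > #defines quoted in the dump; the simplified major/minor split is a
 > shorthand here, not the real major()/minor() macros):
 >
 > #include <stdio.h>
 >
 > int
 > main(void)
 > {
 >         unsigned b_flags = 0x404, b_cflags = 0x4800010, b_dev = 0x1300;
 >
 >         /* 0x404 == B_ASYNC (0x00000004) | B_COWDONE (0x00000400) */
 >         printf("flags left over: %#x\n",
 >             b_flags & ~(0x00000004u | 0x00000400u));
 >         /* 0x4800010 == BC_BUSY | BC_WANTED | BC_VFLUSH */
 >         printf("cflags left over: %#x\n",
 >             b_cflags & ~(0x00000010u | 0x00800000u | 0x04000000u));
 >         /* 0x1300: major 0x13, minor 0 -> unit 0, partition 'a', i.e. ld0a */
 >         printf("major %#x minor %#x\n",
 >             (b_dev >> 8) & 0xfffu, b_dev & 0xffu);
 >         return 0;
 > }
 >
 > both "left over" values print as 0, i.e. the two flag words decode to
 > exactly the flags annotated above.
 >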
 > however the nvme driver does not have any I/O pending:
 >
 > (gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_admin_q
 > $9 = {
 >   q_sc = 0xffff80001da34000,
 >   q_sq_mtx = {
 >     u = {
 >       mtxa_owner = 0x601
 >     }
 >   },
 >   q_cq_mtx = {
 >     u = {
 >       mtxa_owner = 0x601
 >     }
 >   },
 >   q_sq_dmamem = 0xffffe4042f7ceac8,
 >   q_cq_dmamem = 0xffffe4042f7ceaf0,
 >   q_sqtdbl = 0x1000,
 >   q_cqhdbl = 0x1004,
 >   q_id = 0x0,
 >   q_entries = 0x20,
 >   q_sq_tail = 0x4,
 >   q_cq_head = 0x4,
 >   q_cq_phase = 0x1,
 >   q_ccb_mtx = {
 >     u = {
 >       mtxa_owner = 0x601
 >     }
 >   },
 >   q_nccbs = 0x20,
 >   q_nccbs_avail = 0x21,
 >   q_ccbs = 0xffffe4011d6e6000,
 >   q_ccb_list = {
 >     sqh_first = 0xffffe4011d6e6000,
 >     sqh_last = 0xffffe4011d6e6e88
 >   },
 >   q_ccb_prpls = 0xffffe4042f7ceb18
 > }
 >
 > (gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_q[0]
 > $12 = {
 >   q_sc = 0xffff80001da34000,
 >   q_sq_mtx = {
 >     u = {
 >       mtxa_owner = 0x601
 >     }
 >   },
 >   q_cq_mtx = {
 >     u = {
 >       mtxa_owner = 0x601
 >     }
 >   },
 >   q_sq_dmamem = 0xffffe4042f7ceb40,
 >   q_cq_dmamem = 0xffffe4042f7ceb68,
 >   q_sqtdbl = 0x1008,
 >   q_cqhdbl = 0x100c,
 >   q_id = 0x1,
 >   q_entries = 0x400,
 >   q_sq_tail = 0x3b7,
 >   q_cq_head = 0x3b7,
 >   q_cq_phase = 0x0,
 >   q_ccb_mtx = {
 >     u = {
 >       mtxa_owner = 0x601
 >     }
 >   },
 >   q_nccbs = 0x3ff,
 >   q_nccbs_avail = 0x3ff,
 >   q_ccbs = 0xffff800020d5b000,
 >   q_ccb_list = {
 >     sqh_first = 0xffff800020d78cb8,
 >     sqh_last = 0xffff800020d78f10
 >   },
 >   q_ccb_prpls = 0xffffe4042f7ceb90
 > }
 >
 >
 > the one normal queue is empty (q_nccbs == q_nccbs_avail, and q_sq_tail ==
 > q_cq_head).
 > the admin queue is a bit odd in that q_nccbs_avail is one greater than
 > q_nccbs,
 > but it does appear to also be empty since q_sq_tail == q_cq_head.
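 >
 > written out as a hypothetical helper, purely to make the "idle" test
 > explicit (this function does not exist in the driver, and the queue
 > struct type name is assumed):
 >
 > static bool
 > nvme_q_looks_idle(const struct nvme_queue *q)
 > {
 >         /* every ccb is back on the free list, and the completion side
 >            has consumed everything that was submitted */
 >         return q->q_nccbs_avail >= q->q_nccbs &&
 >             q->q_sq_tail == q->q_cq_head;
 > }
 >
 > by that test both sc_admin_q and sc_q[0] are idle.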
 >
 >
 > the two buffers above, along with 554 others, are still in the bufq of ld0:
 >
 > (gdb) p ((struct ld_softc *)ld_cd.cd_devs[0]->dv_private)->sc_dksc.sc_bufq
 > $1 = (struct bufq_state *) 0xffffe4042f686ce8
 > (gdb) xbufq_fcfs 0xffffe4042f686ce8
 > bufq 0xffffe4042f686ce8
 > 0xffffe4025ff23b48
 > 0xffffe4025ff23a28
 > 0xffffe4025ff23908
 > 0xffffe4025ff237e8
 > 0xffffe4025ff236c8
 > ...
 > 0xffffe403b23b4480
 > 0xffffe403cfef36c0
 > total buffers 556
 >
 >
 > ld0's dk_softc also indicates that no buffers are currently in the hands of
 > the underlying driver:
 >
 > (gdb) p *(struct ld_softc *)ld_cd.cd_devs[0]->dv_private
 > $14 = {
 >   sc_dksc = {
 > ...
 >     sc_bufq = 0xffffe4042f686ce8,
 >     sc_dtype = 0xf,
 >     sc_deferred = 0xffffe4025ff23c68,
 >     sc_busy = 0x0,
 > ...
 >   sc_queuecnt = 0x0,
 > ...
 >   sc_maxqueuecnt = 0x3ff,
 > }
 >
 >
 > so the bug appears to be that the dk layer is sitting on a bunch of buffers
 > even though the underlying nvme driver is idle.
 >
 > one problem I see is in ld_diskstart():
 >
 > static int
 > ld_diskstart(device_t dev, struct buf *bp)
 > {
 >         struct ld_softc *sc = device_private(dev);
 >         int error;
 >
 >         if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
 >                 return EAGAIN;
 >
 >         if ((sc->sc_flags & LDF_MPSAFE) == 0)
 >                 KERNEL_LOCK(1, curlwp);
 >
 >         mutex_enter(&sc->sc_mutex);
 >
 >         if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
 >                 error = EAGAIN;
 >         else {
 >                 error = (*sc->sc_start)(sc, bp);
 >                 if (error == 0)
 >                         sc->sc_queuecnt++;
 >         }
 >
 >         mutex_exit(&sc->sc_mutex);
 >
 >         if ((sc->sc_flags & LDF_MPSAFE) == 0)
 >                 KERNEL_UNLOCK_ONE(curlwp);
 >
 >         return error;
 > }
 >
 >
 >
 > the first sc_queuecnt check here is done without holding any locks,
 > so it can race with lddone() decrementing sc_queuecnt and spuriously
 > return EAGAIN, which causes dk_start() to stop sending new I/O to the
 > underlying nvme driver even though the driver's queue is no longer full.
 > lddone() does call dk_start(), but the buffer that ld_diskstart() is
 > rejecting is not on the bufq at that point, so the lddone() thread
 > wouldn't know about it.
 > just removing the unlocked check of sc_queuecnt would fix this race.
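 >
 > i.e. roughly this (a sketch of the proposed change, not a tested diff;
 > it is the quoted function with only the unlocked pre-check removed):
 >
 > static int
 > ld_diskstart(device_t dev, struct buf *bp)
 > {
 >         struct ld_softc *sc = device_private(dev);
 >         int error;
 >
 >         /* no unlocked pre-check; decide only under sc_mutex */
 >         if ((sc->sc_flags & LDF_MPSAFE) == 0)
 >                 KERNEL_LOCK(1, curlwp);
 >
 >         mutex_enter(&sc->sc_mutex);
 >
 >         if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
 >                 error = EAGAIN;
 >         else {
 >                 error = (*sc->sc_start)(sc, bp);
 >                 if (error == 0)
 >                         sc->sc_queuecnt++;
 >         }
 >
 >         mutex_exit(&sc->sc_mutex);
 >
 >         if ((sc->sc_flags & LDF_MPSAFE) == 0)
 >                 KERNEL_UNLOCK_ONE(curlwp);
 >
 >         return error;
 > }
 >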
 >
 > however, it seems unlikely that this race would result in more than 1 I/O
 > becoming stuck in the bufq, and in this dump we have a bunch of them,
 > so it seems very likely that there are more problems that I haven't found
 > yet.
 >
 > the logic in dk_start() that allows two threads to be processing the bufq
 > in parallel looks like it could result in losing track of a buffer
 > entirely.
 > if both threads get EAGAIN from d_diskstart then they will both try to
 > save the buffer pointer in dksc->sc_deferred, and the first one will be
 > lost.
 > but that isn't the problem with the two buffers above since they are both
 > still on the bufq.  the comment says this special hidey hole for one buffer
 > is intended to keep the order of I/Os the same, but that isn't needed for
 > correctness, and doing that in the dk layer rather than in the bufq layer
 > seems poor.  if we want a way to undo a bufq_get() rather than just putting
 > the buffer back on the bufq with bufq_put(), then we should add a new API
 > for that rather than making callers deal with it.
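 >
 > purely as a sketch of what such an API might look like (bufq_putback()
 > is a made-up name here, not an existing function):
 >
 > /*
 >  * Return a buffer previously obtained with bufq_get() to the head of
 >  * the queue, so it is handed out again before any newer buffers.
 >  */
 > void    bufq_putback(struct bufq_state *bufq, struct buf *bp);
 >
 > dk_start() could then drop sc_deferred entirely and just call
 > bufq_putback() when d_diskstart() returns EAGAIN.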
 >
 > (for nvme especially it would be better to not require putting all buffers
 > for a device on a single bufq at all, since that rather reduces the benefit
 > of the hardware having multiple hardware queues.  but that kind of design
 > change is beyond the scope of the issue at hand.)
 >
 > I'll keep looking for more bugs that could have caused this.
 >
 > -Chuck
 >
 