NetBSD-Bugs archive
Re: kern/52769: hang with an ffs stored in an nvme device
The following reply was made to PR kern/52769; it has been noted by GNATS.
From: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
To: Chuck Silvers <chuq%chuq.com@localhost>
Cc: gnats-bugs%netbsd.org@localhost, gnats-admin%netbsd.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Mon, 4 Dec 2017 23:20:14 +0100
Hello Chuck,

The race WRT sc_deferred shouldn't be possible, as Michael noted, but I've
added a KASSERT() there now. Can you retry with rev. 1.101 of
sys/dev/dksubr.c to rule that out as the issue?
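
For reference, the assertion is along these lines (a sketch only; the
dk_start() context around sc_deferred is paraphrased here, not quoted
from the actual rev. 1.101 change):

        /* sketch: catch a second thread trying to park a buffer while
         * one is already stashed in sc_deferred */
        if (error == EAGAIN) {
                KASSERT(dksc->sc_deferred == NULL);
                dksc->sc_deferred = bp;
                break;
        }
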
Jaromir
2017-12-01 19:45 GMT+01:00 Chuck Silvers <chuq%chuq.com@localhost>:
> I got a copy of the dump from martin and here are some things I found.
> the two buffers that he pointed out are for different parts of the same
> device:
>
> (gdb) p *(struct buf *)0xffffe4041d8f7240
> $4 = {
> b_u = {
> u_actq = {
> tqe_next = 0xffffe4040e6b6c68,
> tqe_prev = 0xffffe4029c033ea0
> },
> u_rbnode = {
> rb_nodes = {0xffffe4040e6b6c68, 0xffffe4029c033ea0},
> rb_info = 0x0
> },
> u_work = {
> wk_dummy = 0xffffe4040e6b6c68
> }
> },
> b_iodone = 0x0,
> b_error = 0x0,
> b_resid = 0x388e,
> b_flags = 0x404,
>
> #define B_ASYNC 0x00000004 /* Start I/O, do not wait. */
> #define B_COWDONE 0x00000400 /* Copy-on-write already done. */
>
> b_prio = 0x1,
> b_bufsize = 0x8000,
> b_bcount = 0x8000,
> b_dev = 0x1300,
> b_data = 0xffff800148420000,
> b_blkno = 0xddd06c0,
> b_rawblkno = 0xddd0ec0,
> b_proc = 0x0,
> b_saveaddr = 0x0,
> b_private = 0x0,
> b_dcookie = 0x0,
> b_busy = {
> cv_opaque = {0xffffe4040092db80, 0xffffe4040092dc90,
> 0xffffffff8116c394}
> },
> b_refcnt = 0x2,
> b_unused = 0x0,
> b_hash = {
> le_next = 0x0,
> le_prev = 0xffffe403f8913f40
> },
> b_vnbufs = {
> le_next = 0xffffe4029c033ea0,
> le_prev = 0xffffe4040e6b6d18
> },
> b_freelist = {
> tqe_next = 0xffffe4041739a6c0,
> tqe_prev = 0xffffe403b883c308
> },
> b_wapbllist = {
> tqe_next = 0x0,
> tqe_prev = 0x0
> },
> b_lblkno = 0xddd06c0,
> b_freelistindex = 0x1,
> b_cflags = 0x4800010,
>
> #define BC_BUSY 0x00000010 /* I/O in progress. */
> #define BC_WANTED 0x00800000 /* Process wants this buffer. */
> #define BC_VFLUSH 0x04000000 /* Buffer is being synced. */
>
> b_vp = 0xffffe4042b5b2bd0,
> b_done = {
> cv_opaque = {0x0, 0xffffe4041d8f7338, 0xffffffff8116c39c}
> },
> b_oflags = 0x0,
> b_objlock = 0xffffe4042b619640
> }
>
>
> (gdb) p *(struct buf *)0xffffe40294774480
> $5 = {
> b_u = {
> u_actq = {
> tqe_next = 0xffffe403b48ac368,
> tqe_prev = 0xffffe403b48ac488
> },
> u_rbnode = {
> rb_nodes = {0xffffe403b48ac368, 0xffffe403b48ac488},
> rb_info = 0x0
> },
> u_work = {
> wk_dummy = 0xffffe403b48ac368
> }
> },
> b_iodone = 0x0,
> b_error = 0x0,
> b_resid = 0x49ec,
> b_flags = 0x404,
>
> #define B_ASYNC 0x00000004 /* Start I/O, do not wait. */
> #define B_COWDONE 0x00000400 /* Copy-on-write already done. */
>
> b_prio = 0x1,
> b_bufsize = 0x8000,
> b_bcount = 0x8000,
> b_dev = 0x1300,
> b_data = 0xffff80014cfb0000,
> b_blkno = 0x121f1100,
> b_rawblkno = 0x121f1900,
> b_proc = 0x0,
> b_saveaddr = 0x0,
> b_private = 0x0,
> b_dcookie = 0x0,
> b_busy = {
> cv_opaque = {0xffffe403ccf032c0, 0xffffe403ccf033d0,
> 0xffffffff8116c394}
> },
> b_refcnt = 0x2,
> b_unused = 0x0,
> b_hash = {
> le_next = 0xffffe403e8c7b480,
> le_prev = 0xffff80001d75c158
> },
> b_vnbufs = {
> le_next = 0xffffe403b48ac488,
> le_prev = 0xffffe403b48ac418
> },
> b_freelist = {
> tqe_next = 0xffffe40412c51120,
> tqe_prev = 0xffffe4041739a300
> },
> b_wapbllist = {
> tqe_next = 0x0,
> tqe_prev = 0x0
> },
> b_lblkno = 0x121f1100,
> b_freelistindex = 0x1,
> b_cflags = 0x4800010,
>
> #define BC_BUSY 0x00000010 /* I/O in progress. */
> #define BC_WANTED 0x00800000 /* Process wants this buffer. */
> #define BC_VFLUSH 0x04000000 /* Buffer is being synced. */
>
> b_vp = 0xffffe4042b5b2bd0,
> b_done = {
> cv_opaque = {0x0, 0xffffe40294774578, 0xffffffff8116c39c}
> },
> b_oflags = 0x0,
> b_objlock = 0xffffe4042b619640
> }
>
>
>
> b_dev = 0x1300 is "ld0a", which is associated with nvme0 as martin
> mentioned.
> the buffers are B_ASYNC, so they will be unlocked automatically by
> biodone().
> however the nvme driver does not have any I/O pending:
>
> (gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_admin_q
> $9 = {
> q_sc = 0xffff80001da34000,
> q_sq_mtx = {
> u = {
> mtxa_owner = 0x601
> }
> },
> q_cq_mtx = {
> u = {
> mtxa_owner = 0x601
> }
> },
> q_sq_dmamem = 0xffffe4042f7ceac8,
> q_cq_dmamem = 0xffffe4042f7ceaf0,
> q_sqtdbl = 0x1000,
> q_cqhdbl = 0x1004,
> q_id = 0x0,
> q_entries = 0x20,
> q_sq_tail = 0x4,
> q_cq_head = 0x4,
> q_cq_phase = 0x1,
> q_ccb_mtx = {
> u = {
> mtxa_owner = 0x601
> }
> },
> q_nccbs = 0x20,
> q_nccbs_avail = 0x21,
> q_ccbs = 0xffffe4011d6e6000,
> q_ccb_list = {
> sqh_first = 0xffffe4011d6e6000,
> sqh_last = 0xffffe4011d6e6e88
> },
> q_ccb_prpls = 0xffffe4042f7ceb18
> }
>
> (gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_q[0]
> $12 = {
> q_sc = 0xffff80001da34000,
> q_sq_mtx = {
> u = {
> mtxa_owner = 0x601
> }
> },
> q_cq_mtx = {
> u = {
> mtxa_owner = 0x601
> }
> },
> q_sq_dmamem = 0xffffe4042f7ceb40,
> q_cq_dmamem = 0xffffe4042f7ceb68,
> q_sqtdbl = 0x1008,
> q_cqhdbl = 0x100c,
> q_id = 0x1,
> q_entries = 0x400,
> q_sq_tail = 0x3b7,
> q_cq_head = 0x3b7,
> q_cq_phase = 0x0,
> q_ccb_mtx = {
> u = {
> mtxa_owner = 0x601
> }
> },
> q_nccbs = 0x3ff,
> q_nccbs_avail = 0x3ff,
> q_ccbs = 0xffff800020d5b000,
> q_ccb_list = {
> sqh_first = 0xffff800020d78cb8,
> sqh_last = 0xffff800020d78f10
> },
> q_ccb_prpls = 0xffffe4042f7ceb90
> }
>
>
> the one normal queue is empty (q_nccbs == q_nccbs_avail, and q_sq_tail ==
> q_cq_head).
> the admin queue is a bit odd in that q_nccbs_avail is one greater than
> q_nccbs,
> but it does appear to also be empty since q_sq_tail == q_cq_head.
>
>
> the two buffers above, along with 554 others, are still in the bufq of ld0:
>
> (gdb) p ((struct ld_softc *)ld_cd.cd_devs[0]->dv_private)->sc_dksc.sc_bufq
> $1 = (struct bufq_state *) 0xffffe4042f686ce8
> (gdb) xbufq_fcfs 0xffffe4042f686ce8
> bufq 0xffffe4042f686ce8
> 0xffffe4025ff23b48
> 0xffffe4025ff23a28
> 0xffffe4025ff23908
> 0xffffe4025ff237e8
> 0xffffe4025ff236c8
> ...
> 0xffffe403b23b4480
> 0xffffe403cfef36c0
> total buffers 556
>
>
> ld0's dk_softc also indicates that no buffers are currently in the hands of
> the underlying driver:
>
> (gdb) p *(struct ld_softc *)ld_cd.cd_devs[0]->dv_private
> $14 = {
> sc_dksc = {
> ...
> sc_bufq = 0xffffe4042f686ce8,
> sc_dtype = 0xf,
> sc_deferred = 0xffffe4025ff23c68,
> sc_busy = 0x0,
> ...
> sc_queuecnt = 0x0,
> ...
> sc_maxqueuecnt = 0x3ff,
> }
>
>
> so the bug appears to be that the dk layer is sitting on a bunch of buffers
> even though the underlying nvme driver is idle.
>
> one problem I see is in ld_diskstart():
>
> static int
> ld_diskstart(device_t dev, struct buf *bp)
> {
>         struct ld_softc *sc = device_private(dev);
>         int error;
>
>         if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
>                 return EAGAIN;
>
>         if ((sc->sc_flags & LDF_MPSAFE) == 0)
>                 KERNEL_LOCK(1, curlwp);
>
>         mutex_enter(&sc->sc_mutex);
>
>         if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
>                 error = EAGAIN;
>         else {
>                 error = (*sc->sc_start)(sc, bp);
>                 if (error == 0)
>                         sc->sc_queuecnt++;
>         }
>
>         mutex_exit(&sc->sc_mutex);
>
>         if ((sc->sc_flags & LDF_MPSAFE) == 0)
>                 KERNEL_UNLOCK_ONE(curlwp);
>
>         return error;
> }
>
>
>
> the first sc_queuecnt check here is done without holding any locks,
> so it can race with lddone() decrementing sc_queuecnt and spuriously
> return EAGAIN, which causes dk_start() to stop sending new I/O to the
> underlying nvme driver even though the driver queue is no longer full.
> lddone() calls dk_start(), but the buffer that ld_diskstart() is rejecting
> is not on the bufq at that point so the lddone() thread wouldn't know
> about it.
> just removing the unlocked check of sc_queuecnt here would fix this race.
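> just to make that concrete, a sketch of the function with only the
> locked check left (everything else unchanged from the code above):
>
> static int
> ld_diskstart(device_t dev, struct buf *bp)
> {
>         struct ld_softc *sc = device_private(dev);
>         int error;
>
>         if ((sc->sc_flags & LDF_MPSAFE) == 0)
>                 KERNEL_LOCK(1, curlwp);
>
>         mutex_enter(&sc->sc_mutex);
>
>         /* the only sc_queuecnt check is now under sc_mutex, so it
>          * cannot race with lddone() decrementing the count */
>         if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
>                 error = EAGAIN;
>         else {
>                 error = (*sc->sc_start)(sc, bp);
>                 if (error == 0)
>                         sc->sc_queuecnt++;
>         }
>
>         mutex_exit(&sc->sc_mutex);
>
>         if ((sc->sc_flags & LDF_MPSAFE) == 0)
>                 KERNEL_UNLOCK_ONE(curlwp);
>
>         return error;
> }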
>
> however, it seems unlikely that this race would result in more than 1 I/O
> becoming stuck in the bufq, and in this dump we have a bunch of them,
> so it seems very likely that there are more problems that I haven't found
> yet.
>
> the logic in dk_start() that allows two threads to be processing the bufq
> in parallel looks like it could result in losing track of a buffer
> entirely.
> if both threads get EAGAIN from d_diskstart then they will both try to
> save the buffer pointer in dksc->sc_deferred, and the first one will be
> lost.
> but that isn't the problem with the two buffers above since they are both
> still on the bufq. the comment says this special hidey hole for one buffer
> is intended to keep the order of I/Os the same, but that isn't needed for
> correctness, and doing that in the dk layer rather than in the bufq layer
> seems poor. if we want a way to undo a bufq_get() rather than just putting
> the buffer back on the bufq with bufq_put(), then we should add a new API
> for that rather than making callers deal with it.
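> for comparison, a rough sketch of what dk_start()'s EAGAIN handling
> could look like if it simply gave the rejected buffer back to the bufq
> (this is not what dksubr.c does today, and the d_diskstart call is
> shown simplified):
>
>         while ((bp = bufq_get(dksc->sc_bufq)) != NULL) {
>                 error = d_diskstart(dksc->sc_dev, bp);  /* driver hook */
>                 if (error == EAGAIN) {
>                         /* undo the bufq_get(): hand the buffer back
>                          * and stop; I/O order may change, which is
>                          * harmless for correctness */
>                         bufq_put(dksc->sc_bufq, bp);
>                         break;
>                 }
>         }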
>
> (for nvme especially it would be better to not require putting all buffers
> for a device on a single bufq at all, since that rather reduces the benefit
> of the hardware having multiple hardware queues. but that kind of design
> change is beyond the scope of the issue at hand.)
>
> I'll keep looking for more bugs that could have caused this.
>
> -Chuck
>