Re: kern/59618: occasional virtio block device lock ups/hangs
The following reply was made to PR kern/59618; it has been noted by GNATS.
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: cmeerw%cmeerw.org@localhost
Cc: gnats-bugs%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost, thorpej%NetBSD.org@localhost
Subject: Re: kern/59618: occasional virtio block device lock ups/hangs
Date: Sat, 30 Aug 2025 14:08:19 +0000
tl;dr: We need a machine-independent store-before-load barrier for
virtual DMA rings. Not membar_sync, because that's a noop on a single
vCPU and we may be coordinating with a host on another pCPU.
We could make it specific to virtio, but it's also needed by hyperv
(where we abuse membar_sync) and Xen (where it's called xen_mb), so
bus_dma or some other part of the kernel API might be a better place.
> Date: Sat, 30 Aug 2025 11:58:38 +0000 (UTC)
> From: Christof Meerwald <cmeerw%cmeerw.org@localhost>
> 
> In virtio.c we essentially have
> 
> vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
> vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
> 
> vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
> flags = virtio_rw16(sc, vq->vq_used->flags);
> 
> where the BUS_DMASYNC_PREWRITE is an sfence and the BUS_DMASYNC_POSTREAD
> is an lfence, so we have:
> 
> vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
> x86_sfence();
> 
> x86_lfence();
> flags = virtio_rw16(sc, vq->vq_used->flags);
> 
> And https://stackoverflow.com/a/50322404 argues that the store and
> load can be reordered here, and this appears to be exactly what I am
> seeing.
> 
> [...]
> >Fix:
> Add a full memory fence between the "vq->vq_avail_idx" store and the
> "vq->vq_used->flags" load
I think this analysis is right (likewise for the
VIRTIO_F_RING_EVENT_IDX branch), and I'm embarrassed to have missed it
when I did a virtio(4) membar audit three years ago!
https://mail-index.netbsd.org/source-changes/2022/08/12/msg140222.html
A better reference is:
AMD64 Architecture Programmer's Manual, Volume 2: System Programming,
24593--Rev. 3.38--November 2021, Sec. 7.4.2 Memory Barrier Interaction
with Memory Types, Table 7-3, p. 196.
https://web.archive.org/web/20220625040004/https://www.amd.com/system/files/TechDocs/24593.pdf#page=256
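For concreteness, the fix at that spot would look something like this
on x86 (my sketch, not a committed change; the machine-independent
spelling is the open question below):

        vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
        vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
        /*
         * Store-before-load: on x86 this has to be MFENCE (or a
         * LOCKed read/modify/write); SFENCE followed by LFENCE does
         * not order an earlier store against a later load.
         */
        x86_mfence();
        vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
        flags = virtio_rw16(sc, vq->vq_used->flags);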
The same problem likely appears in virtio_start_vq_intr:
    696     if (sc->sc_active_features & VIRTIO_F_RING_EVENT_IDX) {
            ...
    702         *vq->vq_used_event = virtio_rw16(sc, vq->vq_used_idx);
    703         vq_sync_aring_used(sc, vq, BUS_DMASYNC_PREWRITE);
    704     } else {
    705         vq->vq_avail->flags &=
    706             ~virtio_rw16(sc, VRING_AVAIL_F_NO_INTERRUPT);
    707         vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
    708     }
            ...
 => 711     vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
    712     if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
    713         return 0;
    714     vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);
https://nxr.netbsd.org/xref/src/sys/dev/pci/virtio.c?r=1.83#692
The store to *vq->vq_used_event or vq->vq_avail->flags must be ordered
before the load from vq->vq_used->idx.
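Concretely, the shape of the fix there would be something like the
following (sketch only; virtio_membar_sync() is a primitive we would
still have to introduce, see below):

        vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
        /*
         * Hypothetical store-before-load barrier; must not be elided
         * on a uniprocessor guest.
         */
        virtio_membar_sync();
        vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
        if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
                return 0;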
Unfortunately, bus_dma doesn't have a machine-independent concept of a
store-before-load barrier, which is generally needed for idioms of the
form
        ring->cpu_producer_index++;
        ...
        if (!ring->device_consumer_is_running)
                notify_device();

or

        ring->cpu_consuming_interrupts = true;
        ...
        if (ring->device_interrupts_pending)
                handle_interrupts();
It is _not_ correct to insert membar_sync in the ellipsis, because the
guest may be running with a single _virtual_ CPU and patch away the
barrier inside membar_sync, while the host may be running on another
_physical_ CPU, so the virtio driver still requires the barrier.
We really need to create a different machine-independent abstraction
for this. We could create a virtio_membar_sync() like OpenBSD uses,
but I suspect this may be relevant beyond virtio(4), so perhaps it
should live in bus_dma or elsewhere in the kernel API. In some
bus_dma tags it might be elidable, if the memory is mapped uncached
by those tags.
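As a strawman (my sketch only; the name and the per-port definitions
would need review), such a primitive could start out as:

/*
 * Order all of the guest's prior stores to a shared ring before its
 * subsequent loads from it, even against a peer (device/host) running
 * on another physical CPU.  Unlike membar_sync(), this must never be
 * patched away on a uniprocessor guest.
 */
#if defined(__x86_64__) || defined(__i386__)
#define virtio_membar_sync()    x86_mfence()    /* unconditional MFENCE */
#else
/* XXX wrong on any port that patches membar_sync out on uniprocessor boots */
#define virtio_membar_sync()    membar_sync()
#endif

If it lands in bus_dma instead, a per-tag hook would let a tag that
maps the ring uncached turn it into a no-op, as suggested above.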
For Xen we have the xen_mb macro, used in the corresponding ring
producer/consumer logic (but no bus_dma):
    295 #define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {          \
    296     RING_IDX __old = (_r)->sring->req_prod;                             \
    297     RING_IDX __new = (_r)->req_prod_pvt;                                \
    298     xen_wmb(); /* back sees requests /before/ updated producer index */ \
    299     (_r)->sring->req_prod = __new;                                      \
    300     xen_mb(); /* back sees new requests /before/ we check req_event */  \
    301     (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <           \
    302                  (RING_IDX)(__new - __old));                            \
    303 } while (0)
https://nxr.netbsd.org/xref/src/sys/external/mit/xen-include-public/dist/xen/include/public/io/ring.h?r=1.1#295
For hyperv, we are incorrectly using membar_sync (but also no
bus_dma):
    834     msg->msg_type = HYPERV_MSGTYPE_NONE;
    835     membar_sync();
    836     if (msg->msg_flags & VMBUS_MSGFLAG_PENDING)
    837         hyperv_send_eom();
https://nxr.netbsd.org/xref/src/sys/dev/hyperv/vmbus.c?r=1.19#835
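Whatever name we pick, the vmbus(4) fix would look roughly like this
(sketch; spelled with x86_mfence() here on the assumption that our
Hyper-V support is x86-only):

        msg->msg_type = HYPERV_MSGTYPE_NONE;
        /*
         * Store-before-load, unconditionally, instead of the
         * hot-patchable membar_sync().
         */
        x86_mfence();
        if (msg->msg_flags & VMBUS_MSGFLAG_PENDING)
                hyperv_send_eom();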