NetBSD-Bugs archive


Re: kern/59618: occasional virtio block device lock ups/hangs



The following reply was made to PR kern/59618; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: cmeerw%cmeerw.org@localhost
Cc: gnats-bugs%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost, thorpej%NetBSD.org@localhost
Subject: Re: kern/59618: occasional virtio block device lock ups/hangs
Date: Sat, 30 Aug 2025 14:08:19 +0000

 tl;dr: We need a machine-independent store-before-load barrier for
 virtual DMA rings.  Not membar_sync, because that's a noop on a single
 vCPU and we may be coordinating with a host on another pCPU.
 
 We could make it specific to virtio, but it's also needed by hyperv
 (where we abuse membar_sync) and Xen (where it's called xen_mb), so
 bus_dma or some other part of the kernel API might be a better place.
 
 
 > Date: Sat, 30 Aug 2025 11:58:38 +0000 (UTC)
 > From: Christof Meerwald <cmeerw%cmeerw.org@localhost>
 >
 > In virtio.c we essentially have
 >
 >                 vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
 >                 vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
 >
 >                 vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
 >                 flags = virtio_rw16(sc, vq->vq_used->flags);
 >
 > where the BUS_DMASYNC_PREWRITE is an sfence and the
 > BUS_DMASYNC_POSTREAD is an lfence, so we have:
 >
 >                 vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
 >                 x86_sfence();
 >
 >                 x86_lfence();
 >                 flags = virtio_rw16(sc, vq->vq_used->flags);
 >
 > And https://stackoverflow.com/a/50322404 argues that the store and
 > load can be reordered here, and this appears to be exactly what I am
 > seeing.
 >
 > [...]
 > >Fix:
 > Add a full memory fence between the "vq->vq_avail_idx" store and the
 > "vq->vq_used->flags" load
 
 I think this analysis is right (likewise for the
 VIRTIO_F_RING_EVENT_IDX branch), and I'm embarrassed to have missed it
 when I did a virtio(4) membar audit three years ago!
 
 https://mail-index.netbsd.org/source-changes/2022/08/12/msg140222.html
 
 A better reference is:
 
 AMD64 Architecture Programmer's Manual, Volume 2: System Programming,
 24593--Rev. 3.38--November 2021, Sec. 7.4.2 Memory Barrier Interaction
 with Memory Types, Table 7-3, p. 196.
 https://web.archive.org/web/20220625040004/https://www.amd.com/system/files/TechDocs/24593.pdf#page=256
 
 The same problem likely appears in virtio_start_vq_intr:
 
     696 	if (sc->sc_active_features & VIRTIO_F_RING_EVENT_IDX) {
     ...
     702 		*vq->vq_used_event = virtio_rw16(sc, vq->vq_used_idx);
     703 		vq_sync_aring_used(sc, vq, BUS_DMASYNC_PREWRITE);
     704 	} else {
     705 		vq->vq_avail->flags &=
     706 		    ~virtio_rw16(sc, VRING_AVAIL_F_NO_INTERRUPT);
     707 		vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
     708 	}
     ...
  => 711 	vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
     712 	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
     713 		return 0;
     714 	vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);
 
 https://nxr.netbsd.org/xref/src/sys/dev/pci/virtio.c?r=1.83#692
 
 The store to *vq->vq_used_event or vq->vq_avail->flags must be ordered
 before the load from vq->vq_used->idx.
 
 Unfortunately, bus_dma doesn't have a machine-independent concept of a
 store-before-load barrier, which is generally needed for idioms of the
 form
 
 	ring->cpu_producer_index++;
 	...
 	if (!ring->device_consumer_is_running)
 		notify_device();
 
 or
 
 	ring->cpu_consuming_interrupts = true;
 	...
 	if (ring->device_interrupts_pending)
 		handle_interrupts();
 
 It is _not_ correct to insert membar_sync in the ellipsis, because the
 guest may be running with a single _virtual_ CPU and patch away the
 barrier inside membar_sync, while the host may be running on another
 _physical_ CPU so the virtio driver still requires the barrier.
 
 We really need to create a different machine-independent abstraction
 for this.  We could create a virtio_membar_sync() like OpenBSD's, but
 I suspect this may be relevant beyond virtio(4), so perhaps it should
 live in bus_dma or elsewhere in the kernel API.  In some bus_dma tags
 it might be elidable, if the memory those tags map is uncached.
 
 For Xen we have the xen_mb macro, used in the corresponding ring
 producer/consumer logic (but no bus_dma):
 
     295 #define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {           \
     296     RING_IDX __old = (_r)->sring->req_prod;                             \
     297     RING_IDX __new = (_r)->req_prod_pvt;                                \
     298     xen_wmb(); /* back sees requests /before/ updated producer index */ \
     299     (_r)->sring->req_prod = __new;                                      \
     300     xen_mb(); /* back sees new requests /before/ we check req_event */  \
     301     (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <           \
     302                  (RING_IDX)(__new - __old));                            \
     303 } while (0)
 
 https://nxr.netbsd.org/xref/src/sys/external/mit/xen-include-public/dist/xen/include/public/io/ring.h?r=1.1#295
 
 For hyperv, we are incorrectly using membar_sync (but also no
 bus_dma):
 
     834 		msg->msg_type = HYPERV_MSGTYPE_NONE;
     835 		membar_sync();
     836 		if (msg->msg_flags & VMBUS_MSGFLAG_PENDING)
     837 			hyperv_send_eom();
 
 https://nxr.netbsd.org/xref/src/sys/dev/hyperv/vmbus.c?r=1.19#835
 

