NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/60144: virtio(4) cache coherence issue



OK, here is a candidate patch to address this issue.  Most of what
needs review is the commit message, but it would be good to test
whether it empirically addresses the problem too.  (Note that alone it
is probably _not_ enough to fix PR 60182, but it is probably needed
_in addition_ to the patch I already posted there.)

Isaki-san: Can you verify that this patch works?
# HG changeset patch
# User Taylor R Campbell <riastradh%NetBSD.org@localhost>
# Date 1779137898 0
#      Mon May 18 20:58:18 2026 +0000
# Branch trunk
# Node ID 0f2733c1e2e8f9e109ce9e1258023da90fe016f5
# Parent  7c298a30dee7608bf965a5082affaaafadcefa9b
# EXP-Topic riastradh-pr60144-virtiodmasync
virtio(4): Add missing BUS_DMASYNC_PREREAD operations.

And one missing BUS_DMASYNC_POSTREAD operation.

With this change, loads from vq->vq_used->flags (which occur
immediately after a transfer is submitted to test whether we need to
kick the host device) is separated by a PREREAD/POSTREAD cycle from
loads from vq->vq_used->idx (which occur when we think a transfer may
have completed, e.g. upon receiving an interrupt, to test whether it
has, in fact, completed).

Additionally, with this change, consecutive loads from
*vq->vq_avail_event are separated by a PREREAD/POSTREAD cycle.

Should fix virtio(4) issues on m68k and other related architectures
like mips and armv<7:

PR kern/60144: virtio(4) cache coherence issue

(And may make what I think is a mere potential performance
improvement on those and other architectures like x86 or aarch64,
because of the change to virtio_postpone_intr.)

Why?

This is tricky, and (despite a remark I made to the contrary in the
PR) I'm still not entirely convinced we have the rules articulated
sufficiently to prove the driver correct with or without the
changes, or to prove a bus_dma implementation correct.

But, based on the current mengarie of bus_dma implementations we
have, the crux of the issue is:

   When we load twice from the same cache line affected by a DMA read
   operation, if the DMA read operation may complete in the interim
   between the two loads, it is necessary to have a PREREAD/POSTREAD
   cycle between the loads -- not just POSTREAD.

For example, in a polling loop:

	buf = alloc_buf()
	bus_dmamap_load(map, buf)

	bus_dmamap_sync(map, 0, offsetof(buf->ready), PREREAD)

	bus_space_write(trigger_dma_at, map->dm_segs[0].ds_addr);

	bus_dmamap_sync(map, 0, offsetof(buf->ready), POSTREAD)

	while (!buf->ready) {
		bus_dmamap_sync(map, 0, offsetof(buf->ready), PREREAD)
		bus_dmamap_sync(map, 0, offsetof(buf->ready), POSTREAD)
	}
	...

The pair of PREREAD/POSTREAD syncs in the body of the loop must
_both_ appear in order for this to work; POSTREAD alone is not
enough.

On some architectures, such as x86, PREREAD/POSTREAD is stronger than
needed.

But on others -- such as m68k, mips, and armv<7 -- POSTREAD alone
isn't enough because it's a noop: these architectures assume that
after PREREAD, the driver doesn't load the memory in question until
the DMA read operation has completed, so flushing cached data in
PREREAD and doing nothing in POSTREAD is enough to ensure that the
driver's load isn't stale.

This scenario isn't really spelled out in the bus_dma(9) man page.
And bus_dma(9) doesn't even have a concept of cache lines.  But, even
outside virtio(4), the issue has already bitten us with real
hardware:

1. USB ehci(4) TDs (transfer descriptors), back when they were packed
   into shared cache lines.  This required BUS_DMA_COHERENT, which is
   -- by the letter of bus_dma(9) -- not supposed to affect
   correctness.

   Packing multiple TDs into a single shared cache line required
   BUS_DMA_COHERENT because the host CPU and the device might
   otherwise attempt to update the same cache line simultaneously,
   leading one party to lose its updates; there was no sequence of
   bus_dmamap_syncs that could make it work.

   This was addressed by splitting the TDs across cache lines:

   https://mail-index.NetBSD.org/source-changes/2024/09/23/msg153483.html

2. Polling loops, such as the following in fxp(4), where the driver
   repeatedly loads from the same DMA buffer in memory, separated by
   bus_dmamap_sync calls, until the DMA read operation is done and
   the CPU loads a different value out of the buffer.

   We don't need to define a notion of `the same cache line' in
   bus_dma(9) for this case -- it's exactly the same location in
   memory every time.  The marked PREREAD was added to fix the driver
   on sgimips O2.

   1858 	FXP_CDCONFIGSYNC(sc, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
   1859
   1860 	/*
   1861 	 * Start the config command/DMA.
   1862 	 */
   1863 	fxp_scb_wait(sc);
   1864 	CSR_WRITE_4(sc, FXP_CSR_SCB_GENERAL, sc->sc_cddma + FXP_CDCONFIGOFF);
   1865 	fxp_scb_cmd(sc, FXP_SCB_COMMAND_CU_START);
   1866 	/* ...and wait for it to complete. */
   1867 	for (i = 1000; i > 0; i--) {
   1868 		FXP_CDCONFIGSYNC(sc,
   1869 		    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
   1870 		status = le16toh(cbp->cb_status);
=> 1871 		FXP_CDCONFIGSYNC(sc, BUS_DMASYNC_PREREAD);
   1872 		if ((status & FXP_CB_STATUS_C) != 0)
   1873 			break;
   1874 		DELAY(1);
   1875 	}

https://nxr.NetBSD.org/xref/src/sys/dev/ic/i82557.c?r=1.162#1858

In the case of virtio(4), we have both of these issues:

1. Once the guest driver has published a transfer by a store to
   vq->vq_avail->idx, it must load vq->vq_used->flags to decide
   whether to kick the host:

   1259 		vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
...
   1275 			vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
   1276 			flags = virtio_rw16(sc, vq->vq_used->flags);
   1277 			if (!(flags & VRING_USED_F_NO_NOTIFY))
   1278 				sc->sc_ops->kick(sc, vq->vq_index);

   Later on, to determine whether the transfer has completed, it must
   load vq->vq_used->idx:

https://nxr.NetBSD.org/xref/src/sys/dev/pci/virtio.c?r=1.84#1259

    600 bool
    601 virtio_vq_is_enqueued(struct virtio_softc *sc, struct virtqueue *vq)
    602 {
...
    609 	vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
    610 	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
    611 		return 0;
...
    613 	return 1;
    614 }

https://nxr.NetBSD.org/xref/src/sys/dev/pci/virtio.c?r=1.84#600

    But vq->vq_used->flags and vq->vq_used->idx are 16-bit integers
    in a likely-aligned buffer, so they are almost certain to share a
    cache line:

    177 struct vring_used {
    178 	uint16_t flags;
    179 	uint16_t idx;

https://nxr.NetBSD.org/xref/src/sys/dev/pci/virtioreg.h?r=1.15#177

   Fortunately, the guest driver and the host device are not competing
   to update the cache line at the same time, so to fix virtio(4) on
   the affected architectures, it is just a matter of issuing a
   PREREAD/POSTREAD cycle between the load of vq->vq_used->flags and
   vq->vq_used->idx, not just POSTREAD.

2. Virtio interrupt handlers use virtio_vq_is_enqueued to test
   whether a transfer on a virtqueue has completed.  If the interrupt
   line is shared with other devices, a driver might test
   virtio_vq_is_enqueued several times in a row from unrelated
   interrupts before it returns true, and this boils down to:

	*interrupt*
	POSTREAD
	load vq->vq_used->idx
	*interrupt*
	POSTREAD
	load vq->vq_used->idx
	*interrupt*
	POSTREAD
	load vq->vq_used->idx
	...

   On some architectures, since POSTREAD does nothing, the first
   value of vq->vq_used->idx may remain cached through the subsequent
   loads.

Note: this is related to, and probably necessary for, PR kern/60182:
ld@virtio sometimes hangs up.  But there is also another issue there
with missing POSTREAD in a loop inside ld(4) -- not just PREREAD.  To
be fixed in a separate commit.

diff -r 7c298a30dee7 -r 0f2733c1e2e8 sys/dev/pci/virtio.c
--- a/sys/dev/pci/virtio.c	Sun May 17 01:31:04 2026 +0000
+++ b/sys/dev/pci/virtio.c	Mon May 18 20:58:18 2026 +0000
@@ -607,8 +607,10 @@ virtio_vq_is_enqueued(struct virtio_soft
 	}
 
 	vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
-	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
+	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx)) {
+		vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD);
 		return 0;
+	}
 	vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);
 	return 1;
 }
@@ -629,8 +631,10 @@ virtio_postpone_intr(struct virtio_softc
 	vq_sync_aring_used(vq->vq_owner, vq, BUS_DMASYNC_PREWRITE);
 	vq->vq_queued++;
 
+	vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
 	nused = (uint16_t)
 	    (virtio_rw16(sc, vq->vq_used->idx) - vq->vq_used_idx);
+	vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD);
 	KASSERT(nused <= vq->vq_num);
 
 	return nslots < nused;
@@ -717,8 +721,10 @@ virtio_start_vq_intr(struct virtio_softc
 	paravirt_membar_sync();
 
 	vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
-	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
+	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx)) {
+		vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD);
 		return 0;
+	}
 	vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);
 	return 1;
 }
@@ -1269,11 +1275,13 @@ notify:
 		if (sc->sc_active_features & VIRTIO_F_RING_EVENT_IDX) {
 			vq_sync_uring_avail(sc, vq, BUS_DMASYNC_POSTREAD);
 			t = virtio_rw16(sc, *vq->vq_avail_event) + 1;
+			vq_sync_uring_avail(sc, vq, BUS_DMASYNC_PREREAD);
 			if ((uint16_t) (n - t) < (uint16_t) (n - o))
 				sc->sc_ops->kick(sc, vq->vq_index);
 		} else {
 			vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
 			flags = virtio_rw16(sc, vq->vq_used->flags);
+			vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD);
 			if (!(flags & VRING_USED_F_NO_NOTIFY))
 				sc->sc_ops->kick(sc, vq->vq_index);
 		}


Home | Main Index | Thread Index | Old Index