NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/60144: virtio(4) cache coherence issue



I talked it over with Nick on ICB and he convinced me that this
question about

	buf = alloc()
	PREREAD
	trigger DMA
	interrupt
	POSTREAD
	if (buf->ready) {
		???
		use(buf->data)
	}

is not quite analogous, though a polling loop around buf->ready may
be analogous.  We had a similar issue with ehci on armv5:

https://mail-index.NetBSD.org/source-changes-d/2025/09/29/msg014539.html

Here's the virtio(4) logic in question:

/* virtio_enqueue_commit */
   1255 		vq_sync_uring_all(sc, vq, BUS_DMASYNC_PREREAD);
...
   1259 		vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
...
   1275 			vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
   1276 			flags = virtio_rw16(sc, vq->vq_used->flags);
   1277 			if (!(flags & VRING_USED_F_NO_NOTIFY))
   1278 				sc->sc_ops->kick(sc, vq->vq_index);
...

https://nxr.netbsd.org/xref/src/sys/dev/pci/virtio.c?r=1.84#1247

/* virtio_vq_is_enqueued */
    609 	vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
    610 	if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
    611 		return 0;
    612 	vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);

https://nxr.netbsd.org/xref/src/sys/dev/pci/virtio.c?r=1.84#609

The trouble is that the CPU issues an extra load at line 1276 _before
the DMA operation may have actually been triggered_ at line 1278, and,
more subtly, the order of operations by the virtio host device isn't
quite the same in the two places where there's a (correct) POSTREAD
barrier.

For normal buf->ready / buf->data access, the protocol is more or
less:

host device model
-----------------
H1. store buf->data
H2. (barrier)
H3. set buf->ready := 1

guest driver
------------
G1. test buf->ready
G2. POSTREAD
G3. load buf->data

And that guest driver logic is what happens in virtio_vq_is_enqueued,
with a POSTREAD between testing buf->ready (vq->vq_used->idx) and
using the data (uring payload).  So that part is fine.

What's different with flags/idx is that the host device _can't_ start
the DMA operation before setting flags.  So the full host side looks
like:

virtio host device
------------------
H1'. set vq->vq_used->flags appropriately
H2'. store-before-load barrier
H3'. load vq->vq_avail->idx
H4'. DMA operation to fill uring payload
H5'. (barrier)
H6'. store vq->vq_used->idx

Note that the ordering here is reversed: while the simplified host
device model separates H1 (store buf->data) by some barrier from H3
(set buf->ready) _in that order_, the virtio host device sets
vq->vq_used->flags before it can possibly even start the DMA
operation.  (Then H4'/H5'/H6' correspond to the simplified host device
model's H1/H2/H3.)

So I'm convinced that our guest virtio(4) driver should issue another
PREREAD barrier between loading vq->vq_used->flags and either
(a) kicking the DMA, or
(b) loading vq->vq_used->idx to check for DMA completion.

--- src/sys/dev/pci/virtio.c
+++ src/sys/dev/pci/virtio.c
@@ -1274,6 +1274,7 @@
 		} else {
 			vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
 			flags = virtio_rw16(sc, vq->vq_used->flags);
+			vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD);
 			if (!(flags & VRING_USED_F_NO_NOTIFY))
 				sc->sc_ops->kick(sc, vq->vq_index);
 		}

What I'm still not sure about -- until further thought after a lunch
barrier at least -- is:

1. Does a _polling loop_ require additional PREREADs?

	PREREAD;
	trigger DMA;
	POSTREAD;
	while (!buf->ready) {
		PREREAD;	// <--- ???
		POSTREAD;
	}
	POSTREAD;
	use(buf->data);

   And, can the rule about this be phrased in terms of higher-level
   semantics of DMA operations than details of the 68030 cache
   implementation?

2. Does the VIRTIO_F_RING_EVENT_IDX case require PREREAD too?

--- src/sys/dev/pci/virtio.c
+++ src/sys/dev/pci/virtio.c
@@ -1269,6 +1269,7 @@
 		if (sc->sc_active_features & VIRTIO_F_RING_EVENT_IDX) {
 			vq_sync_uring_avail(sc, vq, BUS_DMASYNC_POSTREAD);
 			t = virtio_rw16(sc, *vq->vq_avail_event) + 1;
+			vq_sync_uring_avail(sc, vq, BUS_DMASYNC_PREREAD);
 			if ((uint16_t) (n - t) < (uint16_t) (n - o))
 				sc->sc_ops->kick(sc, vq->vq_index);
 		} else {



Home | Main Index | Thread Index | Old Index