NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/60144: virtio(4) cache coherence issue



The following reply was made to PR kern/60144; it has been noted by GNATS.

From: Nick Hudson <nick.hudson%gmx.co.uk@localhost>
To: Taylor R Campbell <riastradh%NetBSD.org@localhost>,
 Jason Thorpe <thorpej%me.com@localhost>
Cc: Tetsuya Isaki <isaki%pastel-flower.jp@localhost>,
 gnats-bugs%netbsd.org@localhost,
 kern-bug-people%netbsd.org@localhost,
 netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/60144: virtio(4) cache coherence issue
Date: Sat, 11 Apr 2026 08:03:43 +0100

 On 11/04/2026 03:35, Taylor R Campbell wrote:
 >> Date: Fri, 10 Apr 2026 04:48:32 -0700
 >> From: Jason Thorpe <thorpej%me.com@localhost>
 >>=20
 >> Taylor, what's your take on the situation?
 > I don't know any details about m68k caches (let alone 68030 vs 68040).
 
 
 =46rom my skimming of the conversation m68k caches (and the m68k bus_dma
 implementation) behave in the same way as other architectures, e.g.
 pre-armv7
 
 
 > There has to be a simple usage rule for the bus_dmamap_sync API
 > contract, and the virtio author can't have to think about m68k cache
 > semantics to apply the rule.
 > Let's try to isolate the relevant logic in this case.  There's one
 > virtqueue (struct ld_virtio_softc::sc_vq), and only the `used' ring is
 > relevant (struct virtqueue::vq_used), documented at
 > =
 https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.htm=
 l#x1-540008
 > which gives the following procedure:
 >> If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
 >>=20
 >>   - The driver MUST ignore the avail_event value.
 >>   - After the driver writes a descriptor index into the available =
 ring:
 >>       o If flags is 1, the driver SHOULD NOT send a notification.
 >>       o If flags is 0, the driver MUST send a notification.
 > The flags live at vq->vq_uring->flags, and 1 is spelled
 > VRING_USED_F_NO_NOTIFY in our code.
 > After an interrupt arrives, we check vq->vq_uring->idx to see if it
 > has completed, and the layout is:
 > struct vring_used {
 >         uint16_t flags;
 >         uint16_t idx;
 >         struct vring_used_elem ring[];
 > 	/* trailed by uint16_t avail_event when VIRTIO_F_RING_EVENT_IDX =
 */
 > } __packed;
 > https://nxr.NetBSD.org/xref/src/sys/dev/pci/virtioreg.h?r=3D1.15#177
 > The vring_used structure is written only by the host (`device', in the
 > language of the virtio spec); the guest (`driver') only reads it.
 > We can stipulate for this analysis that the flags and idx members live
 > in the same cache line.
 > The logic for the guest to initiate a transfer is something like
 > (eliding all syncs _except_ for the vring_used ones):
 > 1. prepare transfer
 > 2. bus_dmamap_sync(vq_used, PREREAD)
 > 3. start transfer
 > 4. bus_dmamap_sync(vq_used, POSTREAD)
 > 5. read vq_used->flags to see if kick needed
 > 6. (left intentionally blank)
 > 7. interrupt handler called
 > 8. bus_dmamap_sync(vq_used, POSTREAD)
 > 9. read vq_used->idx to see if completed
 > 	/*
 > 	 * 1. prepare transfer
 > 	 */
 > 	bus_dmamap_load();
 > 	vq->vq_avail->ring[vq->vq_avail_idx++] =3D slot;
 > 	/*
 > 	 * 2. host device will return result in vq->vq_used, so
 > 	 *    prepare to read
 > 	 */
 > 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_PREREAD);
 > 	/*
 > 	 * 3. start transfer -- host device may, at any time after
 > 	 *    this, start to write data into vq->vq_used (may require
 > 	 *    a kick, but may not); guest driver must not read
 > 	 *    vq->vq_used->flags before this point
 > 	 */
 > 	vq->vq_avail->idx =3D vq->vq_avail_idx;
 > 	/*
 > 	 * 4. we're going to read vq->vq_used->flags, but since we
 > 	 *    started a DMA transfer into vq->vq_used, we can't read
 > 	 *    the result until after BUS_DMASYNC_POSTREAD
 > 	 */
 > 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_POSTREAD);
 > 	/*
 > 	 * 5. check vq->vq_used->flags to see if a kick is required
 > 	 */
 > 	flags =3D vq->vq_used->flags;
 > 	/* 6. (left intentionally blank) */
 > 	if (!(flags & VRING_USED_F_NO_NOTIFY))
 > 		kick();
 > 	/*
 > 	 * 7. time passes and an interrupt handler is called
 > 	 */
 > 	/*
 > 	 * 8. we're going to read vq->vq_used->idx, but since we
 > 	 *    started a DMA transfer into vq->vq_used, we can't read
 > 	 *    the result until after BUS_DMASYNC_POSTREAD
 > 	 */
 > 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_POSTREAD);
 > 	/*
 > 	 * 9. check vq->vq_used->idx to see if completed
 > 	 */
 > 	if (vq->vq_used->idx !=3D vq->vq_used_idx)
 > 		process_result();
 > I think, from the bus_dma(9) documentation, this logic is essentially
 > correct.  It follows the ordering rules:
 > (a) Between preparation at (1) and start of transfer at (3), there is =
 a
 >     BUS_DMASYNC_PREREAD at (3).
 > (b) Between a potential change of state (start of transfer) at (3) and
 >     use of vq->vq_used->flags in the transfer buffer vq->vq_used at
 >     (5), there is a BUS_DMASYNC_POSTREAD at (4).
 >     =3D> (Actually, this one is even stronger: there is a
 >        paravirt_membar_sync between (3) and (5), which is a barrier
 >        that doesn't even really exist in the m68k architecture -- it
 >        is necessary for store-before-load ordering on a multiprocessor
 >        host, even if the guest is running with a single vCPU, so
 >        behind the scenes on an x86 host it has to be MFENCE or LOCK
 >        ADD or similar.  But none of that is relevant here.)
 > (c) Between a potential change of state (interrupt delivery) at (7)
 >     and use of vq->vq_used->idx in the transfer buffer vq->vq_used at
 >     (9), there is a BUS_DMASYNC_POSTREAD at (8).
 > But there's a snag: the results of the DMA operation might come
 > piecemeal over time, with the host device _first_ putting data for the
 > guest driver to read in vq->vq_used->flags, and _later_ putting data
 > for the guest driver to read in vq->vq_used->idx.
 > So that raises the key question:
 > 	If we POSTREAD and read _one part_ of the transfer buffer at
 > 	one time, and POSTREAD and read _another part_ of it at
 > 	another time, do we have to do a PREREAD in the interim at (6)
 > 	because between the two reads is essentially a new DMA
 > 	operation?
 
 I think the answer is yes.
 
 
 > The question doesn't just apply here -- it also applies to the polling
 > scenario (like PR kern/60182: ld@virtio sometimes hangs up):
 > 	bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
 > 	trigger_dma();
 > 	for (;;) {
 > 		bus_dmamap_sync(buf, BUS_DMASYNC_POSTREAD);
 > 		if (buf->ready)
 > 			break;
 > 	}
 > 	use(buf->data);
 > Does this need to be rewritten with another PREREAD for every
 > iteration?
 > 	bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
 > 	trigger_dma();
 > 	for (;;) {
 > 		bus_dmamap_sync(buf, BUS_DMASYNC_POSTREAD);
 > 		if (buf->ready)
 > 			break;
 > 		bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
 > 	}
 > 	use(buf->data);
 
 So we need this...
 
 It=E2=80=99s also required in the coherent case so we get the =
 appropriate memory barriers, right?
 
 
 > If the answer is YES, then virtio(4) is broken and we need to sprinkle
 > some vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD) into both
 > virtio_enqueue_commit and virtio_vq_is_enqueued (partly because it is
 > used for polling, partly because it might share an interrupt line and
 > so might legitimately called several times owing to other devices'
 > interrupts before it returns true).
 > If the answer is NO, then the 68030 bus_dmamap_sync(POSTREAD) is
 > broken, and flushing the cache in PREREAD is not enough.
 



Home | Main Index | Thread Index | Old Index