Re: kern/60144: virtio(4) cache coherence issue

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,isaki%pastel-flower.jp@localhost
Subject: Re: kern/60144: virtio(4) cache coherence issue
From: "Taylor R Campbell via gnats" <gnats-admin%NetBSD.org@localhost>
Date: Sat, 11 Apr 2026 02:40:02 +0000 (UTC)

The following reply was made to PR kern/60144; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: Jason Thorpe <thorpej%me.com@localhost>
Cc: Tetsuya Isaki <isaki%pastel-flower.jp@localhost>, gnats-bugs%netbsd.org@localhost,
	kern-bug-people%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/60144: virtio(4) cache coherence issue
Date: Sat, 11 Apr 2026 02:35:39 +0000

 > Date: Fri, 10 Apr 2026 04:48:32 -0700
 > From: Jason Thorpe <thorpej%me.com@localhost>
 >=20
 > Taylor, what's your take on the situation?

 I don't know any details about m68k caches (let alone 68030 vs 68040).

 There has to be a simple usage rule for the bus_dmamap_sync API
 contract, and the virtio author can't have to think about m68k cache
 semantics to apply the rule.

 Let's try to isolate the relevant logic in this case.  There's one
 virtqueue (struct ld_virtio_softc::sc_vq), and only the `used' ring is
 relevant (struct virtqueue::vq_used), documented at

 https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html=
 #x1-540008

 which gives the following procedure:

 > If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
 >=20
 >   - The driver MUST ignore the avail_event value.
 >   - After the driver writes a descriptor index into the available ring:
 >       o If flags is 1, the driver SHOULD NOT send a notification.
 >       o If flags is 0, the driver MUST send a notification.

 The flags live at vq->vq_uring->flags, and 1 is spelled
 VRING_USED_F_NO_NOTIFY in our code.

 After an interrupt arrives, we check vq->vq_uring->idx to see if it
 has completed, and the layout is:

 struct vring_used {
         uint16_t flags;
         uint16_t idx;
         struct vring_used_elem ring[];
 	/* trailed by uint16_t avail_event when VIRTIO_F_RING_EVENT_IDX */
 } __packed;

 https://nxr.NetBSD.org/xref/src/sys/dev/pci/virtioreg.h?r=3D1.15#177

 The vring_used structure is written only by the host (`device', in the
 language of the virtio spec); the guest (`driver') only reads it.

 We can stipulate for this analysis that the flags and idx members live
 in the same cache line.

 The logic for the guest to initiate a transfer is something like
 (eliding all syncs _except_ for the vring_used ones):

 1. prepare transfer
 2. bus_dmamap_sync(vq_used, PREREAD)
 3. start transfer
 4. bus_dmamap_sync(vq_used, POSTREAD)
 5. read vq_used->flags to see if kick needed
 6. (left intentionally blank)
 7. interrupt handler called
 8. bus_dmamap_sync(vq_used, POSTREAD)
 9. read vq_used->idx to see if completed

 	/*
 	 * 1. prepare transfer
 	 */
 	bus_dmamap_load();
 	vq->vq_avail->ring[vq->vq_avail_idx++] =3D slot;

 	/*
 	 * 2. host device will return result in vq->vq_used, so
 	 *    prepare to read
 	 */
 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_PREREAD);

 	/*
 	 * 3. start transfer -- host device may, at any time after
 	 *    this, start to write data into vq->vq_used (may require
 	 *    a kick, but may not); guest driver must not read
 	 *    vq->vq_used->flags before this point
 	 */
 	vq->vq_avail->idx =3D vq->vq_avail_idx;

 	/*
 	 * 4. we're going to read vq->vq_used->flags, but since we
 	 *    started a DMA transfer into vq->vq_used, we can't read
 	 *    the result until after BUS_DMASYNC_POSTREAD
 	 */
 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_POSTREAD);

 	/*
 	 * 5. check vq->vq_used->flags to see if a kick is required
 	 */
 	flags =3D vq->vq_used->flags;
 	/* 6. (left intentionally blank) */
 	if (!(flags & VRING_USED_F_NO_NOTIFY))
 		kick();

 	/*
 	 * 7. time passes and an interrupt handler is called
 	 */

 	/*
 	 * 8. we're going to read vq->vq_used->idx, but since we
 	 *    started a DMA transfer into vq->vq_used, we can't read
 	 *    the result until after BUS_DMASYNC_POSTREAD
 	 */
 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_POSTREAD);

 	/*
 	 * 9. check vq->vq_used->idx to see if completed
 	 */
 	if (vq->vq_used->idx !=3D vq->vq_used_idx)
 		process_result();

 I think, from the bus_dma(9) documentation, this logic is essentially
 correct.  It follows the ordering rules:

 (a) Between preparation at (1) and start of transfer at (3), there is a
     BUS_DMASYNC_PREREAD at (3).

 (b) Between a potential change of state (start of transfer) at (3) and
     use of vq->vq_used->flags in the transfer buffer vq->vq_used at
     (5), there is a BUS_DMASYNC_POSTREAD at (4).
     =3D> (Actually, this one is even stronger: there is a
        paravirt_membar_sync between (3) and (5), which is a barrier
        that doesn't even really exist in the m68k architecture -- it
        is necessary for store-before-load ordering on a multiprocessor
        host, even if the guest is running with a single vCPU, so
        behind the scenes on an x86 host it has to be MFENCE or LOCK
        ADD or similar.  But none of that is relevant here.)

 (c) Between a potential change of state (interrupt delivery) at (7)
     and use of vq->vq_used->idx in the transfer buffer vq->vq_used at
     (9), there is a BUS_DMASYNC_POSTREAD at (8).

 But there's a snag: the results of the DMA operation might come
 piecemeal over time, with the host device _first_ putting data for the
 guest driver to read in vq->vq_used->flags, and _later_ putting data
 for the guest driver to read in vq->vq_used->idx.

 So that raises the key question:

 	If we POSTREAD and read _one part_ of the transfer buffer at
 	one time, and POSTREAD and read _another part_ of it at
 	another time, do we have to do a PREREAD in the interim at (6)
 	because between the two reads is essentially a new DMA
 	operation?

 The question doesn't just apply here -- it also applies to the polling
 scenario (like PR kern/60182: ld@virtio sometimes hangs up):

 	bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
 	trigger_dma();
 	for (;;) {
 		bus_dmamap_sync(buf, BUS_DMASYNC_POSTREAD);
 		if (buf->ready)
 			break;
 	}
 	use(buf->data);

 Does this need to be rewritten with another PREREAD for every
 iteration?

 	bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
 	trigger_dma();
 	for (;;) {
 		bus_dmamap_sync(buf, BUS_DMASYNC_POSTREAD);
 		if (buf->ready)
 			break;
 		bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
 	}
 	use(buf->data);

 If the answer is YES, then virtio(4) is broken and we need to sprinkle
 some vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD) into both
 virtio_enqueue_commit and virtio_vq_is_enqueued (partly because it is
 used for polling, partly because it might share an interrupt line and
 so might legitimately called several times owing to other devices'
 interrupts before it returns true).

 If the answer is NO, then the 68030 bus_dmamap_sync(POSTREAD) is
 broken, and flushing the cache in PREREAD is not enough.

Prev by Date: Re: kern/60144: virtio(4) cache coherence issue
Next by Date: Re: kern/60144: virtio(4) cache coherence issue
Previous by Thread: Re: kern/60144: virtio(4) cache coherence issue
Next by Thread: Re: kern/60144: virtio(4) cache coherence issue
Indexes:

Home | Main Index | Thread Index | Old Index