NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/60144: virtio(4) cache coherence issue



On 11/04/2026 03:35, Taylor R Campbell wrote:
>> Date: Fri, 10 Apr 2026 04:48:32 -0700
>> From: Jason Thorpe <thorpej%me.com@localhost>
>> 
>> Taylor, what's your take on the situation?
> I don't know any details about m68k caches (let alone 68030 vs 68040).


From my skimming of the conversation m68k caches (and the m68k bus_dma
implementation) behave in the same way as other architectures, e.g.
pre-armv7


> There has to be a simple usage rule for the bus_dmamap_sync API
> contract, and the virtio author can't have to think about m68k cache
> semantics to apply the rule.
> Let's try to isolate the relevant logic in this case.  There's one
> virtqueue (struct ld_virtio_softc::sc_vq), and only the `used' ring is
> relevant (struct virtqueue::vq_used), documented at
> https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-540008
> which gives the following procedure:
>> If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
>> 
>>   - The driver MUST ignore the avail_event value.
>>   - After the driver writes a descriptor index into the available ring:
>>       o If flags is 1, the driver SHOULD NOT send a notification.
>>       o If flags is 0, the driver MUST send a notification.
> The flags live at vq->vq_uring->flags, and 1 is spelled
> VRING_USED_F_NO_NOTIFY in our code.
> After an interrupt arrives, we check vq->vq_uring->idx to see if it
> has completed, and the layout is:
> struct vring_used {
>         uint16_t flags;
>         uint16_t idx;
>         struct vring_used_elem ring[];
> 	/* trailed by uint16_t avail_event when VIRTIO_F_RING_EVENT_IDX */
> } __packed;
> https://nxr.NetBSD.org/xref/src/sys/dev/pci/virtioreg.h?r=1.15#177
> The vring_used structure is written only by the host (`device', in the
> language of the virtio spec); the guest (`driver') only reads it.
> We can stipulate for this analysis that the flags and idx members live
> in the same cache line.
> The logic for the guest to initiate a transfer is something like
> (eliding all syncs _except_ for the vring_used ones):
> 1. prepare transfer
> 2. bus_dmamap_sync(vq_used, PREREAD)
> 3. start transfer
> 4. bus_dmamap_sync(vq_used, POSTREAD)
> 5. read vq_used->flags to see if kick needed
> 6. (left intentionally blank)
> 7. interrupt handler called
> 8. bus_dmamap_sync(vq_used, POSTREAD)
> 9. read vq_used->idx to see if completed
> 	/*
> 	 * 1. prepare transfer
> 	 */
> 	bus_dmamap_load();
> 	vq->vq_avail->ring[vq->vq_avail_idx++] = slot;
> 	/*
> 	 * 2. host device will return result in vq->vq_used, so
> 	 *    prepare to read
> 	 */
> 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_PREREAD);
> 	/*
> 	 * 3. start transfer -- host device may, at any time after
> 	 *    this, start to write data into vq->vq_used (may require
> 	 *    a kick, but may not); guest driver must not read
> 	 *    vq->vq_used->flags before this point
> 	 */
> 	vq->vq_avail->idx = vq->vq_avail_idx;
> 	/*
> 	 * 4. we're going to read vq->vq_used->flags, but since we
> 	 *    started a DMA transfer into vq->vq_used, we can't read
> 	 *    the result until after BUS_DMASYNC_POSTREAD
> 	 */
> 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_POSTREAD);
> 	/*
> 	 * 5. check vq->vq_used->flags to see if a kick is required
> 	 */
> 	flags = vq->vq_used->flags;
> 	/* 6. (left intentionally blank) */
> 	if (!(flags & VRING_USED_F_NO_NOTIFY))
> 		kick();
> 	/*
> 	 * 7. time passes and an interrupt handler is called
> 	 */
> 	/*
> 	 * 8. we're going to read vq->vq_used->idx, but since we
> 	 *    started a DMA transfer into vq->vq_used, we can't read
> 	 *    the result until after BUS_DMASYNC_POSTREAD
> 	 */
> 	bus_dmamap_sync(vq->vq_used, BUS_DMASYNC_POSTREAD);
> 	/*
> 	 * 9. check vq->vq_used->idx to see if completed
> 	 */
> 	if (vq->vq_used->idx != vq->vq_used_idx)
> 		process_result();
> I think, from the bus_dma(9) documentation, this logic is essentially
> correct.  It follows the ordering rules:
> (a) Between preparation at (1) and start of transfer at (3), there is a
>     BUS_DMASYNC_PREREAD at (3).
> (b) Between a potential change of state (start of transfer) at (3) and
>     use of vq->vq_used->flags in the transfer buffer vq->vq_used at
>     (5), there is a BUS_DMASYNC_POSTREAD at (4).
>     => (Actually, this one is even stronger: there is a
>        paravirt_membar_sync between (3) and (5), which is a barrier
>        that doesn't even really exist in the m68k architecture -- it
>        is necessary for store-before-load ordering on a multiprocessor
>        host, even if the guest is running with a single vCPU, so
>        behind the scenes on an x86 host it has to be MFENCE or LOCK
>        ADD or similar.  But none of that is relevant here.)
> (c) Between a potential change of state (interrupt delivery) at (7)
>     and use of vq->vq_used->idx in the transfer buffer vq->vq_used at
>     (9), there is a BUS_DMASYNC_POSTREAD at (8).
> But there's a snag: the results of the DMA operation might come
> piecemeal over time, with the host device _first_ putting data for the
> guest driver to read in vq->vq_used->flags, and _later_ putting data
> for the guest driver to read in vq->vq_used->idx.
> So that raises the key question:
> 	If we POSTREAD and read _one part_ of the transfer buffer at
> 	one time, and POSTREAD and read _another part_ of it at
> 	another time, do we have to do a PREREAD in the interim at (6)
> 	because between the two reads is essentially a new DMA
> 	operation?

I think the answer is yes.


> The question doesn't just apply here -- it also applies to the polling
> scenario (like PR kern/60182: ld@virtio sometimes hangs up):
> 	bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
> 	trigger_dma();
> 	for (;;) {
> 		bus_dmamap_sync(buf, BUS_DMASYNC_POSTREAD);
> 		if (buf->ready)
> 			break;
> 	}
> 	use(buf->data);
> Does this need to be rewritten with another PREREAD for every
> iteration?
> 	bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
> 	trigger_dma();
> 	for (;;) {
> 		bus_dmamap_sync(buf, BUS_DMASYNC_POSTREAD);
> 		if (buf->ready)
> 			break;
> 		bus_dmamap_sync(buf, BUS_DMASYNC_PREREAD);
> 	}
> 	use(buf->data);

So we need this...

It’s also required in the coherent case so we get the appropriate memory barriers, right?


> If the answer is YES, then virtio(4) is broken and we need to sprinkle
> some vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD) into both
> virtio_enqueue_commit and virtio_vq_is_enqueued (partly because it is
> used for polling, partly because it might share an interrupt line and
> so might legitimately called several times owing to other devices'
> interrupts before it returns true).
> If the answer is NO, then the 68030 bus_dmamap_sync(POSTREAD) is
> broken, and flushing the cache in PREREAD is not enough.




Home | Main Index | Thread Index | Old Index