NetBSD-Bugs archive


kern/60144: virtio(4) cache coherence issue



>Number:         60144
>Category:       kern
>Synopsis:       virtio(4) cache coherence issue
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 30 02:55:00 +0000 2026
>Originator:     Tetsuya Isaki
>Release:        NetBSD/virt68k 11.0_RC2
>Organization:
>Environment:
NetBSD 11.0_RC2 virt68k
>Description:
The kernel sometimes hangs with "Spurious interrupt on CPU ipl 5"
when accessing ld@virtio on virt68k.

This message means "an interrupt occurred, but no handler claimed it".
Here, ipl 5 is the interrupt level assigned to virtio (it may be
shared with other devices).

At least in virtio_vq_is_enqueued(), I suspect that cache line
invalidation is insufficient.

 sys/dev/pci/virtio.c:
    601 virtio_vq_is_enqueued(struct virtio_softc *sc, struct virtqueue *vq)
    602 {
    603
    604     if (vq->vq_queued) {
    605         vq->vq_queued = 0;
    606         vq_sync_aring_all(sc, vq, BUS_DMASYNC_POSTWRITE);
    607     }
    608
    609     vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
    610     if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
    611         return 0;
    612     vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);
    613     return 1;
    614 }

The virtio device incremented its own index and wrote it to
vq->vq_used->idx.  But when the interrupt was lost, the data cache
(un)luckily still held the previous vq->vq_used->idx, so the CPU
read the stale value from the cache(!).
The previous vq->vq_used->idx is equal to vq->vq_used_idx, so the
function returned 0 (which means the vq is empty), even though the
device had signalled that entries were enqueued.

I think that vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD) is
needed to invalidate the cache line before reading a fresh
vq->vq_used->idx (at line 610).
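
To illustrate the suspected mechanism, here is a toy user-space model.
It is only a sketch: it is not kernel code and does not use the
bus_dma(9) API.  struct toy_mem, cpu_read(), dev_dma_write(),
cache_invalidate(), toy_is_enqueued() and toy_demo() are all
hypothetical names made up for this example, and the one-entry cache
is an assumption that ignores real cache geometry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one 16-bit word of RAM plus a one-entry data cache. */
struct toy_mem {
    uint16_t ram;         /* what the device sees and DMA-writes */
    uint16_t cached;      /* the CPU's cached copy of the same word */
    bool     cache_valid; /* true while the cached copy may be used */
};

/* CPU load: hits the cache when valid, otherwise refills from RAM. */
static uint16_t cpu_read(struct toy_mem *m)
{
    if (!m->cache_valid) {
        m->cached = m->ram;
        m->cache_valid = true;
    }
    return m->cached;
}

/* Device DMA write: updates RAM only; the cache is not snooped. */
static void dev_dma_write(struct toy_mem *m, uint16_t v)
{
    m->ram = v;
}

/* Stand-in for a cache-line invalidate before the CPU reads. */
static void cache_invalidate(struct toy_mem *m)
{
    m->cache_valid = false;
}

/*
 * Mirrors the comparison at line 610: returns 1 if the device index
 * differs from the index the driver last saw, 0 otherwise.
 */
static int toy_is_enqueued(struct toy_mem *used_idx, uint16_t seen_idx,
    bool invalidate_first)
{
    if (invalidate_first)
        cache_invalidate(used_idx);
    return seen_idx != cpu_read(used_idx);
}

/* Replays the scenario described in this report. */
void toy_demo(void)
{
    struct toy_mem used_idx = { 0, 0, false };
    uint16_t seen_idx = cpu_read(&used_idx); /* idx 0 is now cached */

    dev_dma_write(&used_idx, 1);             /* device completes a request */

    /* Stale cache line: the queue looks empty. */
    assert(toy_is_enqueued(&used_idx, seen_idx, false) == 0);
    /* Invalidate first: the new index becomes visible. */
    assert(toy_is_enqueued(&used_idx, seen_idx, true) == 1);
}
```

In this model the first check returns 0 although the device has
already advanced its index, which matches the "no handlers took it"
symptom; only the check that invalidates before reading returns 1.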

This is the only case that I was able to observe by tracing on an
emulator, but many other places look similar.

The following four observations also support this hypothesis.
- qemu (68040, data cache not implemented): could not reproduce.
- nono (68030, data cache implemented): could reproduce.
- nono (68040, data cache not implemented yet): could not reproduce.
- nono (68030, data cache forcibly disabled): could not reproduce.

>How-To-Repeat:
Boot NetBSD/virt68k on an emulator that implements a data cache.
Access ld@virtio.  It is not 100% reproducible.

If something else updates the same cache line, this problem will not
reproduce.  When it is reproducible, I typically hit it within
20 iterations of the following command pair.

 # mount /dev/ld1e /mnt; umount /mnt
  :

>Fix:
See above.



