kern/60144: virtio(4) cache coherence issue

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/60144: virtio(4) cache coherence issue
From: "isaki%pastel-flower.jp@localhost via gnats" <gnats-admin%NetBSD.org@localhost>
Date: Mon, 30 Mar 2026 02:55:00 +0000 (UTC)

>Number:         60144
>Category:       kern
>Synopsis:       virtio(4) cache coherence issue
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 30 02:55:00 +0000 2026
>Originator:     Tetsuya Isaki
>Release:        NetBSD/virt68k 11.0_RC2
>Organization:
>Environment:
NetBSD 11.0_RC2 virt68k
>Description:
The kernel sometimes hangs with "Spurious interrupt on CPU ipl 5"
when accessing ld@virtio on virt68k.

This message means that an interrupt occurred but no handler claimed
it.  Here, ipl 5 is the interrupt assigned to virtio (it may be
shared with other devices).

At least in virtio_vq_is_enqueued(), I suspect that cache line
invalidation is insufficient.

 sys/dev/pci/virtio.c:
    601 virtio_vq_is_enqueued(struct virtio_softc *sc, struct virtqueue *vq)
    602 {
    603
    604     if (vq->vq_queued) {
    605         vq->vq_queued = 0;
    606         vq_sync_aring_all(sc, vq, BUS_DMASYNC_POSTWRITE);
    607     }
    608
    609     vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
    610     if (vq->vq_used_idx == virtio_rw16(sc, vq->vq_used->idx))
    611         return 0;
    612     vq_sync_uring_payload(sc, vq, BUS_DMASYNC_POSTREAD);
    613     return 1;
    614 }

The virtio device incremented its own index and wrote it to
vq->vq_used->idx.  But in the case where the interrupt was lost,
the data cache (un)luckily still held the previous vq->vq_used->idx,
so the CPU read the stale value from the cache(!).
The previous vq->vq_used->idx is equal to vq->vq_used_idx, so the
function returned 0 (meaning the vq is empty) even though the device
had signalled that entries were enqueued.

I think a call to vq_sync_uring_header(sc, vq, BUS_DMASYNC_PREREAD)
is needed to invalidate the cache line, so that the read at line 610
fetches a fresh vq->vq_used->idx?

This is the only case that I was able to observe by tracing on the
emulator, but many other places look similar.

The following four observations also support this assumption.
- qemu (68040, data cache not implemented): could not reproduce.
- nono (68030, data cache implemented): could reproduce.
- nono (68040, data cache not yet implemented): could not reproduce.
- nono (68030, data cache forcibly disabled): could not reproduce.

>How-To-Repeat:
Boot NetBSD/virt68k on an emulator that implements a data cache,
then access ld@virtio.  It is not 100% reproducible.

If something else updates the same cache line, the problem does not
reproduce.  When it does reproduce, I typically hit it within
20 repetitions of the following command pair.

 # mount /dev/ld1e /mnt; umount /mnt
  :

>Fix:
See above.
