NetBSD-Bugs archive

kern/59618: occasional virtio block device lock ups/hangs



>Number:         59618
>Category:       kern
>Synopsis:       occasional virtio block device lock ups/hangs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Aug 30 12:00:00 +0000 2025
>Originator:     Christof Meerwald
>Release:        10.1
>Organization:
>Environment:
NetBSD linveo.cmeerw.net 10.1 NetBSD 10.1 (GENERIC) #22: Sat Aug 30 11:23:19 UTC 2025  cmeerw%linveo.cmeerw.net@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
An AMD64 VPS locks up due to missing virtio block device notifications; see the thread starting at https://mail-index.netbsd.org/netbsd-users/2025/04/08/msg032527.html

db(0)> bt/a ffffdadd0c6dc500
trace: pid 0 lid 195 at 0xffff8981544e0d60
sleepq_block() at netbsd:sleepq_block+0x13a
cv_wait() at netbsd:cv_wait+0xb7
biowait() at netbsd:biowait+0x42
wapbl_buffered_flush() at netbsd:wapbl_buffered_flush+0xa2
wapbl_write_commit() at netbsd:wapbl_write_commit+0x28
wapbl_flush() at netbsd:wapbl_flush+0x552
ffs_sync() at netbsd:ffs_sync+0x176
VFS_SYNC() at netbsd:VFS_SYNC+0x22
sched_sync() at netbsd:sched_sync+0x90
db(0)>

This might be much more common on AMD Zen 5 (and perhaps also Zen 4) architectures. In my case the VPS is running on an AMD Ryzen 9 9950X 16-Core Processor with 2 cores assigned to the VPS.

From https://mail-index.netbsd.org/netbsd-users/2025/08/29/msg033025.html:

In virtio.c we essentially have

                vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
                vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);

                vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
                flags = virtio_rw16(sc, vq->vq_used->flags);

where the BUS_DMASYNC_PREWRITE is an sfence and BUS_DMASYNC_POSTREAD is
an lfence, so we have:

                vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
                x86_sfence();

                x86_lfence();
                flags = virtio_rw16(sc, vq->vq_used->flags);

And https://stackoverflow.com/a/50322404 argues that the store and
load can still be reordered here, because neither sfence nor lfence
orders an earlier store against a later load. This appears to be
exactly what I am seeing.
>How-To-Repeat:
In a VPS with a virtio block device running on an AMD Zen 5 CPU, run frequent syncs, e.g.

  while sleep 0.4; do echo Hello >~/stress.txt; sync; rm ~/stress.txt; done

The system might lock up after a few hours (possibly only after up to 2 days).
>Fix:
Add a full memory fence between the "vq->vq_avail->idx" store and the "vq->vq_used->flags" load.
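
The ordering requirement can be sketched in portable C11 as follows. This is
only an illustration and not the actual virtio.c change: struct ring_sketch,
publish_and_check_notify() and USED_F_NO_NOTIFY are hypothetical stand-ins
for the real virtqueue fields and for virtio's VRING_USED_F_NO_NOTIFY flag.
The point is the sequentially consistent fence between the store and the
load, which compilers typically implement on x86 with mfence or a locked
instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; mirrors the value of virtio's VRING_USED_F_NO_NOTIFY. */
#define USED_F_NO_NOTIFY 1u

/* Hypothetical, simplified stand-in for the shared ring fields. */
struct ring_sketch {
	_Atomic uint16_t avail_idx;	/* written by driver, read by device */
	_Atomic uint16_t used_flags;	/* written by device, read by driver */
};

/*
 * Publish a new avail index, then decide whether the device needs a
 * notification.  Returns true if a notification must be sent.
 */
static bool
publish_and_check_notify(struct ring_sketch *r, uint16_t new_idx)
{
	uint16_t flags;

	/* Store: make the new descriptors visible to the device. */
	atomic_store_explicit(&r->avail_idx, new_idx, memory_order_relaxed);

	/*
	 * Full fence: the store above must be globally visible before
	 * the load below is performed (StoreLoad ordering).
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* Load: has the device asked us to suppress notifications? */
	flags = atomic_load_explicit(&r->used_flags, memory_order_relaxed);

	return (flags & USED_F_NO_NOTIFY) == 0;
}
```

Without the full fence, the load of the flags may be satisfied before the
store of the index becomes visible, so the driver can skip a notification
that the device is still waiting for.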
