NetBSD-Bugs archive
kern/59618: occasional virtio block device lock ups/hangs
>Number: 59618
>Category: kern
>Synopsis: occasional virtio block device lock ups/hangs
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Aug 30 12:00:00 +0000 2025
>Originator: Christof Meerwald
>Release: 10.1
>Organization:
>Environment:
NetBSD linveo.cmeerw.net 10.1 NetBSD 10.1 (GENERIC) #22: Sat Aug 30 11:23:19 UTC 2025 cmeerw%linveo.cmeerw.net@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
AMD64 VPS locks up due to missing virtio block device notifications; see the thread starting at https://mail-index.netbsd.org/netbsd-users/2025/04/08/msg032527.html
db(0)> bt/a ffffdadd0c6dc500
trace: pid 0 lid 195 at 0xffff8981544e0d60
sleepq_block() at netbsd:sleepq_block+0x13a
cv_wait() at netbsd:cv_wait+0xb7
biowait() at netbsd:biowait+0x42
wapbl_buffered_flush() at netbsd:wapbl_buffered_flush+0xa2
wapbl_write_commit() at netbsd:wapbl_write_commit+0x28
wapbl_flush() at netbsd:wapbl_flush+0x552
ffs_sync() at netbsd:ffs_sync+0x176
VFS_SYNC() at netbsd:VFS_SYNC+0x22
sched_sync() at netbsd:sched_sync+0x90
db(0)>
This might be a lot more common on AMD Zen 5 (maybe also Zen 4?) architectures. In my case the VPS is running on an AMD Ryzen 9 9950X 16-Core Processor with 2 cores assigned to the VPS.
From https://mail-index.netbsd.org/netbsd-users/2025/08/29/msg033025.html:
In virtio.c we essentially have
vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
vq_sync_aring_header(sc, vq, BUS_DMASYNC_PREWRITE);
vq_sync_uring_header(sc, vq, BUS_DMASYNC_POSTREAD);
flags = virtio_rw16(sc, vq->vq_used->flags);
where the BUS_DMASYNC_PREWRITE is an sfence and the BUS_DMASYNC_POSTREAD
is an lfence, so we have:
vq->vq_avail->idx = virtio_rw16(sc, vq->vq_avail_idx);
x86_sfence();
x86_lfence();
flags = virtio_rw16(sc, vq->vq_used->flags);
And https://stackoverflow.com/a/50322404 argues that the store and
load can be reordered here, and this appears to be exactly what I am
seeing.
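The reduced sequence above can be modelled in portable C11 as a rough sketch. Everything here is hypothetical and not taken from virtio.c: the model_ring structure, the MODEL_F_NO_NOTIFY flag and model_publish_weak() are illustrative names, and mapping sfence to a release fence and lfence to an acquire fence is only an approximation. The point is that neither fence orders the earlier store against the later load, so the flags load may be satisfied before the index store is visible to the other side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "device asks not to be notified" flag. */
#define MODEL_F_NO_NOTIFY 1u

/* Hypothetical stand-in for the two ring headers shared with the device. */
struct model_ring {
	_Atomic uint16_t avail_idx;	/* written by driver, read by device */
	_Atomic uint16_t used_flags;	/* written by device, read by driver */
};

/*
 * Publish a new avail index, then decide whether to notify the device.
 * Returns 1 if a notification should be sent, 0 otherwise.
 */
static int
model_publish_weak(struct model_ring *r, uint16_t new_idx)
{
	uint16_t flags;

	assert(r != NULL);
	atomic_store_explicit(&r->avail_idx, new_idx, memory_order_relaxed);
	/* Orders earlier stores before later stores (the sfence role). */
	atomic_thread_fence(memory_order_release);
	/* Orders earlier loads before later loads (the lfence role). */
	atomic_thread_fence(memory_order_acquire);
	/* Nothing above forces the store to be visible before this load. */
	flags = atomic_load_explicit(&r->used_flags, memory_order_relaxed);
	return (flags & MODEL_F_NO_NOTIFY) == 0;
}
```

Run single-threaded, this behaves as expected. The problem only shows up when the other side updates used_flags concurrently: each side can then read the other's old value, and the notification is skipped even though the device has stopped polling.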
>How-To-Repeat:
In a VPS with a virtio block device running on an AMD Zen 5 CPU, run frequent syncs, e.g.
while sleep 0.4; do echo Hello >~/stress.txt; sync; rm ~/stress.txt; done
It might lock up after a few hours (maybe up to 2 days).
>Fix:
Add a full memory fence between the "vq->vq_avail->idx" store and the "vq->vq_used->flags" load.
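As a userland illustration of the proposed ordering (a sketch, not a kernel patch), the same publish-then-check step can be written with a sequentially consistent fence in C11. The sketch_ring structure, SKETCH_F_NO_NOTIFY and sketch_publish_fenced() are hypothetical names. An in-kernel fix would use the kernel's own barrier primitive rather than <stdatomic.h>. On x86, compilers typically implement this fence with mfence or a locked instruction, either of which orders the earlier store before the later load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "device asks not to be notified" flag. */
#define SKETCH_F_NO_NOTIFY 1u

/* Hypothetical stand-in for the two ring headers shared with the device. */
struct sketch_ring {
	_Atomic uint16_t avail_idx;	/* written by driver, read by device */
	_Atomic uint16_t used_flags;	/* written by device, read by driver */
};

/*
 * Publish a new avail index, then decide whether to notify the device.
 * Returns 1 if a notification should be sent, 0 otherwise.
 */
static int
sketch_publish_fenced(struct sketch_ring *r, uint16_t new_idx)
{
	uint16_t flags;

	assert(r != NULL);
	atomic_store_explicit(&r->avail_idx, new_idx, memory_order_relaxed);
	/* Full fence: the store above is ordered before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	flags = atomic_load_explicit(&r->used_flags, memory_order_relaxed);
	return (flags & SKETCH_F_NO_NOTIFY) == 0;
}
```

The fence only closes the window if the device side uses a matching full barrier between writing its flags and re-reading the avail index. That is assumed here, not verified.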