NetBSD-Bugs archive
port-amd64/59424: hardclock ticks run at breakneck pace under qemu
>Number: 59424
>Category: port-amd64
>Synopsis: hardclock ticks run at breakneck pace under qemu
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-amd64-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 16 01:45:00 +0000 2025
>Originator: Taylor R Campbell
>Release: current
>Organization:
The Clock of the Lapic Now Foundation
>Environment:
>Description:
Recent current kernels running under qemu -- with nvmm on a
netbsd-10 host, unchanged in a while -- have been running with
dramatically accelerated hardclock timers, even when they are
built with HZ=100 as confirmed by sysctl kern.clockrate:
# sysctl kern.clockrate
kern.clockrate: tick = 10000, tickadj = 40, hz = 100, profhz = 100, stathz = 100
The timecounter (hpet) seems to run normally, so `date' shows
the right progression of time and `sleep 10' sleeps for about
10sec of real time. But sampling hardclock_ticks with crash(8)
shows it growing at 500 ticks per second, rather than 100 ticks
per second as expected, and:
# vmstat -i | grep timer
cpu0 timer 542109 507
So the hardclock timer is running at about 500 Hz rather than
100 Hz. And while `date' and `sleep 10' work, callouts are
accelerated so that it's hard to type without tripping wskbd
repeat.
I compared an older kernel that works (from February or so) and
a newer kernel that is broken (from this past week) and this
part stood out -- the `good' kernel had:
[ 1.000004] cpu0: [re]calibrating local timer
[ 1.000004] cpu0: apic clock running at 24 MHz
...
[ 1.028453] cpu0: [re]calibrating local timer
[ 1.028453] cpu0: apic clock running at 1000 MHz
But the `bad' kernel only had the first message, not the
second. And printing lapic_per_second with crash(8) confirmed
that it is 24 * 10^6, not 1000 * 10^6.
I went looking for that message and lapic_per_second
initialization, and found the following recent change:
changeset: 1192122:0ff4d825447c
branch: trunk
user: imil <imil%NetBSD.org@localhost>
date: Fri May 02 07:08:10 2025 +0000
files: sys/arch/x86/include/apicvar.h sys/arch/x86/x86/cpu.c sys/arch/x86/x86/identcpu_subr.c sys/arch/x86/x86/lapic.c
description:
Add support for CPUID leaf 0x40000010 to detect TSC and LAPIC frequency on
hypervisors implementing the VMware-defined interface
This change enables virtual machines to obtain TSC and LAPIC frequency
information directly from the hypervisor via CPUID leaf 0x40000010, avoiding
the need for runtime calibration, thus reducing boot speed in supported
environments.
Tested on GENERIC and MICROVM kernels, QEMU/KVM and QEMU/NVMM (current and
10.1), Intel and AMD CPUs, NetBSD/amd64 and i386.
...
diff -r 82b0bdcd458e -r 0ff4d825447c sys/arch/x86/x86/identcpu_subr.c
--- a/sys/arch/x86/x86/identcpu_subr.c Fri May 02 03:26:26 2025 +0000
+++ b/sys/arch/x86/x86/identcpu_subr.c Fri May 02 07:08:10 2025 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: identcpu_subr.c,v 1.13 2025/03/06 15:35:05 imil Exp $ */
+/* $NetBSD: identcpu_subr.c,v 1.14 2025/05/02 07:08:11 imil Exp $ */
/*-
* Copyright (c) 2020 The NetBSD Foundation, Inc.
...
@@ -133,12 +166,28 @@ cpu_tsc_freq_cpuid(struct cpu_info *ci)
#if defined(_KERNEL) && NLAPIC > 0
if ((khz != 0) && (lapic_per_second == 0)) {
lapic_per_second = khz * 1000;
+ lapic_from_cpuid = true;
aprint_debug_dev(ci->ci_dev,
"lapic_per_second set to %" PRIu32 "\n",
lapic_per_second);
}
#endif
This appears to skip the second lapic calibration step on some
machines where the _physical_ processor base frequency is
determined by CPUID. On this machine, in the host and passed
through to the guest in qemu (nvmm does nothing to munge these
CPUID leaves):
CPUID[15h]
denom@eax=2 num@ebx=176 freq@ecx=0 zero@edx=0
CPUID[16h]
basefreq@eax=2100MHz maxfreq@ebx=4200MHz busfreq@ecx=100MHz zero@edx=0
cpu_tsc_freq_cpuid determines the frequency from this.
https://nxr.netbsd.org/xref/src/sys/arch/x86/x86/identcpu_subr.c?r=1.14#108
But it winds up wrong, according to the second calibration with
respect to the TSC. My guess is that qemu chooses its own
lapic frequency (1000 MHz) rather than taking the same
frequency as the physical hardware (24 MHz), but since nvmm
passes CPUID leaves 15h and 16h verbatim, the guest is
confused.
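As a rough check of the numbers above, the arithmetic behind CPUID[15h] can be sketched as follows. This is only an illustration, not the code in identcpu_subr.c: the helper names are made up, and the 24 MHz crystal frequency is taken from the lapic_per_second value reported here rather than from CPUID itself (ecx is 0 on this machine). Leaf 15h gives the TSC/crystal ratio as ebx/eax, so a 24 MHz crystal with ratio 176/2 corresponds to a TSC of about 2112 MHz.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.
 * CPUID[15h]: eax = denominator, ebx = numerator of the TSC/crystal
 * ratio, ecx = crystal frequency in Hz (0 if not enumerated).
 */

/* TSC frequency implied by a given crystal frequency and ratio. */
static uint64_t
tsc_hz_from_crystal(uint64_t crystal_hz, uint32_t num, uint32_t denom)
{
	if (num == 0 || denom == 0)
		return 0;	/* ratio not enumerated */
	return crystal_hz * num / denom;
}

/* Crystal frequency implied by a given TSC frequency and ratio. */
static uint64_t
crystal_hz_from_tsc(uint64_t tsc_hz, uint32_t num, uint32_t denom)
{
	if (num == 0 || denom == 0)
		return 0;	/* ratio not enumerated */
	return tsc_hz * denom / num;
}
```

With the values from this report, tsc_hz_from_crystal(24000000, 176, 2) gives 2112000000, and the inverse recovers 24000000. Both describe the physical host, which is why they say nothing about the lapic timer that qemu emulates.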
Some relevant parameters from different contexts:
(physical host)
crash> x/d lapic_per_second
lapic_per_second: 24000000
crash> x/d lapic_tval
lapic_tval: 24000
crash> x/d hz
hz: 1000
(good kernel, nvmm, hardclock runs roughly at expected frequency)
crash> x/d lapic_per_second
lapic_per_second: 1000007000
crash> x/d lapic_tval
lapic_tval: 10000070
crash> x/d hz
hz: 100
(bad kernel, nvmm, hardclock runs fast)
crash> x/d lapic_per_second
lapic_per_second: 24000000
crash> x/d lapic_tval
lapic_tval: 240000
crash> x/d hz
hz: 100
(bad kernel, no nvmm, hardclock runs roughly at expected frequency)
crash> x/d lapic_per_second
lapic_per_second: 1000572000
crash> x/d lapic_tval
lapic_tval: 10005720
crash> x/d hz
hz: 100
It's not yet clear to me why the hardclock timer is running at
roughly 5x the frequency it should in the affected guests: the
reported frequency of 24 MHz and the measured frequency of 1000
MHz are off by a factor of about 42. But maybe qemu is trying
to attain a higher frequency and hitting the Nyquist frequency
of the host's 1000 Hz hardclock timer, 500 Hz.
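To make the mismatch concrete, here is a small sketch of the arithmetic. It assumes that lapic_tval is simply lapic_per_second / hz, which matches every crash(8) sample above but is not taken from lapic.c, and that the emulated timer counts down at the measured ~1000 MHz. The function names are hypothetical.

```c
#include <stdint.h>

/*
 * Count programmed into the timer for one tick, assuming
 * tval = assumed timer frequency / hz (consistent with the
 * crash(8) output in this report).
 */
static uint32_t
tval_for(uint64_t assumed_timer_hz, uint32_t hz)
{
	return (uint32_t)(assumed_timer_hz / hz);
}

/*
 * Interrupt rate that results if the timer really counts down
 * at actual_timer_hz with the given count per tick.
 */
static double
tick_rate(uint64_t actual_timer_hz, uint32_t tval)
{
	return (double)actual_timer_hz / (double)tval;
}
```

Under these assumptions the bad kernel programs tval_for(24000000, 100) = 240000, and tick_rate(1000007000, 240000) comes out near 4167 Hz, not the roughly 500 Hz that vmstat -i reports. So this naive model predicts a much larger speedup than observed, which fits the guess that something on the host side caps the rate.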
>How-To-Repeat:
Boot a current kernel under qemu with nvmm.
>Fix:
1. On the guest side: The change to take the hypervisor TSC
frequency from CPUID[4000_0010h] included another change
which is to skip lapic calibration even if we _don't_ get
the TSC frequency from that hypervisor leaf. Maybe this
change is the right thing, generally, but I think it is
likely not necessary for fast-boot MICROVM (which goes via
the CPUID[4000_0010h] path instead anyway) and I think we
should revert this change.
2. On the host side: nvmm should maybe not pass through the
lapic frequency verbatim, and should maybe munge it to
reflect the frequency that qemu emulates. Perhaps other
hypervisors have some precedent here for reporting the lapic
frequency to the guest -- I haven't investigated.
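The guest-side policy in item 1 could look something like the sketch below. Everything in it is hypothetical (the enum, its values, and the function name do not exist in the tree); it only illustrates skipping calibration when the frequency came from the hypervisor leaf, and calibrating in every other case, including when the frequency was derived from CPUID[15h]/[16h].

```c
#include <stdbool.h>

/* Hypothetical: where the initial lapic frequency estimate came from. */
enum lapic_freq_source {
	LAPIC_FREQ_NONE,		/* nothing known yet */
	LAPIC_FREQ_CPUID_15H,		/* derived from CPUID[15h]/[16h] */
	LAPIC_FREQ_HYPERVISOR_LEAF	/* reported via CPUID[4000_0010h] */
};

/*
 * Hypothetical policy: trust only the hypervisor-reported value
 * enough to skip calibration; anything else gets measured against
 * the TSC as before.
 */
static bool
lapic_should_calibrate(enum lapic_freq_source src)
{
	return src != LAPIC_FREQ_HYPERVISOR_LEAF;
}
```

With this policy the machine in this report, whose frequency comes from CPUID[15h], would be calibrated again and would end up at the measured 1000 MHz, while MICROVM guests using the hypervisor leaf would still skip calibration.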