NetBSD-Bugs archive


port-amd64/59424: hardclock ticks run at breakneck pace under qemu



>Number:         59424
>Category:       port-amd64
>Synopsis:       hardclock ticks run at breakneck pace under qemu
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-amd64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 16 01:45:00 +0000 2025
>Originator:     Taylor R Campbell
>Release:        current
>Organization:
The Clock of the Lapic Now Foundation
>Environment:
>Description:

	Recent current kernels running under qemu -- with nvmm on a
	netbsd-10 host, unchanged in a while -- have been running with
	dramatically accelerated hardclock timers, even when they are
	built with HZ=100 as confirmed by sysctl kern.clockrate:

# sysctl kern.clockrate
kern.clockrate: tick = 10000, tickadj = 40, hz = 100, profhz = 100, stathz = 100

	The timecounter (hpet) seems to run normally, so `date' shows
	the right progression of time and `sleep 10' sleeps for about
	10 seconds of real time.  But sampling hardclock_ticks with
	crash(8) shows it growing at about 500 ticks per second, rather
	than the expected 100 ticks per second, and:

# vmstat -i | grep timer
cpu0 timer     542109  507

	So the hardclock timer is running at about 500 Hz rather than
	100 Hz.  And while `date' and `sleep 10' work, callouts are
	accelerated so that it's hard to type without tripping wskbd
	repeat.

	I compared an older kernel that works (from February or so) and
	a newer kernel that is broken (from this past week) and this
	part stood out -- the `good' kernel had:

[     1.000004] cpu0: [re]calibrating local timer
[     1.000004] cpu0: apic clock running at 24 MHz
...
[     1.028453] cpu0: [re]calibrating local timer
[     1.028453] cpu0: apic clock running at 1000 MHz

	But the `bad' kernel only had the first message, not the
	second.  And printing lapic_per_second with crash(8) confirmed
	that it is 24 * 10^6, not 1000 * 10^6.

	I went looking for that message and lapic_per_second
	initialization, and found the following recent change:

changeset:   1192122:0ff4d825447c
branch:      trunk
user:        imil <imil%NetBSD.org@localhost>
date:        Fri May 02 07:08:10 2025 +0000
files:       sys/arch/x86/include/apicvar.h sys/arch/x86/x86/cpu.c sys/arch/x86/x86/identcpu_subr.c sys/arch/x86/x86/lapic.c
description:
Add support for CPUID leaf 0x40000010 to detect TSC and LAPIC frequency on
hypervisors implementing the VMware-defined interface

This change enables virtual machines to obtain TSC and LAPIC frequency
information directly from the hypervisor via CPUID leaf 0x40000010, avoiding
the need for runtime calibration, thus reducing boot speed in supported
environments.

Tested on GENERIC and MICROVM kernels, QEMU/KVM and QEMU/NVMM (current and
10.1), Intel and AMD CPUs, NetBSD/amd64 and i386.
...
diff -r 82b0bdcd458e -r 0ff4d825447c sys/arch/x86/x86/identcpu_subr.c
--- a/sys/arch/x86/x86/identcpu_subr.c	Fri May 02 03:26:26 2025 +0000
+++ b/sys/arch/x86/x86/identcpu_subr.c	Fri May 02 07:08:10 2025 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: identcpu_subr.c,v 1.13 2025/03/06 15:35:05 imil Exp $ */
+/* $NetBSD: identcpu_subr.c,v 1.14 2025/05/02 07:08:11 imil Exp $ */
 
 /*-
  * Copyright (c) 2020 The NetBSD Foundation, Inc.
...
@@ -133,12 +166,28 @@ cpu_tsc_freq_cpuid(struct cpu_info *ci)
 #if defined(_KERNEL) &&  NLAPIC > 0
 		if ((khz != 0) && (lapic_per_second == 0)) {
 			lapic_per_second = khz * 1000;
+			lapic_from_cpuid = true;
 			aprint_debug_dev(ci->ci_dev,
 			    "lapic_per_second set to %" PRIu32 "\n",
 			    lapic_per_second);
 		}
 #endif

	This appears to skip the second lapic calibration step on some
	machines where the _physical_ processor base frequency is
	determined by CPUID.  On this machine, the host reports the
	following, and it is passed through unchanged to the guest in
	qemu (nvmm does nothing to munge these CPUID leaves):

CPUID[15h]
	denom@eax=2 num@ebx=176 freq@ecx=0 zero@edx=0
CPUID[16h]
	basefreq@eax=2100MHz maxfreq@ebx=4200MHz busfreq@ecx=100MHz zero@edx=0

	cpu_tsc_freq_cpuid determines the frequency from this.

https://nxr.netbsd.org/xref/src/sys/arch/x86/x86/identcpu_subr.c?r=1.14#108

	But it winds up wrong, according to the second calibration with
	respect to the TSC.  My guess is that qemu chooses its own
	lapic frequency (1000 MHz) rather than taking the same
	frequency as the physical hardware (24 MHz), but since nvmm
	passes CPUID leaves 15h and 16h verbatim, the guest is
	confused.
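
	For reference, the CPUID[15h] ratio quoted above can be checked by
	hand.  The following is only a sketch, not the kernel's code: the
	helper name is hypothetical, and because freq@ecx is 0 the crystal
	frequency is not enumerated, so 24 MHz is assumed here to match
	the lapic_per_second value seen on the physical host.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not taken from identcpu_subr.c): TSC frequency
 * in Hz from the CPUID[15h] ratio num/denom and a core crystal
 * frequency in Hz.  Returns 0 if the ratio is not enumerated.
 */
static uint64_t
tsc_hz_from_cpuid15(uint32_t denom, uint32_t num, uint64_t crystal_hz)
{
	if (denom == 0 || num == 0)
		return 0;
	/* Multiply first so the integer division does not lose precision. */
	return crystal_hz * num / denom;
}
```

	With denom = 2, num = 176 and the assumed 24 MHz crystal, this
	gives 2112000000 Hz, close to the 2100 MHz figure in CPUID[16h].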

	Some relevant parameters from different contexts:

	(physical host)
crash> x/d lapic_per_second
lapic_per_second:       24000000
crash> x/d lapic_tval
lapic_tval:     24000
crash> x/d hz
hz:     1000

	(good kernel, nvmm, hardclock runs roughly at expected frequency)
crash> x/d lapic_per_second
lapic_per_second:       1000007000
crash> x/d lapic_tval
lapic_tval:     10000070
crash> x/d hz
hz:     100

	(bad kernel, nvmm, hardclock runs fast)
crash> x/d lapic_per_second
lapic_per_second:       24000000
crash> x/d lapic_tval
lapic_tval:     240000
crash> x/d hz
hz:     100

	(bad kernel, no nvmm, hardclock runs roughly at expected frequency)
crash> x/d lapic_per_second
lapic_per_second:       1000572000
crash> x/d lapic_tval
lapic_tval:     10005720
crash> x/d hz
hz:     100

	It's not yet clear to me why the hardclock timer is running at
	roughly 5x the frequency it should in the affected guests: the
	reported frequency of 24 MHz and the measured frequency of 1000
	MHz are off by a factor of about 42 (1000/24).  But maybe qemu
	is trying to attain a higher frequency and hitting the Nyquist
	frequency of the host's 1000 Hz hardclock timer, 500 Hz.
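
	The mismatch can be put in numbers with a small sketch.  It
	assumes that lapic_tval is lapic_per_second divided by hz, which
	is consistent with every dump above, and that the emulated timer
	really counts at the rate the good kernel calibrated.  Both
	function names are hypothetical and are not from lapic.c.

```c
#include <stdint.h>

/*
 * Timer count programmed for one hardclock tick, assuming the relation
 * lapic_tval = lapic_per_second / hz suggested by the crash(8) dumps.
 */
static uint32_t
lapic_tval_for(uint32_t assumed_per_second, uint32_t hz)
{
	return assumed_per_second / hz;
}

/*
 * Interrupts per second that result if the timer actually counts at
 * actual_per_second while being reloaded with tval each period.
 */
static double
timer_irq_rate(uint64_t actual_per_second, uint32_t tval)
{
	return (double)actual_per_second / (double)tval;
}
```

	With the bad kernel's numbers, lapic_tval_for(24000000, 100) is
	240000, and a timer counting at 1000007000 Hz would then fire
	about 4167 times per second, some 41.7 times the intended 100 Hz.
	That is well above the roughly 500 Hz observed, so this
	arithmetic alone does not explain the 5x figure.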


>How-To-Repeat:

	Boot a current kernel under qemu with nvmm.


>Fix:

	1. On the guest side: The change to take the hypervisor TSC
	   frequency from CPUID[4000_0010h] included another change
	   which is to skip lapic calibration even if we _don't_ get
	   the TSC frequency from that hypervisor leaf.  Maybe this
	   change is the right thing, generally, but I think it is
	   likely not necessary for fast-boot MICROVM (which goes via
	   the CPUID[4000_0010h] path instead anyway) and I think we
	   should revert this change.

	2. On the host side: nvmm should maybe not pass through the
	   lapic frequency verbatim, and should maybe munge it to
	   reflect the frequency that qemu emulates.  Perhaps other
	   hypervisors have some precedent here for reporting the lapic
	   frequency to the guest -- I haven't investigated.



