port-amd64/59862: nvmm: support pvclock

To: port-amd64-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: port-amd64/59862: nvmm: support pvclock
From: campbell+netbsd%mumble.net@localhost
Date: Wed, 24 Dec 2025 18:00:01 +0000 (UTC)

>Number:         59862
>Category:       port-amd64
>Synopsis:       nvmm: support pvclock
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-amd64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 24 18:00:01 +0000 2025
>Originator:     Taylor R Campbell
>Release:        current, 11, 10, 9, ...
>Organization:
Host TimeBSD Virtualization
>Environment:
>Description:

	Guests generally need access to a clock to keep time.

	Operating systems for x86 often use RDTSC to keep time if the
	CPU advertises an invariant TSC.

	NVMM currently just passes through the relevant CPUID data, bit
	8 of CPUID[EAX=0x80000007].EDX (which we call CPUID_APM_ITSC),
	so on hosts on CPUs advertising invariant TSC, the guest also
	sees a CPU advertising invariant TSC.

	On startup, NVMM exposes TSC=0.  On VM exit, NVMM records the
	guest's TSC as cpudata->gtsc.  On VM entry after migration from
	one host CPU to another host CPU -- and under certain other
	circumstances, like nested page faults or I/O instructions --
	NVMM arranges for the TSC to start counting in the guest where
	it left off at the last VM exit, cpudata->gtsc.

	This keeps the guest's TSC _monotonic_, since it always resumes
	from the last value the guest saw, but when migrating across
	host CPUs or paging or doing I/O there may be long delays of
	real time not reflected in the guest's TSC.

	One way to fix this might be to record the systemwide uptime in
	seconds since boot on every VM exit, and again on migration, so
	we know how many real time seconds elapsed in the migration,
	and convert that back to a TSC offset.
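
	A minimal sketch of that conversion, assuming the host can
	supply a monotonic timestamp in nanoseconds and knows the TSC
	frequency in Hz.  The helper names are hypothetical and are not
	part of NVMM:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert elapsed host time in nanoseconds into
 * TSC ticks at the given TSC frequency.  Whole seconds and the
 * sub-second remainder are scaled separately so the intermediate
 * products fit in 64 bits for frequencies below 2^34 Hz.
 */
static uint64_t
ns_to_tsc_ticks(uint64_t elapsed_ns, uint64_t tsc_freq_hz)
{
	const uint64_t ns_per_sec = 1000000000ULL;
	uint64_t sec = elapsed_ns / ns_per_sec;
	uint64_t rem = elapsed_ns % ns_per_sec;

	assert(tsc_freq_hz != 0);
	return sec * tsc_freq_hz + rem * tsc_freq_hz / ns_per_sec;
}

/*
 * Hypothetical helper: the guest TSC value to resume at, given the
 * guest TSC saved at VM exit and host monotonic timestamps (in
 * nanoseconds) taken at VM exit and at the following VM entry.
 */
static uint64_t
guest_tsc_after_gap(uint64_t gtsc_at_exit, uint64_t exit_ns,
    uint64_t entry_ns, uint64_t tsc_freq_hz)
{
	assert(entry_ns >= exit_ns);
	return gtsc_at_exit +
	    ns_to_tsc_ticks(entry_ns - exit_ns, tsc_freq_hz);
}
```

	For example, a 500 ns gap at 2 GHz advances the guest TSC by
	1000 ticks.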

	What other hypervisors do -- including Xen and KVM -- is expose
	a page shared between the guest and the host with a structure
	containing a version number and parameters for reading out the
	time:


		struct vcpu_time_info {
			uint32_t version;
			...
			uint64_t tsc_timestamp;
			uint64_t system_time;
			uint32_t tsc_to_system_mul;
			int8_t tsc_shift;
			...
		};

	Whenever the guest migrates from one host CPU to another, the
	host updates the parameters as needed, and bumps the version,
	so that the following algorithm gives a monotonic view of
	`system time' _at a roughly constant frequency_ even in the
	face of delays during migration across host CPUs (or even hosts
	altogether):

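		do {
			while ((v = vt->version) & 1)
				continue;	/* update in progress */
			delta_tsc = rdtsc() - vt->tsc_timestamp;
			if (vt->tsc_shift >= 0)
				delta_tsc <<= vt->tsc_shift;
			else		/* negative shift: shift right */
				delta_tsc >>= -vt->tsc_shift;
			/* tsc_to_system_mul is a 32-bit binary fraction */
			delta_systime = ((unsigned __int128)delta_tsc *
			    vt->tsc_to_system_mul) >> 32;
			system_time = vt->system_time + delta_systime;
		} while (v != vt->version);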
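
	The host side of this protocol might look roughly like the
	following sketch.  It assumes a single writer per structure, C11
	atomics, and a compiler with unsigned __int128 (GCC or Clang on
	a 64-bit target).  The struct holds only the fields named above,
	so it is not the real shared-page layout, and tsc_scale() and
	vti_publish() are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative subset of the shared structure; NOT the real layout. */
struct vti_sketch {
	_Atomic uint32_t version;
	uint64_t tsc_timestamp;
	uint64_t system_time;		/* nanoseconds */
	uint32_t tsc_to_system_mul;
	int8_t tsc_shift;
};

/*
 * Hypothetical helper: pick mul and shift so that
 *	ns = ((ticks << shift) * mul) >> 32
 * (with a right shift when shift is negative) for a TSC running at
 * tsc_freq_hz.  The loops normalize the ratio so that mul lands in
 * [2^31, 2^32) and keeps as much precision as 32 bits allow.
 */
static void
tsc_scale(uint64_t tsc_freq_hz, uint32_t *mul, int8_t *shift)
{
	uint64_t num = 1000000000ULL;	/* nanoseconds per second */
	uint64_t den = tsc_freq_hz;
	int8_t s = 0;

	assert(tsc_freq_hz != 0 && tsc_freq_hz < (1ULL << 40));
	while (den > 2 * num) {		/* fast TSC: shift ticks right */
		num <<= 1;
		s--;
	}
	while (den <= num) {		/* slow TSC: shift ticks left */
		den <<= 1;
		s++;
	}
	*mul = (uint32_t)((((unsigned __int128)num) << 32) / den);
	*shift = s;
}

/*
 * Hypothetical helper: publish new parameters.  The version is odd
 * while the fields are being written and even once they are stable.
 */
static void
vti_publish(struct vti_sketch *vt, uint64_t tsc_now, uint64_t ns_now,
    uint32_t mul, int8_t shift)
{
	uint32_t v = atomic_load_explicit(&vt->version,
	    memory_order_relaxed);

	assert((v & 1) == 0);		/* no update already in flight */
	atomic_store_explicit(&vt->version, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	vt->tsc_timestamp = tsc_now;
	vt->system_time = ns_now;
	vt->tsc_to_system_mul = mul;
	vt->tsc_shift = shift;
	atomic_store_explicit(&vt->version, v + 2, memory_order_release);
}
```

	With a 4 GHz TSC, tsc_scale() yields mul = 0x80000000 and
	shift = -1, so four ticks count as one nanosecond.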

	Many operating systems, including Linux, FreeBSD, and OpenBSD,
	have a paravirtualized clock driver (`pvclock') for this.  The
	way it is exposed to guests varies from hypervisor to
	hypervisor -- in Xen, the page is mapped by Xen hypercalls; in
	KVM, the guest registers the page's physical address by writing
	to a KVM-specific MSR -- but the structure and algorithm seem
	to be the same for everyone.

	We should implement this, maybe via the KVM MSRs so existing
	operating systems can take advantage of it automatically.

	NOTE: This will not fix the problem where sleeps take too long
	in the guest.  That happens because the host NetBSD doesn't
	have fine enough resolution for sleeping: PR kern/43997: Kernel
	timer discrepancies.

>How-To-Repeat:

	Use TSC as timecounter in a guest on a host that is heavily
	loaded and migrating threads across CPUs regularly.

>Fix:

	Yes, please!
