NetBSD-Bugs archive
port-amd64/59862: nvmm: support pvclock
>Number: 59862
>Category: port-amd64
>Synopsis: nvmm: support pvclock
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-amd64-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Dec 24 18:00:01 +0000 2025
>Originator: Taylor R Campbell
>Release: current, 11, 10, 9, ...
>Organization:
Host TimeBSD Virtualization
>Environment:
>Description:
Guests generally need access to a clock to keep time.
Operating systems for x86 often use RDTSC to keep time if the
CPU advertises an invariant TSC.
NVMM currently just passes through the relevant CPUID data, bit
8 of CPUID[EAX=0x80000007].EDX (which we call CPUID_APM_ITSC),
so on a host whose CPU advertises invariant TSC, the guest also
sees a CPU advertising invariant TSC.
On startup, NVMM exposes TSC=0. On VM exit, NVMM records the
guest's TSC as cpudata->gtsc. On VM entry after migration from
one host CPU to another host CPU -- and under certain other
circumstances, like nested page faults or I/O instructions --
NVMM arranges for the TSC to start counting in the guest where
it left off at the last VM exit, cpudata->gtsc.
This keeps the guest's TSC _monotonic_, but when migrating
across host CPUs, paging, or doing I/O, there may be long
delays of real time that are not reflected in the guest's TSC.
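To make the problem concrete, here is a minimal sketch of the resume-where-it-left-off behaviour described above. It assumes the usual TSC-offsetting model (guest TSC = host TSC + offset, modulo 2^64) and, for lost_ticks() only, that the host CPUs' TSCs are synchronized. The function names are hypothetical and are not NVMM's; only gtsc corresponds to a name in the report.

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible TSC under TSC offsetting (arithmetic is modulo 2^64). */
static inline uint64_t
guest_tsc(uint64_t host_tsc, uint64_t offset)
{
	return host_tsc + offset;
}

/*
 * Offset that makes the guest resume counting from gtsc (the value
 * recorded at the last VM exit) when it is entered at host TSC
 * host_tsc_entry.
 */
static inline uint64_t
resume_offset(uint64_t gtsc, uint64_t host_tsc_entry)
{
	return gtsc - host_tsc_entry;
}

/*
 * Host ticks that pass between exit and entry and never show up in
 * the guest's TSC.  Assumes the two host TSC readings are comparable.
 */
static inline uint64_t
lost_ticks(uint64_t host_tsc_exit, uint64_t host_tsc_entry)
{
	assert(host_tsc_entry >= host_tsc_exit);
	return host_tsc_entry - host_tsc_exit;
}
```

For example, if the guest exits with gtsc = 400 at host TSC 1000 and is entered again at host TSC 5000, the guest reads 400 again, and the 4000 intervening ticks are invisible to it.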
One way to fix this might be to record the systemwide uptime in
seconds since boot on every VM exit, and again on migration, so
we know how many real time seconds elapsed in the migration,
and convert that back to a TSC offset.
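A rough sketch of that conversion follows. It assumes the elapsed time is available in nanoseconds (a finer unit than whole seconds, so short stalls are not rounded away) and that the TSC frequency in Hz is known; both are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	UINT64_C(1000000000)

/*
 * Convert elapsed nanoseconds to TSC ticks at tsc_hz.  Seconds and
 * the sub-second remainder are scaled separately so the intermediate
 * product fits in 64 bits for any tsc_hz below 2^34 (about 17 GHz).
 */
static inline uint64_t
ns_to_tsc_ticks(uint64_t ns, uint64_t tsc_hz)
{
	uint64_t sec = ns / NSEC_PER_SEC;
	uint64_t rem = ns % NSEC_PER_SEC;

	assert(tsc_hz != 0 && tsc_hz < (UINT64_C(1) << 34));
	return sec * tsc_hz + rem * tsc_hz / NSEC_PER_SEC;
}

/*
 * TSC offset (guest TSC = host TSC + offset, modulo 2^64) such that
 * the guest resumes at gtsc plus the ticks that elapsed while it was
 * not running.
 */
static inline uint64_t
resume_offset_with_elapsed(uint64_t gtsc, uint64_t elapsed_ns,
    uint64_t tsc_hz, uint64_t host_tsc_entry)
{
	return gtsc + ns_to_tsc_ticks(elapsed_ns, tsc_hz) - host_tsc_entry;
}
```

With gtsc = 400, 1000 ns elapsed at 1 GHz, and entry at host TSC 5000, the guest would resume at 1400 rather than 400.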
What other hypervisors do -- including Xen and KVM -- is expose
a page shared between the guest and the host with a structure
containing a version number and parameters for reading out the
time:
	struct vcpu_time_info {
		uint32_t version;
		...
		uint64_t tsc_timestamp;
		uint64_t system_time;
		uint32_t tsc_to_system_mul;
		int8_t tsc_shift;
		...
	};
Whenever the guest migrates from one host CPU to another, the
host updates the parameters as needed, and bumps the version,
so that the following algorithm gives a monotonic view of
`system time' _at a roughly constant frequency_ even in the
face of delays during migration across host CPUs (or even hosts
altogether):
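	do {
		while ((v = vt->version) & 1)
			continue; /* update in progress */
		delta_tsc = rdtsc() - vt->tsc_timestamp;
		/* tsc_shift is signed: shift left if >= 0, right if < 0 */
		if (vt->tsc_shift >= 0)
			delta_tsc <<= vt->tsc_shift;
		else
			delta_tsc >>= -vt->tsc_shift;
		/* tsc_to_system_mul is a 32-bit binary fraction, so drop
		 * the low 32 bits of the (wider than 64-bit) product */
		delta_systime = (delta_tsc * vt->tsc_to_system_mul) >> 32;
		system_time = vt->system_time + delta_systime;
	} while (v != vt->version);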
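As a fuller illustration, here is a sketch of both sides of this protocol: the host bumping the version around an update, and the guest retrying until it sees a stable even version. It includes the signed shift and the 32-bit fixed-point scaling (multiply, then drop the low 32 bits) used by the Xen and KVM interfaces. The struct holds only the fields named above, so it is not the real shared-page layout. The names vti_sketch, fake_tsc, read_tsc, scale_delta, host_update and guest_read are hypothetical, and read_tsc() stands in for RDTSC so the sketch runs anywhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Named fields only; padding and other fields are omitted (not ABI-exact). */
struct vti_sketch {
	_Atomic uint32_t version;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t tsc_shift;
};

static uint64_t fake_tsc;		/* stand-in for the hardware TSC */
static inline uint64_t read_tsc(void) { return fake_tsc; }

/* (delta shifted by a signed amount) * mul / 2^32, truncated to 64 bits. */
static inline uint64_t
scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;
	/* Split into 32-bit halves to avoid needing a 128-bit product. */
	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Host side: odd version while updating, even again when done. */
static inline void
host_update(struct vti_sketch *vt, uint64_t tsc_now, uint64_t systime_now,
    uint32_t mul, int8_t shift)
{
	uint32_t v = atomic_load_explicit(&vt->version, memory_order_relaxed);

	atomic_store_explicit(&vt->version, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	vt->tsc_timestamp = tsc_now;
	vt->system_time = systime_now;
	vt->tsc_to_system_mul = mul;
	vt->tsc_shift = shift;
	atomic_store_explicit(&vt->version, v + 2, memory_order_release);
}

/* Guest side: retry until the same even version is seen before and after. */
static inline uint64_t
guest_read(struct vti_sketch *vt)
{
	uint32_t v;
	uint64_t systime;

	do {
		while ((v = atomic_load_explicit(&vt->version,
		    memory_order_acquire)) & 1)
			continue;	/* update in progress */
		systime = vt->system_time + scale_delta(
		    read_tsc() - vt->tsc_timestamp,
		    vt->tsc_to_system_mul, vt->tsc_shift);
		atomic_thread_fence(memory_order_acquire);
	} while (v != atomic_load_explicit(&vt->version,
	    memory_order_relaxed));
	return systime;
}
```

With tsc_to_system_mul = 0x80000000 (one half) and tsc_shift = 1, one TSC tick scales to one unit of system time, so a host update at TSC 1000 with system_time 5000 followed by a guest read at TSC 1600 yields 5600.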
Essentially every OS has a paravirtualized clock driver
(`pvclock') for this. The way it is exposed to guests varies
from hypervisor to hypervisor -- in Xen, the page is mapped by
Xen hypercalls; in KVM, the page is discovered and exposed
through KVM-specific MSRs -- but the structure and algorithm
seem to be the same for everyone.
We should implement this, maybe via the KVM MSRs so existing
operating systems can take advantage of it automatically.
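For orientation, here is a small sketch of the KVM-style registration. As documented for KVM, the guest writes the guest-physical address of its time page, with bit 0 set as an enable flag, to MSR_KVM_SYSTEM_TIME_NEW (0x4b564d01). The helper names and the pvclock_reg struct below are hypothetical, and rejecting a misaligned address is an assumed policy, not something NVMM or KVM is claimed to do this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_KVM_WALL_CLOCK_NEW	0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW	0x4b564d01
#define KVM_SYSTEM_TIME_ENABLE	UINT64_C(1)	/* bit 0 of the MSR value */

/* Hypothetical per-vCPU record of what the guest registered. */
struct pvclock_reg {
	bool enabled;
	uint64_t gpa;	/* guest-physical address of the time page */
};

/* Guest side: value to write to the MSR for a 4-byte-aligned page address. */
static inline uint64_t
encode_system_time_msr(uint64_t gpa)
{
	assert((gpa & 3) == 0);
	return gpa | KVM_SYSTEM_TIME_ENABLE;
}

/*
 * Host side: decode a value the guest wrote.  Returns false if the
 * address part is not 4-byte aligned (assumed policy for this sketch).
 */
static inline bool
decode_system_time_msr(uint64_t val, struct pvclock_reg *reg)
{
	uint64_t gpa = val & ~KVM_SYSTEM_TIME_ENABLE;

	if ((gpa & 3) != 0)
		return false;
	reg->enabled = (val & KVM_SYSTEM_TIME_ENABLE) != 0;
	reg->gpa = gpa;
	return true;
}
```

A hypervisor taking this route would intercept WRMSR to that MSR number, decode the value as above, and from then on keep the time structure at that guest-physical address up to date.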
NOTE: This will not fix the problem where sleeps take too long
in the guest. That happens because the host NetBSD doesn't
have fine enough resolution for sleeping: PR kern/43997: Kernel
timer discrepancies.
>How-To-Repeat:
Use TSC as timecounter in a guest on a host that is heavily
loaded and migrating threads across CPUs regularly.
>Fix:
Yes, please!