Port-xen archive


Xen timecounter



There are a few problems with the current xen_system_time timecounter
that cause it to sometimes run backwards, even though a timecounter is
supposed to be a globally ordered monotonic clock.  (Whether we should
also have an MI API for per-CPU monotonic clocks requiring no global
ordering is a separate question that I'm not addressing right now.)

Here are some of the problems leading to the symptom of a nonmonotonic
clock, as noted in, e.g., PR 52664 <https://gnats.netbsd.org/52664>:

1. Xen's system_time/tsc_timestamp sometimes go backwards by a little.

2. The CPU's TSC sometimes goes backwards, perhaps because the vCPU
   got migrated to another pCPU.

3. xen_get_timecount is not globally ordered, so calling it on one
   (v)CPU and then calling it on another (v)CPU may give an earlier
   answer.

4. xen_get_timecount is not even approximately synchronized like the
   non-Xen x86 tsc_timecounter is.

5. The mechanism of catching up on missed hardclock ticks by rewinding
   the timecounter, ummm, has the effect of rewinding the timecounter,
   which makes it rather non-monotonic.  It seems to me that (a) this
   idea of a fake clock is fundamentally problematic, and (b) we
   should reduce our dependency on mandatory accurate hardclock ticks
   so that we can go tickless anyway.

6. By code inspection, our mechanism for reading the shared Xen
   vcpu_time_info record appears racy.

7. By code inspection, our reliance on cpu_frequency(curcpu()) in
   xen_delay seems likely to go wrong if we ever migrate to a pCPU at
   a different clock rate, since cpu_frequency(curcpu()) is cached at
   boot and never recalibrated.
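
For reference, each vCPU's system time is computed roughly like this
from the fields of Xen's struct vcpu_time_info (a sketch of the
arithmetic only; the attached clock.c wraps the real thing in a
version-checked read loop and splits the multiply to avoid overflow):

	delta_tsc  = rdtsc() - vt->tsc_timestamp;
	delta_ns   = ((delta_tsc << vt->tsc_shift)
			* vt->tsc_to_system_mul) >> 32;
			/* shift right instead if tsc_shift < 0 */
	systime_ns = vt->system_time + delta_ns;

Problems 1 and 2 are about those inputs individually going backwards;
problems 3 and 4 are about the per-CPU results not being ordered or
synchronized across CPUs.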

So I wrote a draft replacement for xen/clock.c, attached.  Normally
I'm not keen on wholesale rewrites like this, but it was hard to
empirically study how Xen behaved while also figuring out how
xen/clock.c was supposed to work.  This draft survives light testing
under load in domU and produces a monotonic clock all the way into
userland.

Feedback welcome!  I'm not a Xen wizard.  Is this something that we
should consider adopting for NetBSD/xen, or am I barking up the wrong
trees here to fix the nonmonotonicity of the Xen timecounter?


Some things worth noting:

- This actually provides a monotonic clock.

- The only global ordering happens at the highest level, when actually
  querying the timecounter; there's no global mutex for all the state,
  just one atomic operation to advance the latest global view of the
  clock.

  Thus if we ever do add a per-CPU timecounter API, it will be easy to
  adapt this to it: while the global one uses xen_global_systime_ns, a
  local one would simply use xen_vcputime_systime_ns instead, with no
  atomics or anything, just membars to read from the Xen hypervisor.

- This relies on atomic_cas_64.  If that's a problem (does Xen run on
  any hardware without cmpxchg8b?), we could replace it by a mutex --
  no worse than the current code; see the sketch after these notes --
  or work a little more carefully with a 32-bit counter.

- There's no calibration of the vCPU time between vCPUs.  Maybe we
  should do that (particularly if we want to work in a 32-bit
  counter); maybe it's not important because the Xen hypervisor keeps
  system time within a reasonable error of a few ms.

- There's a scattering of prints to diagnose failures of the Xen
  clock.  We could turn these into dtrace SDT probes or event
  counters, perhaps, to reduce printf clutter and help future
  diagnostics.  I haven't yet put in heuristics to decide when to
  tc_gonebad the Xen timecounter.

  So far, I've only seen the Xen snapshot of the clock go backwards
  internally by a few hundred nanoseconds (the system_time itself has
  gone backwards by a few hundred microseconds, but that must have
  been compensated for by the advance of the TSC timestamp).

- All per-CPU records are stored in struct cpu_info, not in dense
  arrays, so there's no spurious cache line sharing.

- I preserved logic under #if 0 for hardclock catch-up.  Under #if 1,
  there's logic to print a note when it looks like we may have missed
  some hardclock ticks.  I haven't seen this print fire yet except at
  boot or during ddb.

- xen_delay uses a snapshot of the TSC frequency reported by Xen for
  short (<500us) delays, rather than whatever frequency was computed
  at boot for cpu_frequency(curcpu()).

- I did nothing to respect PVCLOCK_TSC_STABLE_BIT in vcpu_time_info
  flags, since the Xen host I was testing on doesn't seem to support
  it.  Maybe if the hypervisor provides that bit, we could skip some
  of the global ordering overhead.  I would want to examine the
  behaviour of such a clock first, though!

- I made sure to guarantee that references to the current CPU are
  stable, with kasserts.  I think this isn't strictly necessary on Xen
  since kernel preemption is disabled, but this should make it safer
  to enable kernel preemption in the future.
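
As a concrete illustration of the mutex fallback for atomic_cas_64
mentioned in the notes above, the global-stamp advance might look
roughly like this (a sketch only, not part of the attached patch; the
lock and its initialization are hypothetical, and the curlwp_bind
dance from the real xen_global_systime_ns is omitted for brevity):

	/* Hypothetical; initialized with mutex_init(..., IPL_HIGH).  */
	static kmutex_t xen_global_systime_lock;
	static uint64_t xen_global_systime_ns_stamp;

	static uint64_t
	xen_global_systime_ns(void)
	{
		struct cpu_info *ci = curcpu();
		uint64_t local, result;

		local = xen_vcputime_systime_ns() +
		    ci->ci_xen_systime_ns_skew;

		mutex_spin_enter(&xen_global_systime_lock);
		if (local <= xen_global_systime_ns_stamp) {
			result = xen_global_systime_ns_stamp + 1;
			ci->ci_xen_systime_ns_skew += result - local;
		} else {
			result = local;
		}
		xen_global_systime_ns_stamp = result;
		mutex_spin_exit(&xen_global_systime_lock);

		return result;
	}

The stamp no longer needs to be read or updated atomically, since the
mutex serializes it, at the cost of a global lock on every timecounter
read.
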
Index: sys/arch/x86/include/cpu.h
===================================================================
RCS file: /cvsroot/src/sys/arch/x86/include/cpu.h,v
retrieving revision 1.80
diff -p -u -r1.80 cpu.h
--- sys/arch/x86/include/cpu.h	9 Oct 2017 17:49:27 -0000	1.80
+++ sys/arch/x86/include/cpu.h	31 Oct 2017 05:08:47 -0000
@@ -225,6 +225,43 @@ struct cpu_info {
 	char		ci_iomap[IOMAPSIZE]; /* I/O Bitmap */
 	int ci_tss_sel;			/* TSS selector of this cpu */
 
+#ifdef XEN
+	/* Xen raw system time at which we last ran hardclock.  */
+	uint64_t	ci_xen_hardclock_systime_ns;
+
+	/*
+	 * Duration in nanoseconds from Xen raw system time of last
+	 * hardclock tick to current tsc-adjusted Xen system time.
+	 * Nonzero only when we're processing delayed hardclock ticks.
+	 */
+	uint64_t	ci_xen_systime_delta_ns;
+
+	/*
+	 * Last Xen raw system time we observed.  Used to detect when
+	 * the Xen raw system time has gone backwards.
+	 */
+	uint64_t	ci_xen_last_raw_systime_ns;
+
+	/*
+	 * Last tsc_timestamp we read from Xen.
+	 */
+	uint64_t	ci_xen_last_tsc_timestamp;
+
+	/*
+	 * Last tsc we read with rdtsc during Xenny stuff.
+	 */
+	uint64_t	ci_xen_last_tsc;
+
+	/*
+	 * Distance in nanoseconds from the local view of system time
+	 * to the global view of system time.
+	 */
+	uint64_t	ci_xen_systime_ns_skew;
+
+	/* Event counter for every hardclock.  (XXX ...Why?)  */
+	struct evcnt	ci_xen_hardclock_evcnt;
+#endif
+
 	/*
 	 * The following two are actually region_descriptors,
 	 * but that would pollute the namespace.
/*	$NetBSD$	*/

/*-
 * Copyright (c) 2017 The NetBSD Foundation, Inc.
 * All rights reserved.
 *
 * This code is derived from software contributed to The NetBSD Foundation
 * by Taylor R. Campbell.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
 * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

#include "opt_xen.h"

#include <sys/cdefs.h>
__KERNEL_RCSID(0, "$NetBSD$");

#include <sys/param.h>
#include <sys/types.h>
#include <sys/atomic.h>
#include <sys/callout.h>
#include <sys/cpu.h>
#include <sys/device.h>
#include <sys/evcnt.h>
#include <sys/intr.h>
#include <sys/kernel.h>
#include <sys/lwp.h>
#include <sys/proc.h>
#include <sys/sysctl.h>
#include <sys/systm.h>
#include <sys/time.h>
#include <sys/timetc.h>

#include <dev/clock_subr.h>

#include <machine/cpu.h>
#include <machine/cpu_counter.h>
#include <machine/lock.h>

#include <xen/xen.h>
#include <xen/hypervisor.h>
#include <xen/evtchn.h>
#include <xen/xen-public/vcpu.h>

#include <x86/rtc.h>

#define NS_PER_TICK ((uint64_t)1000000000ULL/hz)

static uint64_t	xen_vcputime_systime_ns(void);
static uint64_t	xen_vcputime_raw_systime_ns(void);
static void	xen_wallclock_time(struct timespec *);
static uint64_t	xen_global_systime_ns(void);
static unsigned	xen_get_timecount(struct timecounter *);
static int	xen_rtc_get(struct todr_chip_handle *, struct timeval *);
static int	xen_rtc_set(struct todr_chip_handle *, struct timeval *);
static int	xen_timer_handler(void *, struct intrframe *);

/*
 * xen timecounter:
 *
 *	Xen vCPU system time, plus an adjustment with rdtsc.
 */
static struct timecounter xen_timecounter = {
	.tc_get_timecount = xen_get_timecount,
	.tc_poll_pps = NULL,
	.tc_counter_mask = ~0U,
	.tc_frequency = 1000000000ULL,	/* 1 GHz, i.e. units of nanoseconds */
	.tc_name = "xen_system_time",
	.tc_quality = 10000,
};

/*
 * xen_global_systime_ns_stamp
 *
 *	The latest Xen vCPU system time that has been observed on any
 *	CPU, for a global monotonic view of the Xen system time clock.
 */
static volatile uint64_t xen_global_systime_ns_stamp __cacheline_aligned;

/*
 * xen time of day register:
 *
 *	Xen wall clock time, plus a Xen vCPU system time adjustment.
 */
static struct todr_chip_handle xen_todr_chip = {
	.todr_gettime = xen_rtc_get,
	.todr_settime = xen_rtc_set,
};

#ifdef DOM0OPS
/*
 * xen timepush state:
 *
 *	Callout to periodically, after a sysctl-configurable number of
 *	NetBSD ticks, set the Xen hypervisor's wall clock time.
 */
static struct {
	struct callout	ch;
	int		ticks;
} xen_timepush;

static void	xen_timepush_init(void);
static void	xen_timepush_intr(void *);
static int	sysctl_xen_timepush(SYSCTLFN_ARGS);
#endif

/*
 * startrtclock()
 *
 *	Initialize the real-time clock from x86 machdep autoconf.
 */
void
startrtclock(void)
{

	todr_attach(&xen_todr_chip);
}

/*
 * setstatclockrate(rate)
 *
 *	Set the statclock to run at rate, in units of ticks per second.
 *
 *	Currently Xen does not have a separate statclock, so this is a
 *	noop; instead the statclock runs in hardclock.
 */
void
setstatclockrate(int rate)
{
}

/*
 * idle_block()
 *
 *	Called from the idle loop when we have nothing to do but wait
 *	for an interrupt.
 */
void
idle_block(void)
{

	KASSERT(curcpu()->ci_ipending == 0);
	HYPERVISOR_block();
}

/*
 * xen_rdtsc_fence()
 *
 *	Issue a {load, xen_rdtsc} before {load, xen_rdtsc} barrier.
 *
 *	In principle, this could be LFENCE if xen_rdtsc did RDTSCP.
 *	However, I'm not sure we can rely on RDTSCP in Xen, if we can
 *	use it at all.
 *
 *	On Intel CPUs, according to the manuals, LFENCE is enough to
 *	order RDTSC; on AMD CPUs, according to the AMD optimization
 *	guides, MFENCE is needed.  If there is ever a chance of
 *	migration to an AMD pCPU, it may be necessary to use MFENCE.
 */
static inline void
xen_rdtsc_fence(void)
{

	x86_mfence();
}

/*
 * xen_rdtsc()
 *
 *	Read the tsc after all instructions before the prior
 *	xen_rdtsc_fence() have completed.
 *
 *	In principle, this could be an RDTSCP, and xen_rdtsc_fence
 *	could be a noop.  However, I'm not sure we can rely on RDTSCP
 *	in Xen, if we can use it at all.
 */
static inline uint64_t
xen_rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));

	return ((uint64_t)hi << 32) | lo;
}

/*
 * struct xen_vcputime_ticket
 *
 *	State for a vCPU read section, during which a caller may read
 *	from fields of a struct vcpu_time_info and call xen_rdtsc.
 *	Caller must enter with xen_vcputime_enter, exit with
 *	xen_vcputime_exit, and be prepared to retry if
 *	xen_vcputime_exit fails.
 */
struct xen_vcputime_ticket {
	uint64_t	version;
};

/*
 * xen_vcputime_enter(tp)
 *
 *	Enter a vCPU time read section and store a ticket in *tp, which
 *	the caller must use with xen_vcputime_exit.  Return a pointer
 *	to the current CPU's vcpu_time_info structure.  Caller must
 *	already be bound to the CPU.
 */
static inline volatile struct vcpu_time_info *
xen_vcputime_enter(struct xen_vcputime_ticket *tp)
{
	volatile struct vcpu_time_info *vt = &curcpu()->ci_vcpu->time;

	while (__predict_false(1 & (tp->version = vt->version)))
		SPINLOCK_BACKOFF_HOOK;
	xen_rdtsc_fence();

	return vt;
}

/*
 * xen_vcputime_exit(vt, tp)
 *
 *	Exit a vCPU time read section with the ticket in *tp from
 *	xen_vcputime_enter.  Return true on success, false if caller
 *	must retry.
 */
static inline bool
xen_vcputime_exit(volatile struct vcpu_time_info *vt,
    struct xen_vcputime_ticket *tp)
{

	KASSERT(vt == &curcpu()->ci_vcpu->time);

	xen_rdtsc_fence();
	return tp->version == vt->version;
}

/*
 * xen_tsc_to_ns_delta(delta_tsc, mul_frac, shift)
 *
 *	Convert a difference in tsc units to a difference in
 *	nanoseconds given a multiplier and shift for the unit
 *	conversion.
 */
static inline uint64_t
xen_tsc_to_ns_delta(uint64_t delta_tsc, uint32_t tsc_to_system_mul,
    int8_t tsc_shift)
{
	uint32_t delta_tsc_hi, delta_tsc_lo;

	if (tsc_shift < 0)
		delta_tsc >>= -tsc_shift;
	else
		delta_tsc <<= tsc_shift;

	delta_tsc_hi = delta_tsc >> 32;
	delta_tsc_lo = delta_tsc & 0xffffffffUL;

	/* d*m/2^32 = (2^32 d_h + d_l)*m/2^32 = d_h*m + (d_l*m)/2^32 */
	return ((uint64_t)delta_tsc_hi * tsc_to_system_mul) +
	    (((uint64_t)delta_tsc_lo * tsc_to_system_mul) >> 32);
}

/*
 * xen_vcputime_systime_ns()
 *
 *	Return a snapshot of the Xen system time plus an adjustment
 *	from the tsc, in units of nanoseconds.  Caller must be bound to
 *	the current CPU.
 */
static uint64_t
xen_vcputime_systime_ns(void)
{
	volatile struct vcpu_time_info *vt;
	struct cpu_info *ci = curcpu();
	struct xen_vcputime_ticket ticket;
	uint64_t raw_systime_ns, tsc_timestamp, tsc, delta_tsc, delta_ns;
	uint64_t cached_delta_tsc, cached_delta_ns;

	/* We'd better be bound to the CPU in _some_ way.  */
	KASSERT(cpu_intr_p() ||
	    cpu_softintr_p() ||
	    kpreempt_disabled() ||
	    (curlwp->l_flag & LP_BOUND));

	/*
	 * Repeatedly try to read the system time, corresponding tsc
	 * timestamp, and tsc frequency until we get a consistent view.
	 */
	do {
		vt = xen_vcputime_enter(&ticket);

		/* Grab Xen's snapshot of raw system time and tsc.  */
		raw_systime_ns = vt->system_time;
		tsc_timestamp = vt->tsc_timestamp;

		/* Read the CPU's tsc.  */
		tsc = xen_rdtsc();

		/* Find how far the CPU's tsc has advanced.  */
		delta_tsc = __predict_false(tsc < tsc_timestamp) ? 0 :
		    tsc - tsc_timestamp;

		/*
		 * Convert the tsc delta to a nanosecond delta by Xen's
		 * current idea of how fast the tsc is counting.
		 */
		delta_ns = xen_tsc_to_ns_delta(delta_tsc,
		    vt->tsc_to_system_mul, vt->tsc_shift);

		/*
		 * Estimate how far, if at all, Xen's tsc timestamp
		 * retreated since we last read it, in nanoseconds.
		 */
		cached_delta_tsc =
		    __predict_false(tsc_timestamp <
			ci->ci_xen_last_tsc_timestamp)
		    ? ci->ci_xen_last_tsc_timestamp - tsc_timestamp : 0;
		cached_delta_ns = xen_tsc_to_ns_delta(cached_delta_tsc,
		    vt->tsc_to_system_mul, vt->tsc_shift);
	} while (!xen_vcputime_exit(vt, &ticket));

	/*
	 * Notify the console if the Xen hypervisor's raw system_time
	 * ran backwards.  This shouldn't happen because the Xen
	 * hypervisor is supposed to be smarter than that.
	 *
	 * Also notify the console if advance in raw system time is
	 * negated by a larger retreat in tsc.
	 */
	if (__predict_false(raw_systime_ns < ci->ci_xen_last_raw_systime_ns)) {
#if 0				/* XXX too noisy */
		printf("xen vcpu_time_info system_time ran backwards"
		    " %"PRIu64"ns\n",
		    ci->ci_xen_last_raw_systime_ns - raw_systime_ns);
#endif
	} else if (__predict_false((raw_systime_ns -
		    ci->ci_xen_last_raw_systime_ns)
		< cached_delta_ns)) {
		printf("xen system time advanced but tsc retreated more:"
		    " systime delta %"PRIu64"ns <"
		    " tsc timestamp delta %"PRIu64" = "
		    " %"PRIu64"ns\n",
		    raw_systime_ns - ci->ci_xen_last_raw_systime_ns,
		    cached_delta_tsc,
		    cached_delta_ns);
	}

	/*
	 * Notify the console if the CPU's tsc ran backwards.  This
	 * shouldn't happen because the CPU tsc isn't supposed to
	 * change, although maybe in cases of migration it will.
	 */
	if (__predict_false(tsc < ci->ci_xen_last_tsc)) {
#if 0				/* XXX too noisy */
		printf("xen cpu tsc ran backwards %"PRIu64"\n",
		    ci->ci_xen_last_tsc - tsc);
#endif
	}

	/*
	 * Notify the console if the CPU's tsc appeared to run behind
	 * Xen's idea of the tsc.  This shouldn't happen because the
	 * Xen hypervisor is supposed to have read the tsc _before_
	 * writing to the vcpu_time_info page, _before_ we read the
	 * tsc.  Further, if we switched pCPUs after reading the tsc
	 * timestamp but before reading the CPU's tsc, the hypervisor
	 * had better notify us by updating the version too and forcing
	 * us to retry the vCPU time read.
	 */
	if (__predict_false(tsc < tsc_timestamp)) {
		printf("xen cpu tsc %"PRIu64
		    " ran backwards from timestamp %"PRIu64
		    " by %"PRIu64"\n",
		    tsc, tsc_timestamp, tsc_timestamp - tsc);
	}

	/*
	 * Notify the console if the delta computation yielded a
	 * negative.  This doesn't make sense but I include it out of
	 * paranoia.
	 */
	if (__predict_false((int64_t)delta_ns < 0)) {
		printf("xen tsc delta in ns went negative: %"PRId64"\n",
		    (int64_t)delta_ns);
	}

	/*
	 * Notify the console if the addition will wrap around.
	 */
	if (__predict_false((raw_systime_ns + delta_ns) < raw_systime_ns)) {
		printf("xen systime + delta wrapped around:"
		    " %"PRIu64" + %"PRIu64" = %"PRIu64"\n",
		    raw_systime_ns, delta_ns, raw_systime_ns + delta_ns);
	}

	/* Remember the various timestamps.  */
	ci->ci_xen_last_raw_systime_ns = raw_systime_ns;
	ci->ci_xen_last_tsc_timestamp = tsc_timestamp;
	ci->ci_xen_last_tsc = tsc;

	/* We had better not have migrated CPUs.  */
	KASSERT(ci == curcpu());

	/* Add the delta to the raw system, in nanoseconds.  */
	return raw_systime_ns + delta_ns;
}

/*
 * xen_vcputime_raw_systime_ns()
 *
 *	Return a snapshot of the current Xen system time to the
 *	resolution of the Xen hypervisor tick, in units of nanoseconds.
 */
static uint64_t
xen_vcputime_raw_systime_ns(void)
{
	volatile struct vcpu_time_info *vt;
	struct xen_vcputime_ticket ticket;
	uint64_t raw_systime_ns;

	do {
		vt = xen_vcputime_enter(&ticket);
		raw_systime_ns = vt->system_time;
	} while (!xen_vcputime_exit(vt, &ticket));

	return raw_systime_ns;
}

/*
 * struct xen_wallclock_ticket
 *
 *	State for a wall clock read section, during which a caller may
 *	read from the wall clock fields of HYPERVISOR_shared_info.
 *	Caller must enter with xen_wallclock_enter, exit with
 *	xen_wallclock_exit, and be prepared to retry if
 *	xen_wallclock_exit fails.
 */
struct xen_wallclock_ticket {
	uint32_t version;
};

/*
 * xen_wallclock_enter(tp)
 *
 *	Enter a wall clock read section and store a ticket in *tp,
 *	which the caller must use with xen_wallclock_exit.
 */
static inline void
xen_wallclock_enter(struct xen_wallclock_ticket *tp)
{

	while (1 & (tp->version = HYPERVISOR_shared_info->wc_version))
		SPINLOCK_BACKOFF_HOOK;
	membar_consumer();
}

/*
 * xen_wallclock_exit(tp)
 *
 *	Exit a wall clock read section with the ticket in *tp from
 *	xen_wallclock_enter.  Return true on success, false if caller
 *	must retry.
 */
static inline bool
xen_wallclock_exit(struct xen_wallclock_ticket *tp)
{

	membar_consumer();
	return tp->version == HYPERVISOR_shared_info->wc_version;
}

/*
 * xen_wallclock_time(tsp)
 *
 *	Return a snapshot of the current low-resolution wall clock
 *	time, as reported by the hypervisor, in tsp.
 */
static void
xen_wallclock_time(struct timespec *tsp)
{
	struct xen_wallclock_ticket ticket;
	uint64_t systime_ns;

	/* Read the last wall clock sample from the hypervisor. */
	do {
		xen_wallclock_enter(&ticket);
		tsp->tv_sec = HYPERVISOR_shared_info->wc_sec;
		tsp->tv_nsec = HYPERVISOR_shared_info->wc_nsec;
	} while (!xen_wallclock_exit(&ticket));

	/* Get the global system time.  */
	systime_ns = xen_global_systime_ns();

	/* Add the system time to the wall clock time.  */
	systime_ns += tsp->tv_nsec;
	tsp->tv_sec += systime_ns / 1000000000ull;
	tsp->tv_nsec = systime_ns % 1000000000ull;
}

/*
 * xen_global_systime_ns()
 *
 *	Return a global monotonic view of the system time in
 *	nanoseconds, computed by the per-CPU Xen raw system time plus
 *	an rdtsc adjustment, and advance the view of the system time
 *	for all other CPUs.
 */
static uint64_t
xen_global_systime_ns(void)
{
	struct cpu_info *ci;
	uint64_t local, global, result;
	int bound;

	/*
	 * Find the local timecount on this CPU, and make sure it does
	 * not precede the latest global timecount witnessed so far by
	 * any CPU.  If it does, add to the local CPU's skew from the
	 * fastest CPU.
	 *
	 * XXX Can we avoid retrying if the CAS fails?
	 */
	bound = curlwp_bind();
	ci = curcpu();
	do {
		local = xen_vcputime_systime_ns();
		local += ci->ci_xen_systime_ns_skew;
		global = xen_global_systime_ns_stamp;
		if (__predict_false(local < global + 1)) {
			result = global + 1;
			ci->ci_xen_systime_ns_skew += global + 1 - local;
		} else {
			result = local;
		}
	} while (atomic_cas_64(&xen_global_systime_ns_stamp, global, result)
	    != global);
	curlwp_bindx(bound);

	return result;
}

/*
 * xen_get_timecount(tc)
 *
 *	Return the low 32 bits of a global monotonic view of the Xen
 *	system time.
 */
static unsigned
xen_get_timecount(struct timecounter *tc)
{

	KASSERT(tc == &xen_timecounter);

	return (unsigned)xen_global_systime_ns();
}

/*
 * xen_rtc_get(todr, tv)
 *
 *	Get the current real-time clock from the Xen wall clock time
 *	and vCPU system time adjustment.
 */
static int
xen_rtc_get(struct todr_chip_handle *todr, struct timeval *tvp)
{
	struct timespec ts;

	xen_wallclock_time(&ts);
	TIMESPEC_TO_TIMEVAL(tvp, &ts);

	return 0;
}

/*
 * xen_rtc_set(todr, tv)
 *
 *	Set the Xen wall clock time, if we can.
 */
static int
xen_rtc_set(struct todr_chip_handle *todr, struct timeval *tvp)
{
#ifdef DOM0OPS
	struct clock_ymdhms dt;
#if __XEN_INTERFACE_VERSION__ < 0x00030204
	dom0_op_t op;
#else
	xen_platform_op_t op;
#endif
	uint64_t systime_ns;

	if (xendomain_is_privileged()) {
		/* Convert to ymdhms and set the x86 ISA RTC.  */
		clock_secs_to_ymdhms(tvp->tv_sec, &dt);
		rtc_set_ymdhms(NULL, &dt);

		/* Get the global system time so we can preserve it.  */
		systime_ns = xen_global_systime_ns();

		/* Set the hypervisor wall clock time.  */
		op.u.settime.secs = tvp->tv_sec;
		op.u.settime.nsecs = tvp->tv_usec * 1000;
		op.u.settime.system_time = systime_ns;
#if __XEN_INTERFACE_VERSION__ < 0x00030204
		op.cmd = DOM0_SETTIME;
		return HYPERVISOR_dom0_op(&op);
#else
		op.cmd = XENPF_settime;
		return HYPERVISOR_platform_op(&op);
#endif
	}
#endif

	/* XXX Should this fail if not on privileged dom0?  */
	return 0;
}

/*
 * xen_delay(n)
 *
 *	Wait approximately n microseconds.
 */
void
xen_delay(unsigned n)
{
	int bound;

	/* Bind to the CPU so we don't compare tsc on different CPUs.  */
	bound = curlwp_bind();

	/* Short wait (<500us) or long wait?  */
	if (n < 500000) {
		/*
		 * Xen system time is not precise enough for short
		 * delays, so use the tsc instead.
		 *
		 * We work with the current tsc frequency, and figure
		 * that if it changes while we're delaying, we've
		 * probably delayed long enough -- up to 500us.
		 *
		 * We do not use cpu_frequency(ci), which uses a
		 * quantity detected at boot time, and which may have
		 * changed by now if Xen has migrated this vCPU to
		 * another pCPU.
		 */
		volatile struct vcpu_time_info *vt;
		struct xen_vcputime_ticket ticket;
		uint64_t tsc;
		uint32_t tsc_to_system_mul;
		int8_t tsc_shift;

		/* Get the starting tsc and tsc frequency.  */
		do {
			vt = xen_vcputime_enter(&ticket);
			tsc = xen_rdtsc();
			tsc_to_system_mul = vt->tsc_to_system_mul;
			tsc_shift = vt->tsc_shift;
		} while (!xen_vcputime_exit(vt, &ticket));

		/*
		 * Wait until as many tsc ticks as there are in n
		 * microseconds have elapsed.
		 */
		while (xen_tsc_to_ns_delta(xen_rdtsc() - tsc,
			tsc_to_system_mul, tsc_shift)/1000 < n)
			xen_rdtsc_fence();
	} else {
		/* Use the Xen system time.  */
		uint64_t start, end;

		/*
		 * Get the start and end times.
		 *
		 * Nanoseconds since boot takes centuries to overflow,
		 * so no need to worry about wrapping.  We do not
		 * bother with tsc adjustment for delays this long.
		 *
		 * XXX Do we ever need to issue delays this long?  That
		 * seems likely to be a bug.
		 */
		start = xen_vcputime_raw_systime_ns();
		end = start + 1000*(uint64_t)n;

		/* Wait until the system time has passed the end.  */
		do {
			HYPERVISOR_yield();
		} while (xen_vcputime_raw_systime_ns() < end);
	}

	/* Unbind from the CPU if we weren't already bound.  */
	curlwp_bindx(bound);
}

/*
 * xen_suspendclocks(ci)
 *
 *	Stop handling the Xen timer event on the CPU of ci.  Caller
 *	must be running on and bound to ci's CPU.
 *
 *	Actually, caller must have kpreemption disabled, because that's
 *	easier to assert at the moment.
 */
void
xen_suspendclocks(struct cpu_info *ci)
{
	int evtch;

	KASSERT(ci == curcpu());
	KASSERT(kpreempt_disabled());

	evtch = unbind_virq_from_evtch(VIRQ_TIMER);
	KASSERT(evtch != -1);

	hypervisor_mask_event(evtch);
	/* XXX sketchy function pointer cast */
	event_remove_handler(evtch, (int (*)(void *))xen_timer_handler, ci);

	aprint_verbose("Xen clock: removed event channel %d\n", evtch);
}

/*
 * xen_resumeclocks(ci)
 *
 *	Start handling the Xen timer event on the CPU of ci.  Caller
 *	must be running on and bound to ci's CPU.
 *
 *	Actually, caller must have kpreemption disabled, because that's
 *	easier to assert at the moment.
 */
void
xen_resumeclocks(struct cpu_info *ci)
{
	int evtch;

	KASSERT(ci == curcpu());
	KASSERT(kpreempt_disabled());

	evtch = bind_virq_to_evtch(VIRQ_TIMER);
	KASSERT(evtch != -1);

	/* XXX sketchy function pointer cast */
	event_set_handler(evtch, (int (*)(void *))xen_timer_handler,
	    ci, IPL_CLOCK, "clock");
	hypervisor_enable_event(evtch);

	aprint_verbose("Xen clock: using event channel %d\n", evtch);
}

/*
 * xen_timer_handler(cookie, regs)
 *
 *	Periodic Xen timer event handler for NetBSD hardclock.  Calls
 *	to this may get delayed, so we run hardclock as many times as
 *	we need to in order to cover the Xen system time that elapsed.
 *	After that, re-arm the timer to run again at the next tick.
 */
static int
xen_timer_handler(void *cookie, struct intrframe *regs)
{
	struct cpu_info *ci = cookie;
	uint64_t last, now, delta, next;
	int error;

	KASSERT(cpu_intr_p());
	KASSERT(ci == curcpu());

again:
	/*
	 * Find how many nanoseconds of Xen system time have elapsed
	 * since the last hardclock tick.
	 */
	last = ci->ci_xen_hardclock_systime_ns;
	now = xen_vcputime_systime_ns();
	if (now < last) {
		printf("xen systime ran backwards in hardclock %"PRIu64"ns\n",
		    last - now);
		now = last;
	}
	delta = now - last;

#if 1
	if (delta >= NS_PER_TICK) {
		delta -= NS_PER_TICK;
		if (delta >= NS_PER_TICK) {
			printf("missed %"PRId64" hardclock ticks\n",
			    delta/NS_PER_TICK);
		}
		ci->ci_xen_hardclock_systime_ns = now;
		ci->ci_xen_systime_delta_ns = 0;
		hardclock((struct clockframe *)regs);
		ci->ci_xen_hardclock_evcnt.ev_count++;
	}
#else
	/*
	 * Run the hardclock timer as many times as necessary.  We
	 * maintain the charade that the Xen system time is as if we
	 * ticked every NS_PER_TICK nanoseconds exactly, by setting
	 * ci->ci_xen_systime_delta_ns to the current delta from the
	 * theoretical hardclock tick system time and the current
	 * system time.
	 */
	while (delta >= NS_PER_TICK) {
		ci->ci_xen_hardclock_systime_ns += NS_PER_TICK;
		ci->ci_xen_systime_delta_ns = (delta -= NS_PER_TICK);
		hardclock((struct clockframe *)regs);
		ci->ci_xen_hardclock_evcnt.ev_count++;
	}
#endif

	/*
	 * Re-arm the timer.  If it fails, it's probably because the
	 * time is in the past, so update our idea of what the Xen
	 * system time is and try again.
	 */
	next = ci->ci_xen_hardclock_systime_ns + NS_PER_TICK;
	error = HYPERVISOR_set_timer_op(next);
	if (error)
		goto again;

	/*
	 * Done with the charade about the Xen system time.  Restore
	 * the Xen system time delta to zero.
	 */
	ci->ci_xen_systime_delta_ns = 0;

	/* Success!  */
	return 0;
}

/*
 * xen_initclocks()
 *
 *	Initialize the Xen clocks on the current CPU.
 */
void
xen_initclocks(void)
{
	struct cpu_info *ci = curcpu();
	int error;

	/* If this is the primary CPU, do global initialization first.  */
	if (ci == &cpu_info_primary) {
		/* Initialize the systemwide Xen timecounter.  */
		tc_init(&xen_timecounter);

#ifdef DOM0OPS
		/*
		 * If this is a privileged dom0, start pushing the wall
		 * clock time back to the Xen hypervisor.
		 */
		if (xendomain_is_privileged())
			xen_timepush_init();
#endif
	}

	/* Pretend the last hardclock happened right now.  */
	ci->ci_xen_hardclock_systime_ns = xen_vcputime_systime_ns();
	ci->ci_xen_systime_delta_ns = 0;

	/* Attach the hardclock event counter.  */
	evcnt_attach_dynamic(&ci->ci_xen_hardclock_evcnt, EVCNT_TYPE_INTR,
	    NULL, device_xname(ci->ci_dev), "hardclock");

	/* Disarm the periodic timer on Xen>=3.1 which is allegedly buggy.  */
	if (XEN_MAJOR(xen_version) > 3 || XEN_MINOR(xen_version) > 0) {
		error = HYPERVISOR_vcpu_op(VCPUOP_stop_periodic_timer,
		    ci->ci_cpuid, NULL);
		KASSERT(error == 0);
	}

	/* Arm the timer.  */
	error = HYPERVISOR_set_timer_op(ci->ci_xen_hardclock_systime_ns +
	    NS_PER_TICK);
	KASSERT(error == 0);

	/* Fire up the clocks.  */
	xen_resumeclocks(ci);
}

#ifdef DOM0OPS

/*
 * xen_timepush_init()
 *
 *	Initialize callout to periodically set Xen hypervisor's wall
 *	clock time.
 */
static void
xen_timepush_init(void)
{
	struct sysctllog *log = NULL;
	const struct sysctlnode *node = NULL;
	int error;

	/* Start periodically updating the hypervisor's wall clock time.  */
	callout_init(&xen_timepush.ch, 0);
	callout_setfunc(&xen_timepush.ch, xen_timepush_intr, NULL);

	/* Pick a default frequency for timepush.  */
	xen_timepush.ticks = 53*hz + 3; /* avoid exact # of min/sec */

	/* Create machdep.xen node.  */
	/* XXX Creation of the `machdep.xen' node should be elsewhere.  */
	error = sysctl_createv(&log, 0, NULL, &node, 0,
	    CTLTYPE_NODE, "xen",
	    SYSCTL_DESCR("Xen top level node"),
	    NULL, 0, NULL, 0,
	    CTL_MACHDEP, CTL_CREATE, CTL_EOL);
	if (error)
		goto fail;
	KASSERT(node != NULL);

	/* Create int machdep.xen.timepush_ticks knob.  */
	error = sysctl_createv(&log, 0, NULL, NULL, CTLFLAG_READWRITE,
	    CTLTYPE_INT, "timepush_ticks",
	    SYSCTL_DESCR("How often to update the hypervisor's time-of-day;"
		" 0 to disable"),
	    sysctl_xen_timepush, 0, &xen_timepush.ticks, 0,
	    CTL_CREATE, CTL_EOL);
	if (error)
		goto fail;

	/* Start the timepush callout.  */
	callout_schedule(&xen_timepush.ch, xen_timepush.ticks);

	/* Success!  */
	return;

fail:	sysctl_teardown(&log);
}

/*
 * xen_timepush_intr(cookie)
 *
 *	Callout interrupt handler to push NetBSD's idea of the wall
 *	clock time, usually synchronized with NTP, back to the Xen
 *	hypervisor.
 */
static void
xen_timepush_intr(void *cookie)
{

	resettodr();
	if (xen_timepush.ticks)
		callout_schedule(&xen_timepush.ch, xen_timepush.ticks);
}

/*
 * sysctl_xen_timepush(...)
 *
 *	Sysctl handler to set machdep.xen.timepush_ticks.
 */
static int
sysctl_xen_timepush(SYSCTLFN_ARGS)
{
	struct sysctlnode node;
	int ticks;
	int error;

	ticks = xen_timepush.ticks;
	node = *rnode;
	node.sysctl_data = &ticks;
	error = sysctl_lookup(SYSCTLFN_CALL(&node));
	if (error || newp == NULL)
		return error;

	if (ticks < 0)
		return EINVAL;

	if (ticks != xen_timepush.ticks) {
		xen_timepush.ticks = ticks;

		if (ticks == 0)
			callout_stop(&xen_timepush.ch);
		else
			callout_schedule(&xen_timepush.ch, ticks);
	}

	return 0;
}

#endif	/* DOM0OPS */

