Subject: Re: Strange numbers from gettimeofday(2)
To: Jed Davis <jdev@panix.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-xen
Date: 01/22/2006 23:21:26
--2fHTh5uZTiUOsy+g
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi,

On Fri, Jan 13, 2006 at 11:21:58PM -0500, Jed Davis wrote:
> It's worse than that, actually.  The problem AFAICT is in the MI
> cc_microtime (in kern_microtime.c):
> 
>         t.tv_usec += (cc * ci->ci_cc_ms_delta) / ci->ci_cc_denom;
> 	while (t.tv_usec >= 1000000) {
> 		t.tv_usec -= 1000000;
> 		t.tv_sec++;
> 	}
> 
> It looks like the RHS of the += winds up as complete garbage, and
> sometimes the mutiply overflows and winds up making tv_usec negative.
> [...]
> Oh, of course: It will *definitely* break if the domain is paused or
> in ddb for more than a second, because xen_timer_handler will call
> cc_microset multiple times, passing it time(9) each time, which is
> dutifully being advanced by the stacked-up calls to hardclock(9).  So
> the second time cc_microset sees that more or less a second has
> passed, but very few CPU cycles have; it (if I read it correctly) then
> estimates a very low CPU speed, and for the next second of real time
> cc_microtime gets absurdly large values which overflow and cause
> negativeness.

Yes, in this case ci_cc_denom is small, and so the result can
be very large.

> Now, &cc_microtime gets set as microtime_func because a TSC is
> detected in arch/xen/i386/identcpu.c, though I don't know that Xen
> will work on anything without a TSC; my first thought was that it

There would eventually be the Cyrix CPUs which have a broken TSC; I don't
think Xen can reasonably work on such a CPU.


> might be desirable to disable that and always use xen_microtime.
> 
> However: xen_microtime would need to call get_tsc_offset_ns() to be of
> any use, and really shouldn't need to call get_time_values_from_xen()
> as long as the timer event handler does, since (according to the docs)
> a timer event will be asserted whenever a domain becomes scheduled.

After more thoughs, cc_microtime() is fine for Xen, exept that it expects
ci_cc.cc_cc contains counter_cpu32() (see my post to tech-kern).
The problem is cc_microset(), which expects to be called exactly every second.

In the attached patch, I implemented a microset which should work as expected
for Xen. I tested even after a break in ddb of several 10s of seconds,
it just resets itself and the time isn't corrupted. However, my test systems
have low CPU speed (< 400Mhz), so It's much less likely to overflow a 32bit
cycle counter than on a 3Ghz CPU. Can you test it and see if it works for
you ? Note thar there is a patch for kern_microtime.c too.

> 
> Problem 1: It's still completely ignoring time(9), which is very wrong
> AIUI; and resettodr() is a no-op, which makes it worse.

AFAIK, time(9) is properly handled by hardclock(), and resettodr() can't
be anything but a no-op on a domU.

> 
> Problem 2: I tried this, and the shadow_tv was 42 seconds fast and
> drifting slowly but noticeably forward; the dom0 was running ntpdate
> from cron and was on time.  (Hm... the DOM0_SETTIME is called only
> when resettodr() is, so if the drift is small enough for ntpdate to
> always use adjtime(2), it won't ever correct Xen's time?)

Maybe not, until the system is rebooted. But I'm not sure it's desireable to
correct the time this way. This would make the time go sometime backward
in the domU.
I checked Xen3, there isn't a way to correct the clock drift in the
hypervisor. We can just set the time to a new value.

> 
> Problem 3: yamt mentioned on ICB that the tsc_timestamp in the
> shared_info page might not be right for the current CPU; I haven't
> checked on what Xen actually does here yet, but it seems to me that
> that would be a bug, as the timestamp is useless if it's from
> undefined physical CPU that we can't access.

Hum, at last for Xen3, tsc_timestamp is updated just before the context
switch, so it should be OK. Xen2 looks similar.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--

--2fHTh5uZTiUOsy+g
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff2

Index: kern/kern_microtime.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_microtime.c,v
retrieving revision 1.15
diff -u -r1.15 kern_microtime.c
--- kern/kern_microtime.c	11 Dec 2005 12:24:29 -0000	1.15
+++ kern/kern_microtime.c	22 Jan 2006 20:11:47 -0000
@@ -112,7 +112,7 @@
 		 */
 		simple_lock(&microtime_slock);
 		t = ci->ci_cc_time;
-		cc = cpu_counter32() - ci->ci_cc_cc;
+		cc = cpu_counter32() - (ci->ci_cc_cc & 0xffffffff);
 		if (cc < 0)
 			cc += 0x100000000LL;
 		t.tv_usec += (cc * ci->ci_cc_ms_delta) / ci->ci_cc_denom;
Index: arch/xen/i386/machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/xen/i386/machdep.c,v
retrieving revision 1.24
diff -u -r1.24 machdep.c
--- arch/xen/i386/machdep.c	15 Jan 2006 22:09:51 -0000	1.24
+++ arch/xen/i386/machdep.c	22 Jan 2006 20:11:47 -0000
@@ -289,7 +289,7 @@
 void (*initclock_func)(void) = i8254_initclocks;
 #else
 void (*delay_func)(int) = xen_delay;
-void (*microtime_func)(struct timeval *) = xen_microtime;
+void (*microtime_func)(struct timeval *) = cc_microtime;
 void (*initclock_func)(void) = xen_initclocks;
 #endif
 
Index: arch/xen/include/cpu.h
===================================================================
RCS file: /cvsroot/src/sys/arch/xen/include/cpu.h,v
retrieving revision 1.11
diff -u -r1.11 cpu.h
--- arch/xen/include/cpu.h	24 Dec 2005 20:07:48 -0000	1.11
+++ arch/xen/include/cpu.h	22 Jan 2006 20:11:47 -0000
@@ -386,7 +386,6 @@
 #else
 void	startrtclock(void);
 void	xen_delay(int);
-void	xen_microtime(struct timeval *);
 void	xen_initclocks(void);
 #endif
 
Index: arch/xen/xen/clock.c
===================================================================
RCS file: /cvsroot/src/sys/arch/xen/xen/clock.c,v
retrieving revision 1.16
diff -u -r1.16 clock.c
--- arch/xen/xen/clock.c	15 Jan 2006 22:09:52 -0000	1.16
+++ arch/xen/xen/clock.c	22 Jan 2006 20:11:47 -0000
@@ -52,13 +52,13 @@
 #include "config_time.h"		/* for CONFIG_TIME */
 
 static int xen_timer_handler(void *, struct intrframe *);
+static uint64_t get_tsc_offset_ns(void);
+
+#define DEBUG_CLOCK
 
 /* These are peridically updated in shared_info, and then copied here. */
 volatile static uint64_t shadow_tsc_stamp;
 volatile static uint64_t shadow_system_time;
-#ifndef XEN3
-volatile static unsigned long shadow_time_version;
-#endif
 volatile static struct timeval shadow_tv;
 
 static int timeset;
@@ -95,6 +95,7 @@
 	    (tversion != HYPERVISOR_shared_info->wc_version));
 	shadow_tv.tv_usec = shadow_tv.tv_usec / 1000;
 #else /* XEN3 */
+	unsigned long shadow_time_version;
 	do {
 		shadow_time_version = HYPERVISOR_shared_info->time_version2;
 		__insn_barrier();
@@ -108,13 +109,13 @@
 }
 
 static uint64_t
-get_tsc_offset_ns(void)
+get_tsc_offset_ns()
 {
-	uint32_t tsc_delta;
+	uint64_t tsc_delta;
 	struct cpu_info *ci = curcpu();
 
-	tsc_delta = cpu_counter32() - shadow_tsc_stamp;
-	return tsc_delta * 1000000000 / cpu_frequency(ci);
+	tsc_delta = cpu_counter() - shadow_tsc_stamp;
+	return (tsc_delta * 1000000000ULL) / cpu_frequency(ci);
 }
 
 void
@@ -141,8 +142,11 @@
 #ifdef DEBUG_CLOCK
 	printf("readclock: %ld (%ld)\n", time.tv_sec, base);
 #endif
-	/* reset microset, so that the next call to microset() will init */
-	ci->ci_cc.cc_denom = 0;
+	/* initialize microset */
+	ci->ci_cc.cc_denom = cpu_frequency(ci);
+	cc_microset_time = ci->ci_cc.cc_time = time;
+	ci->ci_cc.cc_ms_delta = 1000000;
+	ci->ci_cc.cc_cc = cpu_counter();
 
 	if (base != 0 && base < time.tv_sec - 5*SECYR)
 		printf("WARNING: file system time much less than clock time\n");
@@ -216,17 +220,18 @@
 void
 xen_delay(int n)
 {
-	if (n < 500000) {
+	if (n < (10000000 / hz)) { /* less than 10 clock ticks */
 		/*
-		 * shadow_system_time is updated every hz tick, it's not
-		 * precise enouth for short delays. Use the CPU counter
-		 * instead. We assume it's working at this point.
+		 * shadow_system_time is updated at best every hz tick,
+		 * it's not precise enouth for short delays. Use the CPU
+		 * counter instead. We assume it's working at this point.
 		 */
-		u_int64_t cc, cc2, when;
+		u_int64_t cc, cc2, when, ticks_per_us;
 		struct cpu_info *ci = curcpu();
 
+		ticks_per_us = cpu_frequency(ci) / 1000000LL;
 		cc = cpu_counter();
-		when = cc + (u_int64_t)n * cpu_frequency(ci) / 1000000LL;
+		when = cc + (u_int64_t)n * ticks_per_us;
 		if (when < cc) {
 			/* wait for counter to wrap */
 			cc2 = cpu_counter();
@@ -239,26 +244,28 @@
 		
 		return;
 	} else {
-		uint64_t when;
+		uint64_t now, when;
 
-		/* for large delays, shadow_system_time is OK */
+		/*
+		 * for large delays, we can use shadow_system_time and
+		 * setup a timer so that the CPU can eventually be used by
+		 * other domains. Note that we can't use set_timer/block
+		 * because interrupts may be disabled.
+		 */
 		get_time_values_from_xen();
-		when = shadow_system_time + n * 1000;
-		while (shadow_system_time < when)
+		now = shadow_system_time + get_tsc_offset_ns();
+		when = now + n * 1000;
+		while (now < when) {
+			HYPERVISOR_yield();
+			__insn_barrier();
 			get_time_values_from_xen();
+			now = shadow_system_time + get_tsc_offset_ns();
+			__insn_barrier();
+		}
 	}
 }
 
 void
-xen_microtime(struct timeval *tv)
-{
-	int s = splclock();
-	get_time_values_from_xen();
-	*tv = shadow_tv;
-	splx(s);
-}
-
-void
 xen_initclocks()
 {
 	int evtch = bind_virq_to_evtch(VIRQ_TIMER);
@@ -277,33 +284,64 @@
 xen_timer_handler(void *arg, struct intrframe *regs)
 {
 	int64_t delta;
-	static int microset_iter = 0; /* call cc_microset once/sec */
+	static int microset_iter = 0; /* do microset about once/sec */
 	struct cpu_info *ci = curcpu();
-	
+
 	get_time_values_from_xen();
 
 	delta = (int64_t)(shadow_system_time + get_tsc_offset_ns() -
-			  processed_system_time);
+		  	processed_system_time);
+
 	while (delta >= NS_PER_TICK) {
-		if (ci->ci_feature_flags & CPUID_TSC) {
-			if (
 #if defined(MULTIPROCESSOR)
-		 	   CPU_IS_PRIMARY(ci) &&
+		if (CPU_IS_PRIMARY(ci))
 #endif
-			    (microset_iter--) == 0) {
-				microset_iter = hz - 1;
-#if defined(MULTIPROCESSOR)
-				x86_broadcast_ipi(X86_IPI_MICROSET);
-#endif
-				cc_microset_time = time;
-				cc_microset(ci);
-			}
-		}
+				microset_iter--;
 		hardclock((struct clockframe *)regs);
 		delta -= NS_PER_TICK;
 		processed_system_time += NS_PER_TICK;
 	}
 
+	/*
+	 * We can't use cc_microset here because it expect to be called
+	 * exactly once per second, which isn't exactly true for Xen.
+	 * The code below is an approximation which should work.
+	 */
+	if (microset_iter <= 0) {
+		struct timeval *t = &cc_microset_time;
+		int64_t denom, diff;
+
+		cc_microset_time = time;
+		microset_iter = hz - 1;
+#if defined(MULTIPROCESSOR)
+		x86_broadcast_ipi(X86_IPI_MICROSET);
+#endif
+		if (ci->ci_cc.cc_denom == 0) /* not ready yet */
+			return 0;
+
+		denom = ci->ci_cc.cc_cc;
+		ci->ci_cc.cc_cc = cpu_counter();
+
+		denom = ci->ci_cc.cc_cc - denom;
+		delta = (t->tv_sec - ci->ci_cc.cc_time.tv_sec) * 1000000 +
+		    (t->tv_usec - ci->ci_cc.cc_time.tv_usec);
+		ci->ci_cc.cc_time = cc_microset_time;
+
+		/* make sure values are not completely false */
+		diff = (denom * 1000000LL) / cpu_frequency(ci) /* us from cc */
+		    - delta; /* us from clock */
+		if (diff > -500000 && diff < 500000) {
+			ci->ci_cc.cc_ms_delta = delta;
+			ci->ci_cc.cc_denom = denom;
+		} else {
+			printf("microset[%lu]: diff  %" PRId64
+			    ", resetting state\n", (u_long)ci->ci_cpuid,
+			    diff);
+			ci->ci_cc.cc_ms_delta = 1000000;
+			ci->ci_cc.cc_denom = cpu_frequency(ci);
+		}
+	}
+
 	return 0;
 }
 
@@ -324,4 +362,6 @@
 	 */
 	if (HYPERVISOR_set_timer_op(processed_system_time + NS_PER_TICK) == 0)
 		HYPERVISOR_block();
+	else
+		printf("idle_block: set_timer_op failed\n");
 }

--2fHTh5uZTiUOsy+g--