So, I now know why we want to use "dom0_vcpus_pin=true" w.r.t. timekeeping!
I updated xen_clock.c to 1.18 and turned on XEN_CLOCK_DEBUG (and then
commented out one of the super-noisy device_printf() calls that actually
caused the system to hang).  I then started seeing thousands of printfs
like the following, but only on dom0, and only on the one machine where
I didn't have dom0's CPUs pinned:
[ 83329.4245423] xen raw systime + tsc delta went backwards: 82591317579681 > 82591299251748
[ 83329.4245423] raw_systime_ns=82590641756625
[ 83329.4245423] tsc_timestamp=233578790859082
[ 83329.4245423] tsc=233580649104491
[ 83329.4245423] tsc_to_system_mul=3039340271
[ 83329.4245423] tsc_shift=-1
[ 83329.4245423] delta_tsc=1858245409
[ 83329.4245423] delta_ns=657495123
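The numbers in that report can be cross-checked by hand.  Below is a
rough sketch in Python (not the code from xen_clock.c) that assumes the
usual Xen vcpu_time_info scaling rule: shift the TSC delta by tsc_shift,
multiply by tsc_to_system_mul, and drop the low 32 bits.  The function
name scale_tsc_delta() is made up for this illustration.

```python
# Cross-check of the debug output above.  Assumes the usual Xen
# vcpu_time_info scaling rule; scale_tsc_delta() is a made-up name,
# not a function from xen_clock.c.

def scale_tsc_delta(delta_tsc, mul, shift):
    """Convert a TSC delta to nanoseconds.

    mul is treated as a 32-bit fixed-point fraction, and shift is a
    signed power-of-two pre-scale applied to the delta first.
    """
    if shift < 0:
        delta_tsc >>= -shift
    else:
        delta_tsc <<= shift
    return (delta_tsc * mul) >> 32

# Values copied from the debug printf.
raw_systime_ns = 82590641756625
tsc_timestamp = 233578790859082
tsc = 233580649104491
tsc_to_system_mul = 3039340271
tsc_shift = -1

delta_tsc = tsc - tsc_timestamp
delta_ns = scale_tsc_delta(delta_tsc, tsc_to_system_mul, tsc_shift)
now_ns = raw_systime_ns + delta_ns

print("delta_tsc =", delta_tsc)
print("delta_ns  =", delta_ns)
print("systime   =", now_ns)
```

This reproduces the delta_tsc and delta_ns fields, and the sum is the
82591299251748 on the right-hand side of the "went backwards" line, so
the new reading is about 18ms behind the previous one.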
Make that hundreds of thousands in less than a day:
# uptime
4:33PM up 23:26, 2 users, load averages: 0.08, 0.02, 0.01
vcpu0 raw systime went backwards       395276    4 intr
vcpu0 missed hardclock                 423534    5 intr
vcpu0 timecounter went backwards       242583    2 intr
vcpu1 raw systime went backwards       261025    3 intr
vcpu1 missed hardclock                 462819    5 intr
vcpu1 timecounter went backwards       256918    3 intr
Time also drifted; ntpq shows the offset at about -725 seconds:
# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 xentastic.local 192.75.191.16    3 u  105  256  377    0.469  -724851 6055.97
So I pinned them at runtime:
# xl vcpu-list Domain-0
Name         ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0      0     0     3   r--     806.0  all / all
Domain-0      0     1     2   -b-     715.6  all / all
# xl vcpu-pin 0 0 0
# xl vcpu-pin 0 1 1
# xl vcpu-list Domain-0
Name         ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0      0     0     0   -b-     807.9  0 / all
Domain-0      0     1     1   r--     716.6  1 / all
And voila!  Instantly no more "raw systime went backwards" events!
Also ntpd is again able to hold the clock stable (after a reset step by
ntpdate).
I thought this might be because there's no way (that I know of) to set
the tsc_mode for dom0.  However, the tsc_to_system_mul shown in the
debug printf is about what it should be to round down to 1GHz on this
machine, so it seems RDTSC must be emulated.
I guess the RDTSC emulation must not be stable across CPUs?  Or is it
something else?
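For reference, the tick rate implied by tsc_to_system_mul and tsc_shift
can be worked out directly from the debug output.  This is only a
sketch, assuming the usual rule that nanoseconds are the shifted tick
count times tsc_to_system_mul, divided by 2^32; implied_tsc_hz() is a
made-up helper name.

```python
# Implied TSC rate from the scaling fields in the debug output.
# Assumes ns = ((ticks shifted by tsc_shift) * tsc_to_system_mul) >> 32;
# implied_tsc_hz() is a made-up helper name.
from fractions import Fraction

def implied_tsc_hz(mul, shift):
    """Return the tick rate (Hz) at which one second of ticks scales to 1e9 ns."""
    # Nanoseconds represented by a single tick, kept exact with Fraction.
    ns_per_tick = Fraction(mul, 1 << 32) * Fraction(2) ** shift
    return float(Fraction(10**9) / ns_per_tick)

# tsc_to_system_mul and tsc_shift as printed above.
hz = implied_tsc_hz(3039340271, -1)
print("implied TSC rate: %.3f GHz" % (hz / 1e9))
```

With the values above this comes out at roughly 2.83 GHz, which is the
figure to compare against the machine's nominal CPU clock and against
the 1GHz mentioned above.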
Now I wait some days again to see whether domU clocks still begin to
drift after ~7.5 days of uptime and, if they do, whether the newest
xen_clock.c gives me any more clues as to why.
--
Greg A. Woods <gwoods%acm.org@localhost>
Kelowna, BC +1 250 762-7675 RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost> Avoncote Farms <woods%avoncote.ca@localhost>