tech-kern archive


Re: Understanding PR kern/43997 (kernel timing problems / qemu)



    Date:        Sun, 30 Jul 2017 16:04:38 -0000 (UTC)
    From:        mlelstv%serpens.de@localhost (Michael van Elst)
    Message-ID:  <oll02b$ikd$1%serpens.de@localhost>

  | There are slower emulated systems that don't have these issues. (*)

Yes, it was becoming (really, always was) obvious that qemu's execution
speed is not the cause.

  | If the host misses interrupts, time in the guest just passes slower
  | than real-time. But inside the guest it is consistent.

If we could achieve that (which changing the timecounter in qemu
apparently achieves) it would at least make the world become rational.
Of course, keeping the timing running faster would be better - if we were
able to get to a state where the client/guest were actually able to talk
to the outside world (that part is easy) and run NTP, and act as a time
server that others could trust, that would be ideal.

  | This is not to be confused with the kernel idea of wall-clock time
  | (i.e. what date reports). wall-clock time is usually maintained
  | by hardware separated from the interrupt timers. The 'date; sleep 5; date'
  | sequence therefore can show that 10 seconds passed.

But that is totally broken.   While there is no guarantee that a sleep will
wake up after exactly the time requested, it should be as close as is
reasonably possible - and on an unloaded system, where there is sufficient
RAM, and nothing swapped out, and nothing competing for cpu cycles, that
sequence should (always) show that something between 5 and a little over 5
seconds has passed.   If the cpu is busy, or things are getting swapped/paged
out, then we can expect things to run slower (not only for processes waiting
upon timer signals, but for everything), and that's acceptable.

But otherwise, inconsistent timing is not acceptable.   All kinds of
applications (including network protocols) require time to be kept in a
way that is at least close to what others observe, even if not identical.
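
For anyone who wants to put a number on that, a trivial userland test
(standard POSIX calls, nothing NetBSD specific) is enough - it just measures
what the wall clock says across a 5 second nanosleep:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec req = { 5, 0 };     /* ask to sleep for 5 seconds */
        struct timespec a, b;

        clock_gettime(CLOCK_REALTIME, &a);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_REALTIME, &b);

        printf("wall clock says the sleep took %.3f s\n",
            (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        return 0;
    }

On an idle native system that should print 5.0 and a bit; in the affected
qemu guests it reports nearer 10.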

One easy (poor) fix is simply to do as used to be done, and have kernel
wall clock time maintained by the tick interrupt - that makes things
consistent, but without any real expectation of accuracy.  The alternative
is to make the tick counts depend upon the external wall clock time source,
so they keep in sync - much the same as the power companies do with
frequency: over any short period, the nominal 50/60 Hz frequency can drift
around a lot, but measured over any reasonable period it is highly accurate
(which is why old AC frequency based tick systems used to have very good
long term time stability, provided they never lost clock interrupts).
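
To be concrete about the first (poor) option: if the wall clock is derived
purely from the tick count, then date and sleep are counting the very same
interrupts, so they always agree with each other - they are simply both
wrong together when ticks are lost.  A userland caricature of that (the
names hardclock_tick() and the boottime value here are just stand-ins for
illustration, not the real kernel interfaces):

    #include <stdio.h>
    #include <stdint.h>

    #define HZ 100

    static uint64_t ticks;                 /* bumped by each clock interrupt */
    static uint64_t boottime = 1000000000; /* seconds, read once from the RTC */

    static void hardclock_tick(void) { ticks++; }

    /* wall clock derived purely from the tick count */
    static uint64_t tick_time(void) { return boottime + ticks / HZ; }

    int main(void) {
        /* deliver 5 seconds worth of ticks - however long the hardware
           really took to produce them (perhaps 10 real seconds) */
        for (int i = 0; i < 5 * HZ; i++)
            hardclock_tick();
        printf("tick clock advanced %llu s\n",
            (unsigned long long)(tick_time() - boottime));
        return 0;
    }

Everything that counts those ticks sees the same 5 seconds, so the system
is self-consistent - just slow against the outside world.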

  | The problem with qemu is that it's running on a NetBSD host and
  | therefore cannot issue interrupts based on host time unless the
  | host has a larger HZ value.

In the system of most interest, the host, and the guest, are the exact
same system (the exact same binary kernel) - unless we alter the config
of one of them explicitly to avoid this issue, they cannot help but have
the same HZ value.

As long as the emulated qemu client has access to a reasonably accurate ToD
value (which it obviously does, as the host's time is available to qemu, and
can be - and, it seems, is - made available to the guest) there's no reason
at all the guest cannot produce the correct number of ticks.

And doing so (since it is just a generic NetBSD) would solve the similar,
but less blatant issue for any other system using ticks, where the occasional
clock interrupt might get lost, and where there is some other ToD source
available.

  | With host and guest running at HZ=100, it's obvious that interrupts
  | mostly come just too late and require two ticks on the host, thus
  | slowing down guest time by a factor of two.

Yes, that is a very good explanation for the observed behaviour, and I
cannot help but be grateful that simply beginning to discuss this issue
has provided so many insights into what is happening, and what we can do
to fix things.
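
To put numbers on that explanation, here is a throwaway simulation (nothing
qemu specific in it - it simply rounds each timer expiry up to the following
host tick, with the re-arm happening just after a tick, as described):

    #include <stdio.h>

    int main(void) {
        const double host_tick = 1.0 / 100;  /* HZ=100 on the host: 10ms */
        const double guest_tick = 1.0 / 100; /* and the same in the guest */
        const double eps = 1e-4;   /* qemu re-arms just after a host tick */
        double now = eps;          /* host time */
        int n = 500;               /* deliver 5 guest seconds of ticks */

        for (int i = 0; i < n; i++) {
            double due = now + guest_tick;      /* guest tick is due... */
            int k = (int)(due / host_tick) + 1; /* ...but only delivered */
            now = k * host_tick + eps;          /* at the next host tick */
        }
        printf("%d guest ticks (%g s) took %.4f s of host time\n",
            n, n * guest_tick, now);
        return 0;
    }

That prints 500 guest ticks (5 seconds of guest time) taking a shade over
10 seconds of host time - the observed factor of two.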

When there is no alternative to tick interrupts, we can, and do, use
those to measure time, and everything works - except that if the ticks are
not received at the expected rate, time keeping drifts away from real time
(invisibly, when considered only from within the system.)

When there is some better measure of real time available, we can use it to
keep all time keeping better synchronised, regardless of whether the system
is "tickless" or still tick based - it isn't required that every single tick
be 1/HZ apart (they never are, precisely, anyway), just that over the long
term (which in computing is a half second or so) the correct number of ticks
have occurred.
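
A sketch of the kind of correction I mean (a userland approximation:
CLOCK_MONOTONIC stands in for whatever trustworthy time source is available,
and hardclock() for the real clock interrupt handler - in the kernel the
interrupt would supply most of the ticks, and the resync would only make up
the deficit):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define HZ 100

    static uint64_t ticks;                   /* what hardclock() maintains */
    static void hardclock(void) { ticks++; } /* stand-in for the real one */

    /* nanoseconds from the trusted time source */
    static uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
    }

    int main(void) {
        uint64_t t0 = now_ns();
        struct timespec half = { 0, 500000000 };

        for (int i = 0; i < 10; i++) {
            nanosleep(&half, NULL);  /* resync every half second or so */

            /* how many ticks should have occurred by now?  run
               hardclock() for any that are missing */
            uint64_t due = (now_ns() - t0) / (1000000000 / HZ);
            while (ticks < due)
                hardclock();
            printf("ticks %llu, expected %llu\n",
                (unsigned long long)ticks, (unsigned long long)due);
        }
        return 0;
    }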

I think it should be possible to make that happen, and that is what I am
going to see if I can do.   Then we can see if we can find a (good enough)
way to make nanosleep() less ticky - whether by giving up on ticks
altogether (which is probably not the best solution - even if we don't
use ticks for timing, we'd end up emulating them for other things, if only
to avoid needing to rewrite too much of the kernel in one step) or by
implementing some other mechanism (perhaps interrupts from a short term
timer not used for time calculations at all, for very short delays only -
I have no idea, yet).

kre

ps: I simply do not care if there is (or could be) a much better fix for
these issues than the type I am considering - if someone implements that
before I am done, that's great, and we have made even more progress than
expected.   If not, they're still free to implement it after, and in the
interim, we will have a system that is better than we have now, and then
perhaps later, even better still.   What this means is that I will totally
ignore "it should be done a better way than that" arguments...   If there
is a defect in what I do (SMP related problems, or something), that I will
listen to, and attempt to fix, but "you should have done it this other way
instead" will go nowhere.


