NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/58733: clock_nanosleep() (and I suspect more) is badly broken
>Number: 58733
>Category: kern
>Synopsis: clock_nanosleep() (and I suspect more) is badly broken
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Oct 09 07:55:00 +0000 2024
>Originator: Robert Elz
>Release: NetBSD 10.99.12
>Organization:
>Environment:
System: NetBSD jacaranda.noi.kre.to 10.99.12 NetBSD 10.99.12 (JACARANDA:1.1-20241009) #156: Wed Oct 9 11:55:18 +07 2024 kre%jacaranda.noi.kre.to@localhost:/usr/obj/testing/kernels/amd64/JACARANDA amd64
Architecture: x86_64
Machine: amd64
>Description:
clock_nanosleep() is used by sleep(1) to delay until a specified
time (called with TIMER_ABSTIME after sleep.c calculates what the
desired ending time is).
In normal circumstances, all goes well, nothing happens, and the
sleep ends at the desired time.
But not always...
NOTE: This affects NetBSD 10 and should be fixed before 10.1 is
released. Haven't looked at 9 (yet), but that's less urgent
right now.
>How-To-Repeat:
First, the system to run this test must have been running for more
than a short while - longer than the time we are to sleep for, or
the problem is not obvious (there's still a problem, but it you might
not notice it). Further, we want the sleep time we're testing to
be large enough so we can tell when sleep(1) exits too quickly.
Testing with "sleep 10" will be useless, the value needs to be
at least 1000, 10000 makes things even more obvious, but for that
the system needs to have been booted no less than 3 hours ago,
with 1000, an uptime of 20 mins is enough.
With such a setup, run (from a shell supporting job control)
/bin/sleep 1000 (or whatever)
^Z
fg (or bg if you prefer)
Do this reasonably quickly.
If all is well, (no bug occurs) when sleep is resumed after having
been stopped, it will just keep on sleeping until the specified
time has expired. (By all means, do date before and after, verify
the correct time has passed).
Don't worry, you don't need to wait that long, provided that the
system has been running long enough, as soon as you resume running
sleep again, it will exit. That's because clock_nanotime() has
returned prematurely in this case -- if the system hasn't been running
long enough then it won't immediately return, but the sleep will end
sooner than it should have - how much sooner depends upon how close
to having been up long enough you waited - if you run this test
immediately the system has booted, it will all appear to work
perfectly.
I know - that's what happened to me when I first started debugging
this. Boot new kernel, immediately run test. It works! I changed
nothing (relevant). Must be a compiler bug..... no.
>Fix:
Not a fix yet, but I understand the problem, and I think there is
a serious issue that needs to be fixed. I believe it affects more
than just clock_nanosleep().
clock_nanosleep() computes the number of ticks to sleep for, and
uses kpause() to wait for that long. If nothing happens, kpause()
waits that long (or as long as an int representation of the number
of ticks can manage - which is long enough that none of us will ever
observe what happens when that's exceeded) and all is good.
But kpause(9) says:
kpause() can wake up spontaneously. Callers should prepare
to handle it.
One thing that will cause a spontaneous early return is the process
having been stopped and resumed.
That's fine, and clock_nanosleep() is prepared to deal with that,
it uses the same code that deals with sleeps of very long periods
(more ticks than an int can represent). The code looks perfect,
and would be except for one little thing.
Wait for it, we're getting there.
The problem stems from this wording in ts2timo(9). ts2timo() is
the function used to convert a timespec specifying when something
is to happen, into a count of the number ticks to wait until that
something should have occurred (or just how long to sleep).
The issue is this (in ts2timo(9)):
If the interval is specified as an absolute time, then
the clock_id clock is used to convert it to a relative time.
Now as it stands that is fine, if we're passing in an absolute
end time (according to one of the system's clocks) then to work
out how long between now and then, we need to subtract now from
the end time, to determine how long (as a timespec first, then
converted into ticks). The whole issue is the use of "it" in
that sentence.
If ts2timo(9) instead said:
If the interval is specified as an absolute time, then
the clock_id clock is used to compute the relative time.
But as it is, ts2timo() says it will convert "it" (that being
the interval specified as an absolute time) into a relative time.
And that's what it does.
From what I can see, looking at the relatively few uses of ts2timo()
(including the one in nanosleep1() - the function that actually
implements clock_nanosleep() and nanosleep()) none of the callers
are expecting that.
nanosleep1() certainly isn't - that's why the sleep ends early.
Initially the correct number of ticks are computed, and it pauses.
When the pause ends, it determines if the current time is now
greater than the time to end, and if not, goes back to sleep again.
But because of ts2timo()'s conversion of the absolute time into a
relative time, what we're now testing (since sleep(1) is using
CLOCK_MONATONIC - which is approximately just uptime) if the uptime
is greater than the (original) relative time to sleep (that is, the
arg given to sleep(1)) then it looks as if the end time has passed,
and nanosleep1() concludes it has successfully finished.
There are several other calls of ts2timo() that use, or might use,
TIMER_ABSTIME
First is lwp_park() in sys_lwp.c (the generic implementation of the
_lwp_park() system call's many variants). That might use
TIMER_ABSTIME or TIMER_RELTIME at the caller's discretion (just
like clock_nanosleep()).
It suffers from the same manipulation by ts2timo() in the TIMER_ABSTIME
case, but doesn't care, as it never uses the ts again in that case,
only in the relative time case when the remaining time is computed
and returned to the caller. I didn't check, but I assume (hope) in
the TIMER_ABSTIME case, the (various) interfaces to this fuction don't
copyout() the timespec back to the caller (if they did, bad things
would probably be happening). (I can also comment on the
clock_timeleft() function, used only by lwp_park() from what I could
find - which is a little (or perhaps a lot) weird, though seems to
work OK for its one use).
Anyway, lwp_park() seems not to care what ts2timo() does, or doesn't
do to the timespec passed in in the TIMER_ABSTIME case.
Next are mq_recv1() and mq_send1() in sys_mqueue.c. Both functions
are identical for this pupose, they are handed an optional pointer
to a timespec (we're only interested in the case where it us used,
obviously). These always use TIMER_ABSTIME.
Each has a loop...
while (mqattr->mq_curmsgs ...
{
/* the condition is different for each function, but
for us here, it is irrelevant, all that matters is
that there is a loop. */
if (ts) {
error = ts2timo(..., TIMER_ABSTIME, ts, &t, ...);
/* the other args and error handling immaterial
here. That turns ts(timespec) -> t(ticks) */
} else t = 0; /* the case we don't care about */
error = cv_timedwait_sig(..., t);
/* do stuff, no longer than t ticks */
}
Very approximately. First time around the loop, that's all good.
Second time around, now we have the timespec ts points to having
been adjusted by ts2timo() to be a relative time. What used to
be a BIG number (since this actually uses CLOCK_REALTIME which is
the time of day clock, and hence already has 54 years (almost)
in its value) into a small one (the time to wait, in seconds, which
is unlikely to be anything like 54 years). So second time around
the loop, the current time will be > than the "absolute" end time
passed in to ts2timo() which will detect that, and return ETIMEDOUT.
The code not shown above detects that as "goto error", and the
loop abruptly terminates. Not what is expected I think.
Next is do_ksem_wait() in uips_sem.c and takes an optional timespec
as an absolute time (uses TIMER_ABSTIME only). do_ksem_wait()
is used by a few sys__ksem_*() functions, but only sys__ksem_timedwait()
has the ability to pass in a timespec - the other just pass NULL.
sys__ksem_timedwait() doesn't care what is left in the timespec
when do_ksem_wait() returns (it is never sent back to userland).
do_ksem_wait() contains a loop, really very similar to the one
used in the mqueue functions. I won't show the code again, it
has the same issues (exactly).
That's it.
So, it seems like nothing actually wants the "update the timespec
passed in" behaviour of ts2timo() - which the manual page even
only hints is what happens, it doesn't say it explicitly.
So, what I propose to do is to simply stop that from happening.
Make the "struct timespec *ts" parameter (the 3rd) of ts2timo()
be instead a "const struct timespec *ts" (no modifications allowed)
to make it quite clear that it will *not* be changed, and adjust
the code in ts2timo() to avoid doing that (at worst, this is one
copy of a struct timespec to be added).
But before I do that, I wanted to see if there is anyone who knows
why ts2timo() was written like that - it is a very peculiar way to
implement it by accident.
An alternative fix is simply to have clock_nanosleep() (whcih is the
interface I care about at the minute) pass a never to be used again
copy of the ABSTIME timespec to ts2timo() and retain the real one
to be used again next time (and I'd guess, those other functions above,
except possibly lwp_park() which doesn't seem to care) would need to
do that same.
Opinions?
Home |
Main Index |
Thread Index |
Old Index