NetBSD-Bugs archive


kern/58733: clock_nanosleep() (and I suspect more) is badly broken



>Number:         58733
>Category:       kern
>Synopsis:       clock_nanosleep() (and I suspect more) is badly broken
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 09 07:55:00 +0000 2024
>Originator:     Robert Elz
>Release:        NetBSD 10.99.12
>Organization:
>Environment:
System: NetBSD jacaranda.noi.kre.to 10.99.12 NetBSD 10.99.12 (JACARANDA:1.1-20241009) #156: Wed Oct 9 11:55:18 +07 2024 kre%jacaranda.noi.kre.to@localhost:/usr/obj/testing/kernels/amd64/JACARANDA amd64
Architecture: x86_64
Machine: amd64
>Description:

	clock_nanosleep() is used by sleep(1) to delay until a specified
	time (called with TIMER_ABSTIME after sleep.c calculates what the
	desired ending time is).

	In normal circumstances, all goes well, nothing happens, and the
	sleep ends at the desired time.

	But not always...

	NOTE:	This affects NetBSD 10 and should be fixed before 10.1 is
		released.   Haven't looked at 9 (yet), but that's less urgent
		right now.

>How-To-Repeat:

	First, the system to run this test must have been running for more
	than a short while - longer than the time we are to sleep for - or
	the problem is not obvious (there's still a problem, but you might
	not notice it).   Further, we want the sleep time we're testing to
	be large enough that we can tell when sleep(1) exits too quickly.
	Testing with "sleep 10" will be useless; the value needs to be
	at least 1000, and 10000 makes things even more obvious - but for
	10000 the system needs to have been booted no less than about
	3 hours ago (10000 seconds is roughly 2h47m), whereas with 1000
	an uptime of 20 minutes is enough.

	With such a setup, run (from a shell supporting job control)

		/bin/sleep 1000			(or whatever)
		^Z
		fg  (or bg if you prefer)

	Do this reasonably quickly.

	If all is well (no bug occurs), when sleep is resumed after having
	been stopped, it will just keep on sleeping until the specified
	time has expired.   (By all means, run date before and after, and
	verify that the correct time has passed.)

	Don't worry, you don't need to wait that long.   Provided the
	system has been running long enough, as soon as you resume sleep
	it will exit.   That's because clock_nanosleep() has returned
	prematurely in this case.   If the system hasn't been running
	long enough, the return won't be immediate, but the sleep will
	still end sooner than it should - how much sooner depends upon
	how close the system was to having been up long enough when you
	ran the test.   If you run this test immediately after the system
	has booted, it will all appear to work perfectly.

	I know - that's what happened to me when I first started debugging
	this.   Boot new kernel, immediately run test.  It works!   I changed
	nothing (relevant).   Must be a compiler bug..... no.

>Fix:
	Not a fix yet, but I understand the problem, and I think there is
	a serious issue that needs to be fixed.   I believe it affects more
	than just clock_nanosleep().

	clock_nanosleep() computes the number of ticks to sleep for, and
	uses kpause() to wait for that long.   If nothing happens, kpause()
	waits that long (or as long as an int representation of the number
	of ticks can manage - which is long enough that none of us will ever
	observe what happens when that's exceeded) and all is good.
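	In outline, the path through nanosleep1() (the common implementation
	of nanosleep() and clock_nanosleep()) looks something like this -
	a simplified sketch from memory, not the actual kern_time.c source,
	with signal and error handling omitted, and "rqt" standing for the
	requested-time timespec:

		struct timespec tsd;	/* "start" time, filled by ts2timo() */
		int timo, error;

		/* convert the deadline (absolute or relative) into ticks */
		error = ts2timo(clock_id, flags, rqt, &timo, &tsd);

		/* sleep; this may return before timo ticks have elapsed */
		error = kpause("nanoslp", true, timo, NULL);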

	But kpause(9) says:
	   kpause() can wake up spontaneously.  Callers should prepare
	   to handle it.

	One thing that will cause a spontaneous early return is the process
	having been stopped and resumed.

	That's fine, and clock_nanosleep() is prepared to deal with that:
	it uses the same code that deals with sleeps of very long periods
	(more ticks than an int can represent).   The code looks perfect,
	and would be, except for one little thing.

	Wait for it, we're getting there.

	The problem stems from some wording in ts2timo(9).   ts2timo() is
	the function used to convert a timespec specifying when something
	is to happen into a count of the number of ticks to wait until that
	something should have occurred (or simply how long to sleep).
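	For reference, its signature (approximately - it lives in
	sys/kern/subr_time.c, and this is quoted from memory) is:

		int	ts2timo(clockid_t clock_id, int flags,
			    struct timespec *ts, int *timo,
			    struct timespec *start);

	Note that the 3rd parameter is not const - the timespec it points
	to is fair game to be modified, and that turns out to matter.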

	The issue is this (in ts2timo(9)):

	        If the interval is specified as an absolute time, then
		the clock_id clock is used to convert it to a relative time.

	Now, as it stands, that is fine: if we're passing in an absolute
	end time (according to one of the system's clocks), then to work
	out how long between now and then, we need to subtract now from
	the end time, giving the duration (as a timespec first, then
	converted into ticks).   The whole issue is the use of "it" in
	that sentence.

	If ts2timo(9) instead said:

		If the interval is specified as an absolute time, then
		the clock_id clock is used to compute the relative time.

	there would be no suggestion that the caller's timespec gets
	modified.   But as it is, ts2timo() says it will convert "it"
	(that being the interval specified as an absolute time) into a
	relative time.   And that is exactly what it does - in place.
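	That is, schematically (a sketch, assuming it uses the
	timespecsub() macro from sys/time.h - check subr_time.c for the
	real code):

		if ((flags & TIMER_ABSTIME) != 0) {
			/* "now" was just read from the clock_id clock */
			timespecsub(ts, &now, ts);
			/* the caller's timespec is now overwritten
			   with the relative time */
		}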

	From what I can see, looking at the relatively few uses of ts2timo()
	(including the one in nanosleep1() - the function that actually
	implements clock_nanosleep() and nanosleep()) none of the callers
	are expecting that.

	nanosleep1() certainly isn't - that's why the sleep ends early.
	Initially the correct number of ticks is computed, and it pauses.
	When the pause ends, it determines whether the current time is now
	greater than the time to end, and if not, goes back to sleep again.
	But because ts2timo() converted the absolute time into a relative
	time, what we're now testing (since sleep(1) uses CLOCK_MONOTONIC,
	which is approximately just uptime) is whether the uptime is
	greater than the (original) relative time to sleep (that is, the
	arg given to sleep(1)).   If it is, it looks as if the end time
	has passed, and nanosleep1() concludes it has successfully
	finished.   Concretely (illustrative numbers): with 2000 seconds
	of uptime, "sleep 1000" sets an end time of 3000 (MONOTONIC);
	ts2timo() rewrites that timespec to the relative value 1000; when
	the stopped sleep is resumed, the current time (a bit over 2000)
	is compared against 1000, appears to be beyond the "end time",
	and sleep exits at once.

	There are several other calls of ts2timo() that use, or might use,
	TIMER_ABSTIME.

	First is lwp_park() in sys_lwp.c (the generic implementation of the
	_lwp_park() system call's many variants).   That might use
	TIMER_ABSTIME or TIMER_RELTIME at the caller's discretion (just
	like clock_nanosleep()).

	It suffers from the same manipulation by ts2timo() in the
	TIMER_ABSTIME case, but doesn't care, as it never uses the ts again
	in that case - only in the relative time case, when the remaining
	time is computed and returned to the caller.   I didn't check, but
	I assume (hope) that in the TIMER_ABSTIME case the (various)
	interfaces to this function don't copyout() the timespec back to
	the caller (if they did, bad things would probably be happening).
	(I can also comment on the clock_timeleft() function, used only by
	lwp_park() from what I could find - which is a little (or perhaps
	a lot) weird, though it seems to work OK for its one use.)

	Anyway, lwp_park() seems not to care what ts2timo() does, or doesn't
	do to the timespec passed in in the TIMER_ABSTIME case.

	Next are mq_recv1() and mq_send1() in sys_mqueue.c.  Both functions
	are identical for this purpose: they are handed an optional pointer
	to a timespec (we're only interested in the case where it is used,
	obviously).   These always use TIMER_ABSTIME.

	Each has a loop...

		while (mqattr->mq_curmsgs ...) {
			/*
			 * The condition differs between the two functions,
			 * but for us here it is irrelevant; all that
			 * matters is that there is a loop.
			 */
			if (ts) {
				error = ts2timo(..., TIMER_ABSTIME, ts,
				    &t, ...);
				/*
				 * The other args and error handling are
				 * immaterial here.  This turns ts (a
				 * timespec) into t (ticks).
				 */
			} else
				t = 0;	/* the case we don't care about */

			error = cv_timedwait_sig(..., t);
				/* do stuff, waiting no longer than t ticks */
		}

	Very approximately.   First time around the loop, all is good.
	Second time around, the timespec that ts points to has been
	adjusted by ts2timo() to be a relative time.   What used to be a
	BIG number (this actually uses CLOCK_REALTIME, the time of day
	clock, whose value already contains some 54 years' worth of
	seconds) has become a small one (the time to wait, in seconds,
	which is unlikely to be anything like 54 years).   So the second
	time around the loop, the current time will be greater than the
	"absolute" end time passed in to ts2timo(), which will detect
	that and return ETIMEDOUT.   The code not shown above treats that
	as "goto error", and the loop abruptly terminates.   Not what is
	expected, I think.

	Next is do_ksem_wait() in uipc_sem.c, which takes an optional
	timespec as an absolute time (it uses TIMER_ABSTIME only).
	do_ksem_wait() is used by a few sys__ksem_*() functions, but only
	sys__ksem_timedwait() has the ability to pass in a timespec - the
	others just pass NULL.

	sys__ksem_timedwait() doesn't care what is left in the timespec
	when do_ksem_wait() returns (it is never sent back to userland).

	do_ksem_wait() contains a loop really very similar to the one
	used in the mqueue functions.   I won't show the code again; it
	has exactly the same issues.

	That's it.

	So, it seems that nothing actually wants the "update the timespec
	passed in" behaviour of ts2timo() - behaviour which the manual
	page only hints at; it never says it explicitly.


	So, what I propose to do is simply to stop that from happening:
	make the "struct timespec *ts" parameter (the 3rd) of ts2timo()
	a "const struct timespec *ts" instead (no modifications allowed),
	to make it quite clear that it will *not* be changed, and adjust
	the code in ts2timo() to avoid doing that (at worst this adds one
	copy of a struct timespec).
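	Schematically, the adjusted function would look something like
	this (just a sketch of the idea, not a tested patch; the
	prototype, wherever it is declared, gets the same change):

		int
		ts2timo(clockid_t clock_id, int flags,
		    const struct timespec *ts, int *timo,
		    struct timespec *start)
		{
			struct timespec tsd, now;
			int error;

			tsd = *ts;	/* the one added copy */

			if ((flags & TIMER_ABSTIME) != 0) {
				error = clock_gettime1(clock_id, &now);
				if (error != 0)
					return error;
				/* make the copy relative; the caller's
				   timespec is untouched */
				timespecsub(&tsd, &now, &tsd);
			}

			/* then validate tsd and convert it to ticks in
			   *timo, exactly as the existing code does with
			   *ts */
			...
		}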

	But before I do that, I wanted to see if there is anyone who knows
	why ts2timo() was written like that - it is a very peculiar way to
	implement it by accident.

	An alternative fix is simply to have clock_nanosleep() (which is
	the interface I care about at the minute) pass a never to be used
	again copy of the ABSTIME timespec to ts2timo(), and retain the
	real one to be used again next time; I'd guess the other functions
	above (except possibly lwp_park(), which doesn't seem to care)
	would need to do the same.
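	In nanosleep1() that would look something like this (sketch only;
	"rqt" again standing for the caller's deadline timespec):

		struct timespec scratch, tsd;
		int timo, error;

		scratch = *rqt;		/* ts2timo() may clobber this copy */
		error = ts2timo(clock_id, flags, &scratch, &timo, &tsd);

		/* *rqt still holds the absolute deadline, so after an
		   early return from kpause() it can be tested - and
		   converted - again, correctly */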

	Opinions?


