Subject: Bug in timeout()/untimeout() ?
To: None <tech-kern@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: tech-kern
Date: 04/14/2000 02:30:09
Hi,
the PR from John Hawkinson about his IDE drive taking 11mn to declare a
block bad made me suspect a problem in timeout()/untimeout()

Let's look at softclock() (from kern/clock.c):
        while ((c = calltodo.c_next) != NULL && c->c_time <= 0) {
		func = c->c_func;
		arg = c->c_arg;
		calltodo.c_next = c->c_next;
		c->c_next = callfree;
		callfree = c;
		splx(s);
		(*func)(arg);
		(void) splhigh();
	}

My theory is based on the fact that c_next is not declared volatile, so
calltodo.c_next may be cached in a register and not re-read from memory
in the next iteration. If this isn't true please stop here and ignore the
following (and tell me I don't understand anything at compilers :)
Now, as we are at splhigh() and we lower the spl (splx(s)) before calling
(*func), anything may happen here, especially an interrupt routine which
make use the timeout/untimeout code; also note that (*func) itself may use
the timeout code (wdctimeout() does).

So imaging we have the folloing:
calltodo->A->B->C->D->NULL
callfree->X->Y->NULL
for some reason both A and B are due (for example because we have been running
at > splsoftclock for too long; may also be because they were just sheduled for
the same tick). Just before lowering interrupts we have:
calltodo->B->C->D->NULL
callfree->A->->C->NULL
while interrupts are lowered something untimeouts B; we have:
calltodo->C->NULL
callfree->B->A->->C->D->NULL
But because of the cached value in the loop, B is processed anyway, and put
a second time on the free list:
calltodo->C->D->NULL
callfree->B->B
and voila, B points to itself, we have a loop in a list processed
mostly at splhigh(). If during a (brief) period where the spl is lowered
something adds a timeout callback, it's easy to show that this will create
a loop in calltodo, which is processed at 2 places at splhigh() without
splx() in the loop.

I dind't look closely but it looks like this problem can't happen with the
new callout interface.

--
Manuel Bouyer <bouyer@antioche.eu.org>
--