Subject: Lost clock interrupts in 1.4Z
To: None <current-users@netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: current-users
Date: 06/05/2000 02:45:31
I see things like "time reset 15.640742 s" in the log file when running
heavy network tests.
This happens on both Intel and Alpha systems. They keep time correctly
when there isn't any nasty network activity.
I'm running nasty tests, hitting the receiver as hard as I can, looking
for troubles like this.
I'm getting the "increase NMBCLUSTERS" warnings.
I assume that clock interrupts are getting lost because a network
driver is running at interrupt level continuously for more than one
clock tick. I don't have any data to back that up. Are there other
possible explanations?
1) Is it reasonable for heavy network activity to cause clock problems?
"Reasonable" isn't the right word to use. I think I'm asking about
priorities. How serious do people consider clock-slipping problems?
This is probably hard to fix. I think it would make sense to add
this to the known glitch list but leave it on the back burner until
somebody gets interested in working on it.
It might be appropriate to call these tests abusive or stupid and
say "don't do that". I don't think that's a good enough answer.
Somebody probably wants to run a server connected to the big-bad
internet. This glitch might not happen if you only have a 10 megabit
connection - but might happen more often on a slower machine.
There are several ways I can think of to fix this. One is for
all the drivers to limit the amount of processing they do per
interrupt - return to the interrupt dispatcher before they are
done so it can check for a clock interrupt and/or give other
drivers a chance to run. A second is to use the cycle counter
for timekeeping. (Of course, that only works on machines with
appropriate hardware.) A third is to use multiple interrupt
levels - let the clock interrupt the network drivers. How much
hardware supports that?
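To make the first idea concrete, here is a rough sketch of the kind
of "budgeted" interrupt handler I mean. None of these names come from
a real NetBSD driver, and it's written as a user-level toy so it can
be compiled and run:

    /*
     * Sketch only: "budgeted" receive interrupt handling.  Process at
     * most RX_BUDGET packets per interrupt, then return so the clock
     * and the other devices get a chance.  All names are made up.
     */
    #include <stdio.h>

    #define RX_BUDGET 32

    struct softc {
        int sc_rx_pending;      /* packets waiting in the receive ring */
        int sc_rx_deferred;     /* times we quit with work left over */
    };

    static void
    rx_process_one(struct softc *sc)
    {
        /* stand-in for pulling one packet off the ring */
        sc->sc_rx_pending--;
    }

    static void
    rx_intr(struct softc *sc)
    {
        int budget = RX_BUDGET;

        while (sc->sc_rx_pending > 0 && budget-- > 0)
            rx_process_one(sc);

        /* Work left?  Count it and come back on the next interrupt
           instead of hogging the CPU until the ring is empty. */
        if (sc->sc_rx_pending > 0)
            sc->sc_rx_deferred++;
    }

    int
    main(void)
    {
        struct softc sc = { 100, 0 };   /* pretend 100 packets arrived */

        while (sc.sc_rx_pending > 0)
            rx_intr(&sc);

        printf("quit early %d times\n", sc.sc_rx_deferred);
        return 0;
    }

The open question is what to do with the leftover work when the
budget runs out - leave the interrupt asserted, or hand the rest
off to something that runs at a lower priority.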
2) I don't understand the buffer allocation strategy. I assume packets
are piling up on the input dispatch queue and can't get processed
because all the CPU cycles are getting burned by the interrupt routines.
Are there any mechanisms to limit that? Will things get better
if I make NMBCLUSTERS big enough? If so, how big is that? (There is
a config example below.) I'm running with 4K or 8K now. That works
fine for everything but the nasty UDP flood type tests.
Should drivers stop listening in cases like this to free up the
CPU? (I'm interested in the theory. I'm not volunteering to write
the code.)
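For what it's worth, this is the sort of change I mean by making
NMBCLUSTERS bigger - a line in the kernel config file, then rebuild
with config(8) and make as usual. The 8192 is just a number I picked
for illustration, not a recommendation:

    # kernel configuration file, e.g. sys/arch/i386/conf/MYKERNEL
    options         NMBCLUSTERS=8192        # mbuf clusters for network buffers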
3) Some drivers call printf from interrupt level. That aggravates
this problem. On the other hand, the information they print, obscure
as it is, might be critical to tracking down a problem. This looks
like a hard/interesting tradeoff to me.
Is there any policy on printing from interrupt level? Perhaps
there should be a compile-time switch? Or maybe a mechanism to
register counters with a background task which will watch them
and print an occasional message if they change.
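Here is roughly what I have in mind for the counter idea. The
interface is invented for illustration - the point is only the shape:
the interrupt handler just bumps a counter, and a background task
does the printing:

    /*
     * Sketch only - invented interface.  Drivers register a counter
     * and bump it at interrupt level instead of calling printf; a
     * background task checks the counters now and then and prints
     * one line if anything changed.
     */
    #include <stdio.h>

    #define MAXCOUNTERS 16

    struct watched_counter {
        const char    *name;
        unsigned long  count;     /* bumped at interrupt level */
        unsigned long  reported;  /* value as of the last report */
    };

    static struct watched_counter counters[MAXCOUNTERS];
    static int ncounters;

    static struct watched_counter *
    counter_register(const char *name)
    {
        /* no overflow check - this is a sketch */
        struct watched_counter *c = &counters[ncounters++];

        c->name = name;
        c->count = c->reported = 0;
        return c;
    }

    /* Run periodically by a background task, never at interrupt level. */
    static void
    counter_watch(void)
    {
        int i;

        for (i = 0; i < ncounters; i++) {
            struct watched_counter *c = &counters[i];

            if (c->count != c->reported) {
                printf("%s: %lu events since last report\n",
                    c->name, c->count - c->reported);
                c->reported = c->count;
            }
        }
    }

    int
    main(void)
    {
        struct watched_counter *rxover = counter_register("rx overruns");

        rxover->count += 5;     /* pretend the interrupt handler did this */
        counter_watch();        /* prints one line */
        counter_watch();        /* nothing changed, prints nothing */
        return 0;
    }

The watcher would run from a timeout or a kernel thread every few
seconds, so the cost of the printf never lands at interrupt level.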