Subject: Lost clock interrupts in 1.4Z
To: None <current-users@netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: current-users
Date: 06/05/2000 02:45:31
I see things like "time reset 15.640742 s" in the log file when running 
heavy network tests.

This happens on both Intel and Alpha systems.  They keep time correctly 
when there isn't any nasty network activity. 

I'm running nasty tests, hitting the receiver as hard as I can, looking 
for troubles like this.

I'm getting the "increase NMBCLUSTERS" warnings. 

I assume that clock interrupts are getting lost because a network 
driver is running at interrupt level continuously for more than one 
clock tick.  I don't have any data to back that up.  Are there other 
possible explanations? 


1) Is it reasonable for heavy network activity to cause clock problems?

  "Reasonable" isn't the right word to use.  I think I'm asking about 
  priorities.  How serious do people consider clock-slipping problems? 
  
  This is probably hard to fix.  I think it would make sense to add 
  this to the known glitch list but leave it on the back burner until 
  somebody gets interested in working on it. 
  
  It might be appropriate to call these tests abusive or stupid and 
  say "don't do that".  I don't think that's a good enough answer. 
  Somebody probably wants to run a server connected to the big-bad 
  internet.  This glitch might not happen if you only have a 10 megabit 
  connection - but might happen more often on a slower machine. 
  
  There are several ways I can think of to fix this.  One is for 
  all the drivers to limit the amount of processing they do - return 
  to the interrupt dispatcher before they are done so it can check 
  for a clock interrupt and/or give other drivers a chance to run 
  too.  A second way is to use the cycle counter for the timekeeping.  
  (Of course, that only works on machines with appropriate hardware.)  
  A third is to use multiple interrupt levels - let the clock interrupt 
  the network drivers.  How much hardware supports that?  
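
  For the first approach, here's roughly what I have in mind (a sketch 
  only -- "mydev" and its helper functions are made up, not any real 
  driver): the handler does a bounded amount of work per invocation and 
  then returns, instead of looping until the device is drained. 

    #define MYDEV_RX_BUDGET 32      /* frames handled per interrupt */

    int
    mydev_intr(void *arg)
    {
            struct mydev_softc *sc = arg;
            int handled = 0;

            while (handled < MYDEV_RX_BUDGET && mydev_rx_pending(sc)) {
                    mydev_rx_one(sc);   /* pull one frame off the ring */
                    handled++;
            }
            /*
             * If work remains, return anyway so the dispatcher can take
             * a pending clock interrupt or service another device; the
             * still-asserted interrupt brings us back here.
             */
            return (handled != 0);
    }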


2) I don't understand the buffer allocation strategy.  I assume the 
received packets are piling up on the input dispatch queue and can't 
get processed because all the CPU cycles are getting burned by the 
interrupt routines.  

  Are there any mechanisms to limit that?  Will things get better 
  if I make NMBCLUSTERS big enough?  If so, how big is that?  I'm 
  running with 4K or 8K now.  That works fine for everything but 
  the nasty UDP flood type tests.
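
  For what it's worth, my (possibly wrong) understanding of the existing 
  mechanism is that the classic BSD input path already bounds the 
  protocol input queue, so packets don't pile up there forever -- the 
  excess gets dropped instead.  From memory (not a quote of the real 
  code), the enqueue looks roughly like: 

    int s;

    s = splimp();
    if (IF_QFULL(&ipintrq)) {
            IF_DROP(&ipintrq);          /* bump ifq_drops */
            m_freem(m);                 /* throw the packet away */
    } else {
            IF_ENQUEUE(&ipintrq, m);
            schednetisr(NETISR_IP);     /* softint drains it later */
    }
    splx(s);

  If that's right, ifq_maxlen is what keeps that particular backlog 
  bounded, and the clusters have to be accumulating somewhere else. 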

  Should drivers stop listening in cases like this to free up the 
  CPU?  (I'm interested in the theory.  I'm not volunteering to write 
  the code.) 


3) Some drivers call printf from interrupt level.  That aggravates 
this problem.  It also provides obscure information which might be 
critical to tracking down a problem.  This looks like a hard/interesting 
tradeoff to me.  

  Is there any policy on printing from interrupt level?  Perhaps 
  there should be a compile-time switch?  Or maybe a mechanism to 
  register counters with a background task which will watch them 
  and print an occasional message if they change.
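
  For that last idea, here's roughly the interface I'm picturing 
  (purely hypothetical -- nothing like this exists in the tree, all 
  the names are invented): 

    #include <sys/queue.h>

    struct evcounter {
            const char *ec_name;        /* e.g. "fxp0: rx overruns" */
            u_long     ec_count;        /* bumped at interrupt level */
            u_long     ec_seen;         /* last value reported */
            TAILQ_ENTRY(evcounter) ec_list;
    };

    static TAILQ_HEAD(, evcounter) evcounters =
        TAILQ_HEAD_INITIALIZER(evcounters);

    void
    evcounter_attach(struct evcounter *ec) /* driver calls at attach */
    {
            TAILQ_INSERT_TAIL(&evcounters, ec, ec_list);
    }

    /* Runs from timeout(9), never at interrupt level. */
    void
    evcounter_watch(void *arg)
    {
            struct evcounter *ec;

            TAILQ_FOREACH(ec, &evcounters, ec_list) {
                    if (ec->ec_count != ec->ec_seen) {
                            printf("%s: %lu new events\n", ec->ec_name,
                                ec->ec_count - ec->ec_seen);
                            ec->ec_seen = ec->ec_count;
                    }
            }
            timeout(evcounter_watch, NULL, hz * 10); /* every ~10 s */
    }

  A driver would then just bump ec_count in its interrupt handler and 
  never call printf there at all; the watcher does the printing from 
  a sane priority level. 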