Subject: Re: netbsd machines get slow and hang, nfs suspected
To: None <sommerfeld@orchard.medford.ma.us>
From: Gordon W. Ross <gwr@mc.com>
List: current-users
Date: 03/25/1996 11:28:48
> Date: Fri, 22 Mar 1996 22:48:44 -0500
> From: Bill Sommerfeld <sommerfeld@orchard.medford.ma.us>

> [10,000 context switches in one second??? And note that all of a sudden,
> for a short time, over 100 httpds become runnable. Whatever that means-
> they don't seem to do anything. We see this happen several more times.]
>  2 238 0 45216   124  200 104 119  32   0 256 78  495  106 10263  1 19  80
>  0 230 0 44808   124  169  52  77   4   0 122 78  394  191 6004  0 25  74
>  112 150 0 45092   244   62 171  52  66   0 327 80  338   92 8913  0 27  73
> 
> It would be worthwhile to figure out which wait channel they were all
> piled up on; you can get this from `ps'.  You may want to get both the
> `wchan' (the symbolic name) and the `nwchan' (the actual address
> waited on).
> 
> Most likely all hundred were piled up on the same wait channel, and
> then something woke it up, causing a thundering herd of processes to
> run in circles trying to grab whatever it was they were waiting for...
> 
> 						- Bill

Interesting suggestion.  That reminds me:  The SunOS kernel has

	wakeone(int wchan); /* wakeup just one process */

for use with things like widely used "mutex" locks, where we
don't care who gets to proceed, and we know only one process
will actually be allowed to proceed past the mutex.

It would be nice to identify the common places where we are
prone to the "thundering herd" problem, and change them to
use wakeone() instead of wakeup().
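
To make the difference concrete, here is a rough user-space sketch.
It is a toy model only, not SunOS or NetBSD kernel code: the toy_*
names, struct toy_proc, and NPROC are all made up for illustration,
and a real kernel would hash sleepers by wait channel and touch the
scheduler rather than scan a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of sleepers on a wait channel (hypothetical, not kernel code). */
#define NPROC 100

struct toy_proc {
	const void *wchan;	/* channel slept on; NULL means runnable */
};

struct toy_proc toy_procs[NPROC];

/* Put every toy process to sleep on the same channel. */
void
toy_sleep_all(const void *chan)
{
	size_t i;

	assert(chan != NULL);
	for (i = 0; i < NPROC; i++)
		toy_procs[i].wchan = chan;
}

/* wakeup()-style: make every sleeper on chan runnable; return the count. */
int
toy_wakeup(const void *chan)
{
	size_t i;
	int n = 0;

	for (i = 0; i < NPROC; i++) {
		if (toy_procs[i].wchan == chan) {
			toy_procs[i].wchan = NULL;
			n++;
		}
	}
	return n;
}

/* wakeone()-style: make only the first sleeper found runnable. */
int
toy_wakeone(const void *chan)
{
	size_t i;

	for (i = 0; i < NPROC; i++) {
		if (toy_procs[i].wchan == chan) {
			toy_procs[i].wchan = NULL;
			return 1;
		}
	}
	return 0;
}
```

With all NPROC sleepers on one channel, toy_wakeup() makes every one
of them runnable at once, while toy_wakeone() makes a single process
runnable and leaves the rest asleep.  The catch is that whoever
releases the lock has to call the wakeone-style routine on every
release, or the remaining sleepers never get their turn.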

Gordon