Subject: Re: netbsd machines get slow and hang, nfs suspected
To: None <sommerfeld@orchard.medford.ma.us>
From: Gordon W. Ross <gwr@mc.com>
List: current-users
Date: 03/25/1996 11:28:48
> Date: Fri, 22 Mar 1996 22:48:44 -0500
> From: Bill Sommerfeld <sommerfeld@orchard.medford.ma.us>
> [10,000 context switches in one second??? And note that all of a sudden,
> for a short time, over 100 httpds become runnable. Whatever that means-
> they don't seem to do anything. We see this happen several more times.]
> 2 238 0 45216 124 200 104 119 32 0 256 78 495 106 10263 1 19 80
> 0 230 0 44808 124 169 52 77 4 0 122 78 394 191 6004 0 25 74
> 112 150 0 45092 244 62 171 52 66 0 327 80 338 92 8913 0 27 73
>
> It would be worthwhile to figure out which wait channel they were all
> piled up on; you can get this from `ps'. You may want to get both the
> `wchan' (the symbolic name) and the `nwchan' (the actual address
> waited on)
>
> Most likely all hundred were piled up on the same wait channel, and
> then something woke it up, causing a thundering herd of processes to
> run in circles trying to grab whatever it was they were waiting for...
>
> - Bill
Interesting suggestion. That reminds me: The SunOS kernel has

	wakeone(int wchan);	/* wakeup just one process */

for use with things like widely used "mutex" locks, where we
don't care who gets to proceed, and we know only one process
will actually be allowed to proceed past the mutex.

It would be nice to identify the common places where we are
prone to the "thundering herd" problem, and change them to
use wakeone() instead of wakeup().
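
To give a rough feel for the difference, here is a toy user-space
model (not SunOS or NetBSD kernel code). The names toy_wakeup(),
toy_wakeone() and simulate() are made up for illustration, and the
model assumes 100 processes asleep on one wait channel, each needing
to get through a single mutex once. It only counts how many times a
process gets scheduled.

```c
#include <assert.h>
#include <stddef.h>

#define NPROC 100                /* assumed number of sleepers */

enum state { SLEEPING, RUNNABLE, DONE };

struct proc {
    enum state  st;
    const void *wchan;           /* address slept on; NULL if not asleep */
};

static struct proc procs[NPROC];
static int mutex_held;           /* toy mutex; its address is the wait channel */

/* Toy wakeup(): make every process sleeping on chan runnable. */
static int toy_wakeup(const void *chan)
{
    int n = 0;
    for (int i = 0; i < NPROC; i++) {
        if (procs[i].st == SLEEPING && procs[i].wchan == chan) {
            procs[i].st = RUNNABLE;
            procs[i].wchan = NULL;
            n++;
        }
    }
    return n;
}

/* Toy wakeone(): make at most one process sleeping on chan runnable. */
static int toy_wakeone(const void *chan)
{
    for (int i = 0; i < NPROC; i++) {
        if (procs[i].st == SLEEPING && procs[i].wchan == chan) {
            procs[i].st = RUNNABLE;
            procs[i].wchan = NULL;
            return 1;
        }
    }
    return 0;
}

/*
 * Let all NPROC processes pass through the mutex once, using the given
 * wake function on every release. Returns how many times a process was
 * scheduled, whether or not it managed to take the mutex.
 */
static long simulate(int (*wake)(const void *))
{
    long switches = 0;
    int done = 0;

    mutex_held = 0;
    for (int i = 0; i < NPROC; i++) {
        procs[i].st = SLEEPING;
        procs[i].wchan = &mutex_held;
    }

    while (done < NPROC) {
        int winners = 0;

        wake(&mutex_held);                 /* mutex was just released */
        for (int i = 0; i < NPROC; i++) {
            if (procs[i].st != RUNNABLE)
                continue;
            switches++;                    /* this process got to run */
            if (!mutex_held) {
                mutex_held = 1;            /* first one in takes the mutex */
                procs[i].st = DONE;
                done++;
                winners++;
            } else {
                procs[i].st = SLEEPING;    /* too late: back to sleep */
                procs[i].wchan = &mutex_held;
            }
        }
        assert(winners <= 1);              /* only one may pass per release */
        mutex_held = 0;                    /* winner leaves its critical section */
    }
    return switches;
}
```

In this model simulate(toy_wakeup) returns 5050 (100 + 99 + ... + 1,
since every release wakes all remaining sleepers and all but one go
straight back to sleep), while simulate(toy_wakeone) returns 100. A
real kernel is messier than this, but the quadratic-versus-linear
shape is the point.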
Gordon