Subject: netbsd machines get slow and hang, nfs suspected
To: None <current-users@NetBSD.ORG>
From: Eric Volpe <epv@panix.com>
List: current-users
Date: 03/22/1996 14:36:32
Hi! We have a weird problem, potentially NFS-related.
We have a bunch of machines (586's) running NetBSD-current and doing
essentially nothing but running Apache httpd and perl. They run the
executables off their local disk, but get the HTML that they're
serving via NFS from a server also running current.
These machines regularly become unresponsive: their load averages rise
to 50 or 60 times the "normal" level, and they barely even echo
keystrokes on the console. If you detach the ethernet, they start
responding again within a few seconds. When they are wedged, they
return pings, and inetd accepts connections immediately but never
manages to execute the appropriate program for them.
We suspect a couple of problems. One is the client-side NFS code in
NetBSD (interestingly, we've seen something like this happen on BSDi
also). The other is the NFS server for these machines, which is
handling a heavy load; sometimes when its client machines are dragging
like this, we see that many (100+) httpd processes are blocked waiting
on I/O, presumably unable to get a response from the NFS server.
However, at the same time, our sparcs, accessing the same server, have
no problem.
Vmstat output while this is going on is interesting, and included below.
Eric Volpe
-----------------------------vmstat output---------------------------
[At this point the problem exists, but I've unplugged the ethernet so
that I can talk to the machine.]
 procs       memory page                       faults       cpu
  r   b w   avm fre flt  re  pi  po fr  sr w0  in  sy    cs us sy  id
  0   0 0  9684 784  71  46  24  13  0  78 23 377 251   899  1  7  92
  0   2 0  9764 752   8   0   3   0  0   0  3 110  36    10  0  0 100
  0   1 0  9548 700   8   0  14   0  0   0  5 115  28    10  0  0 100
[At this point, I reconnect the ethernet. It takes a few seconds for it
to wake up...]
  0   1 0  9548 700   2   0   0   0  0   0  0 102  21     4  0  0 100
  0   1 0  9824 700   2   0   0   0  0   0  0 119  21     5  0  0 100
  0  26 0 16900 668  14   0   7   0  0   0 16 190  29   233  0  0 100
[Now, this gets weird. Why are those processes blocked now? The NFS
server is definitely up, so perhaps we can't talk to it? All of a
sudden the box is paging like crazy (basically, all disk I/Os are for
paging) but it's not doing a damn thing.]
  0 133 0 39408 336 137  23  95  16  0 132 60 345 138  1168  0  5  95
  0 144 0 40224 324  87  37  61  26  0 247 76 291 148   195  0  2  98
  0 148 0 40224 336  51 110  53  39  0 417 78 313  86  1089  0  6  93
  0 154 0 40224 340  66  69  48  47  0 252 80 317  76  1240  0  6  94
  0 139 0 39392 332  91  45 108  26  0 261 78 297 120   206  1  6  94
  0 135 0 39668 296  83  24  72  14  0 142 74 286 125   260  0  5  95
  0 141 0 39932 332  91   9  78   7  0  93 70 287 140   275  0  4  96
  0 148 0 40488 340  90  48  70  31  0 247 74 292 164   175  0  5  94
  0 164 0 40488 340  26  22  21  66  0 180 75 292  52  1062  0  4  96
  3 158 0 40164 320  99 127  40  55  0 269 76 301  88   332  0  4  96
  0 115 0 32984 352  98  73  70  43  0 237 71 469 153   761  0  2  97
  0 129 0 32720 340  48  68  41  47  0 262 76 299  87   433  0  4  96
 procs       memory page                       faults       cpu
  r   b w   avm fre flt  re  pi  po fr  sr w0  in  sy    cs us sy  id
  0 137 0 31484 316  65  52  53  42  0 279 75 308 110   200  0  4  96
  0 122 0 27248 324  32  56  27  55  0 179 74 304  57   591  0  2  98
  0 125 0 25992 340  43  42  10  73  0 142 78 305  60  1808  0  4  96
  1 165 0 43956 340  78 149  45  18  0 378 72 345 105  1928  0  6  94
  1 167 0 44424 344 158  22  75   4  0  71 68 399 126  3255  1  9  89
  0 267 0 44952 340 152  40  97  14  0 249 76 389 116  4576  0 15  85
  0 268 0 44952 340 123  70  93  38  0 217 75 575 127  7942  0 13  86
[10,000 context switches in one second??? And note that all of a sudden,
for a short time, over 100 httpds become runnable, whatever that
means; they don't seem to do anything. We see this happen several more
times.]
  2 238 0 45216 124 200 104 119  32  0 256 78 495 106 10263  1 19  80
  0 230 0 44808 124 169  52  77   4  0 122 78 394 191  6004  0 25  74
112 150 0 45092 244  62 171  52  66  0 327 80 338  92  8913  0 27  73
  0 208 0 44468 124  79  27  27  26  0  55 82 357  89  6862  0 20  80
  0 249 0 43928 124  37 126  32  52  0 224 77 281  38  8457  0 27  73
  0 245 0 43580 124 113  51  56  52  0 147 82 371 103  6816  0 24  76
  0 192 0 43988 124  91  64  22  14  0 116 77 325  73  5484  1 19  79
  0 259 0 43368 124  56 122  62  68  0 268 84 318  56  9186  0 28  72
  1 188 0 43716 124 100  23  13  30  0  54 74 304  93  4672  1 20  79
127 140 0 43740 124  52 108  68  81  0 265 82 294  40 10112  0 30  70
  1 268 0 44004 316  18 140  14  66  0 296 83 286  18 10925  0 25  75
 procs       memory page                       faults       cpu
  r   b w   avm fre flt  re  pi  po fr  sr w0  in  sy    cs us sy  id
  0 239 0 44196 124 136  23  43  15  0  40 76 380  87  5878  0 15  85
  0 270 0 44196 336  48 136  34  70  0 277 78 318  39  9637  0 31  69
  0 229 0 44196 124 121  23  42  16  0  33 77 343  63  7186  1 24  75
  0 229 0 44536 124  89  60  43  21  0 134 74 349  76  7129  0 17  83
  0 266 0 45368 124  78 141  51  71  0 278 78 361  43  9139  0 30  70
  0 202 0 45644 124 123  45  41  12  0  99 75 448 132  5929  0 18  82
  1 226 0 45644 124  55  74  38  72  0 171 75 355  60  8164  0 33  67
  0 255 0 45372 124  47 111  46  36  0 241 78 351  54  8802  0 28  72
  0 240 0 45636 124  99  48  36  57  0 121 81 426  85  7350  0 30  70
  0 244 0 46204 124  63 123  38  36  0 213 79 397  57  8685  0 21  79
  0 203 0 46204 124  89  44  24  52  0 105 78 372  80  6416  0 22  78
  1 201 0 46216 124  81  69  53  30  0 162 77 410 149  6066  0 28  71
  0 261 0 46500 124  30 118  46  59  0 242 76 381  38  9514  0 30  70
  1 192 0 46544 124  88  28  13  34  0  61 79 387  78  5465  0 20  80
  0 272 0 46616 292 132 202 116 101  0 557 77 825 111 17351  0 25  74
  0 225 0 46524 124 127  36  42  32  0  74 76 397 104  6516  0 23  76
  0 258 0 46540 124  41 124  51  51  0 249 76 399  33  9562  0 31  69
  0 191 0 46540 124  89  34  12  30  0  65 78 424  80  5169  0 24  76
 procs       memory page                       faults       cpu
  r   b w   avm fre flt  re  pi  po fr  sr w0  in  sy    cs us sy  id
  0 214 0 46540 124  60  69  54  39  0 178 75 379  57  5635  0 25  75
  0 260 0 47068 124  30 115  42  54  0 229 76 310  24  9079  0 30  70
 14 173 0 47068 124  93  40  17  37  0  75 75 305  58  5477  0 19  80
 89 148 0 47128 124  85  66  94  44  0 203 78 318  58  7289  0 24  75
  0 248 0 47280 124  24 120  17  53  0 210 79 295  29  9537  0 28  72
  0 213 0 47340 124 106  40  44  19  0 111 73 260  67  3973  0 13  87
  0 250 0 46976 124  50  47  47  71  0 162 80 268  34  7222  0 26  74
  0 225 0 47040 124  39 104  16  32  0 175 78 269  30  7255  0 23  77
  0 211 0 46776 124  80  53  51  24  0 138 76 259  65  4260  0 14  86
 85 148 0 46600 168  46  52  37  77  0 164 78 266  46  5585  0 22  78
  0 191 0 46600 124  64 113  23  45  0 189 77 332  84  5040  0 12  88
  0 192 0 46336 340  70  82  56  62  0 378 79 370  60  4107  0 11  89
  0 181 0 46072 336  50  86  34  52  0 346 84 281  56  2266  0  4  96
  1 130 0 45360 296  64  80  51  11  0 264 73 308  69   671  0  5  95
  1 112 0 40788 340  57  41  45  25  0 114 60 256  70    70  0  2  98
  1  97 0 38084 272  23  25  16  11  0  46 26 174 115    36  0  1  98
  1  97 0 37216 272   2   0   0   0  0   0  0 110  52     8  0  0 100
 procs       memory page                       faults       cpu
  r   b w   avm fre flt  re  pi  po fr  sr w0  in  sy    cs us sy  id
  0  80 0 35336 328   7  10   4   6  0  26 28  54  29    14  0  5  95
[etc., ad nauseam, until we reboot...]
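
For anyone who would rather not eyeball every row of a vmstat log like
this one, here is a minimal Python sketch (not part of the original
report) that pulls out the figures the comments focus on: blocked
processes, context switches, and idle CPU. The names parse_vmstat,
summarize, and cpu_sy, and the 5000 switches-per-sample cutoff, are all
assumptions made for illustration; the column order is taken from the
header line in the output itself.

```python
# Column names for the 18 numeric fields of each vmstat data row.
# "cpu_sy" is a made-up name for the second "sy" column (system CPU %),
# used only to keep it distinct from the syscall-rate "sy" column.
COLUMNS = ["r", "b", "w", "avm", "fre", "flt", "re", "pi", "po",
           "fr", "sr", "w0", "in", "sy", "cs", "us", "cpu_sy", "id"]


def parse_vmstat(text):
    """Return one dict per data row; skip headers and bracketed notes."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row is exactly 18 whitespace-separated integers.
        if len(fields) == len(COLUMNS) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(COLUMNS, (int(f) for f in fields))))
    return rows


def summarize(rows, cs_threshold=5000):
    """Summarize the parsed rows; cs_threshold is an arbitrary cutoff."""
    if not rows:
        return None
    return {
        "samples": len(rows),
        "peak_blocked": max(r["b"] for r in rows),
        "peak_cs": max(r["cs"] for r in rows),
        "busy_samples": sum(1 for r in rows if r["cs"] >= cs_threshold),
        "min_idle": min(r["id"] for r in rows),
    }


if __name__ == "__main__":
    # Three rows copied from the log, plus a header and a note to skip.
    sample = """\
r b w avm fre flt re pi po fr sr w0 in sy cs us sy id
0 1 0 9548 700 2 0 0 0 0 0 0 102 21 4 0 0 100
[At this point, I reconnect the ethernet.]
2 238 0 45216 124 200 104 119 32 0 256 78 495 106 10263 1 19 80
0 272 0 46616 292 132 202 116 101 0 557 77 825 111 17351 0 25 74
"""
    print(summarize(parse_vmstat(sample)))
```

Run against the whole log, a summary like this makes the odd part easy
to see at a glance: very high context-switch counts and hundreds of
blocked processes while the CPU is still mostly idle.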