Subject: netbsd machines get slow and hang, nfs suspected
To: None <current-users@NetBSD.ORG>
From: Eric Volpe <epv@panix.com>
List: current-users
Date: 03/22/1996 14:36:32
Hi! We have a weird problem, potentially NFS-related. 

We have a bunch of machines (586's) running NetBSD-current and doing
essentially nothing but running apache httpd and perl. They run the
executables off their local disk, but get the html that they're
serving via NFS from a server that is also running -current.

These machines regularly become unresponsive: their load averages rise
to 50 or 60 times the "normal" level, and they barely even echo
keystrokes on the console.  If you detach the ethernet, they start
responding again within a few seconds.  While they are wedged, they
return pings and inetd accepts connections immediately, but it never
manages to execute the appropriate program for them.
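
For anyone who wants to check for the same symptom from another host,
here is a rough sketch of a probe that separates "the connection was
accepted" from "the service actually said something". It is not
something we run in production; the function name probe, the timeout,
and the choice of host and port are all assumptions, and it only makes
sense against a service that sends a banner on connect.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return (connect_seconds, reply_seconds or None).

    reply_seconds is None when the connection was accepted but no data
    arrived within the timeout, or the peer closed without sending.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as s:
        connected = time.monotonic() - start
        s.settimeout(timeout)
        try:
            data = s.recv(1)  # wait for the first byte of a banner
        except socket.timeout:
            return connected, None
        if not data:
            return connected, None
        return connected, time.monotonic() - start
```

A fast connect time together with a reply of None would match what we
see on a wedged machine: the connection is accepted, but the program
behind it never runs.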

We suspect a couple of problems. One is the client-side NFS code in
NetBSD (interestingly, we've seen something like this happen on BSDi
also). The other is the NFS server these machines use, which is
handling a heavy load: sometimes when its client machines are dragging
like this, we see that many (100+) httpd processes are blocked waiting
on I/O, presumably unable to get a response from the NFS server.
However, at the same time, our sparcs, accessing the same server, have
no problem.

Vmstat output while this is going on is interesting, and included below.


					Eric Volpe
					

-----------------------------vmstat output---------------------------
[At this point the problem exists, but I've unplugged the ethernet and
so I can talk to it.]
 procs   memory     page                         faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0   in   sy  cs us sy  id
 0 0 0  9684   784   71  46  24  13   0  78 23  377  251 899  1  7  92
 0 2 0  9764   752    8   0   3   0   0   0  3  110   36  10  0  0 100
 0 1 0  9548   700    8   0  14   0   0   0  5  115   28  10  0  0 100

[At this point, I reconnect the ethernet. It takes a few seconds for it
to wake up...]

 0 1 0  9548   700    2   0   0   0   0   0  0  102   21   4  0  0 100
 0 1 0  9824   700    2   0   0   0   0   0  0  119   21   5  0  0 100
 0 26 0 16900   668   14   0   7   0   0   0 16  190   29 233  0  0 100

[Now, this gets weird. Why are those processes blocked now? The NFS
server is definitely up, so perhaps we can't talk to the NFS server?
All of a sudden the box is paging like crazy (basically, all disk IOs
are for paging) but it's not doing a damn thing.]

 0 133 0 39408   336  137  23  95  16   0 132 60  345  138 1168  0  5  95
 0 144 0 40224   324   87  37  61  26   0 247 76  291  148 195  0  2  98
 0 148 0 40224   336   51 110  53  39   0 417 78  313   86 1089  0  6  93
 0 154 0 40224   340   66  69  48  47   0 252 80  317   76 1240  0  6  94
 0 139 0 39392   332   91  45 108  26   0 261 78  297  120 206  1  6  94
 0 135 0 39668   296   83  24  72  14   0 142 74  286  125 260  0  5  95
 0 141 0 39932   332   91   9  78   7   0  93 70  287  140 275  0  4  96
 0 148 0 40488   340   90  48  70  31   0 247 74  292  164 175  0  5  94
 0 164 0 40488   340   26  22  21  66   0 180 75  292   52 1062  0  4  96
 3 158 0 40164   320   99 127  40  55   0 269 76  301   88 332  0  4  96
 0 115 0 32984   352   98  73  70  43   0 237 71  469  153 761  0  2  97
 0 129 0 32720   340   48  68  41  47   0 262 76  299   87 433  0  4  96
 procs   memory     page                         faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0   in   sy  cs us sy  id
 0 137 0 31484   316   65  52  53  42   0 279 75  308  110 200  0  4  96
 0 122 0 27248   324   32  56  27  55   0 179 74  304   57 591  0  2  98
 0 125 0 25992   340   43  42  10  73   0 142 78  305   60 1808  0  4  96
 1 165 0 43956   340   78 149  45  18   0 378 72  345  105 1928  0  6  94
 1 167 0 44424   344  158  22  75   4   0  71 68  399  126 3255  1  9  89
 0 267 0 44952   340  152  40  97  14   0 249 76  389  116 4576  0 15  85
 0 268 0 44952   340  123  70  93  38   0 217 75  575  127 7942  0 13  86

[10,000 context switches in one second??? And note that all of a sudden,
for a short time, over 100 httpds become runnable. Whatever that means,
they don't seem to do anything. We see this happen several more times.]
 2 238 0 45216   124  200 104 119  32   0 256 78  495  106 10263  1 19  80
 0 230 0 44808   124  169  52  77   4   0 122 78  394  191 6004  0 25  74
 112 150 0 45092   244   62 171  52  66   0 327 80  338   92 8913  0 27  73
 0 208 0 44468   124   79  27  27  26   0  55 82  357   89 6862  0 20  80
 0 249 0 43928   124   37 126  32  52   0 224 77  281   38 8457  0 27  73
 0 245 0 43580   124  113  51  56  52   0 147 82  371  103 6816  0 24  76
 0 192 0 43988   124   91  64  22  14   0 116 77  325   73 5484  1 19  79
 0 259 0 43368   124   56 122  62  68   0 268 84  318   56 9186  0 28  72
 1 188 0 43716   124  100  23  13  30   0  54 74  304   93 4672  1 20  79
 127 140 0 43740   124   52 108  68  81   0 265 82  294   40 10112  0 30  70
 1 268 0 44004   316   18 140  14  66   0 296 83  286   18 10925  0 25  75
 procs   memory     page                         faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0   in   sy  cs us sy  id
 0 239 0 44196   124  136  23  43  15   0  40 76  380   87 5878  0 15  85
 0 270 0 44196   336   48 136  34  70   0 277 78  318   39 9637  0 31  69
 0 229 0 44196   124  121  23  42  16   0  33 77  343   63 7186  1 24  75
 0 229 0 44536   124   89  60  43  21   0 134 74  349   76 7129  0 17  83
 0 266 0 45368   124   78 141  51  71   0 278 78  361   43 9139  0 30  70
 0 202 0 45644   124  123  45  41  12   0  99 75  448  132 5929  0 18  82
 1 226 0 45644   124   55  74  38  72   0 171 75  355   60 8164  0 33  67
 0 255 0 45372   124   47 111  46  36   0 241 78  351   54 8802  0 28  72
 0 240 0 45636   124   99  48  36  57   0 121 81  426   85 7350  0 30  70
 0 244 0 46204   124   63 123  38  36   0 213 79  397   57 8685  0 21  79
 0 203 0 46204   124   89  44  24  52   0 105 78  372   80 6416  0 22  78
 1 201 0 46216   124   81  69  53  30   0 162 77  410  149 6066  0 28  71
 0 261 0 46500   124   30 118  46  59   0 242 76  381   38 9514  0 30  70
 1 192 0 46544   124   88  28  13  34   0  61 79  387   78 5465  0 20  80
 0 272 0 46616   292  132 202 116 101   0 557 77  825  111 17351  0 25  74
 0 225 0 46524   124  127  36  42  32   0  74 76  397  104 6516  0 23  76
 0 258 0 46540   124   41 124  51  51   0 249 76  399   33 9562  0 31  69
 0 191 0 46540   124   89  34  12  30   0  65 78  424   80 5169  0 24  76
 procs   memory     page                         faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0   in   sy  cs us sy  id
 0 214 0 46540   124   60  69  54  39   0 178 75  379   57 5635  0 25  75
 0 260 0 47068   124   30 115  42  54   0 229 76  310   24 9079  0 30  70
 14 173 0 47068   124   93  40  17  37   0  75 75  305   58 5477  0 19  80
 89 148 0 47128   124   85  66  94  44   0 203 78  318   58 7289  0 24  75
 0 248 0 47280   124   24 120  17  53   0 210 79  295   29 9537  0 28  72
 0 213 0 47340   124  106  40  44  19   0 111 73  260   67 3973  0 13  87
 0 250 0 46976   124   50  47  47  71   0 162 80  268   34 7222  0 26  74
 0 225 0 47040   124   39 104  16  32   0 175 78  269   30 7255  0 23  77
 0 211 0 46776   124   80  53  51  24   0 138 76  259   65 4260  0 14  86
 85 148 0 46600   168   46  52  37  77   0 164 78  266   46 5585  0 22  78
 0 191 0 46600   124   64 113  23  45   0 189 77  332   84 5040  0 12  88
 0 192 0 46336   340   70  82  56  62   0 378 79  370   60 4107  0 11  89
 0 181 0 46072   336   50  86  34  52   0 346 84  281   56 2266  0  4  96
 1 130 0 45360   296   64  80  51  11   0 264 73  308   69 671  0  5  95
 1 112 0 40788   340   57  41  45  25   0 114 60  256   70  70  0  2  98
 1 97 0 38084   272   23  25  16  11   0  46 26  174  115  36  0  1  98
 1 97 0 37216   272    2   0   0   0   0   0  0  110   52   8  0  0 100
 procs   memory     page                         faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0   in   sy  cs us sy  id
 0 80 0 35336   328    7  10   4   6   0  26 28   54   29  14  0  5  95
[etc., ad nauseam, until we reboot...]
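
To sift a longer capture like the one above, a small script along these
lines might help. It is only a sketch: the column names come from the
vmstat header shown above (the second "sy" is renamed sy_cpu here to
keep it distinct), while the function names and the thresholds
(min_blocked, min_cs, min_idle) are arbitrary assumptions picked to
match the rows that looked odd to us.

```python
# Column order as printed in the vmstat header above; "sy" occurs twice
# (system calls, then system CPU), so the second one is renamed.
COLUMNS = ["r", "b", "w", "avm", "fre", "flt", "re", "pi", "po",
           "fr", "sr", "w0", "in", "sy", "cs", "us", "sy_cpu", "id"]


def parse_vmstat(text):
    """Turn vmstat data lines into dicts, skipping headers and comments."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        if not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def suspicious(rows, min_blocked=100, min_cs=5000, min_idle=50):
    """Rows with many blocked processes and heavy context switching
    while the CPU is still mostly idle (thresholds are guesses)."""
    return [r for r in rows
            if r["b"] >= min_blocked
            and r["cs"] >= min_cs
            and r["id"] >= min_idle]


SAMPLE = """\
 procs   memory     page                         faults      cpu
 r b w   avm   fre  flt  re  pi  po  fr  sr w0   in   sy  cs us sy  id
 0 1 0  9548   700    2   0   0   0   0   0  0  102   21   4  0  0 100
 0 133 0 39408   336  137  23  95  16   0 132 60  345  138 1168  0  5  95
 2 238 0 45216   124  200 104 119  32   0 256 78  495  106 10263  1 19  80
"""

if __name__ == "__main__":
    for row in suspicious(parse_vmstat(SAMPLE)):
        print(row["b"], row["cs"], row["id"])  # prints: 238 10263 80
```

The three sample rows are copied from the output above; only the last
one trips all three thresholds.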