tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Problems with hangs under NetBSD-5.x



Hi,

I see similar hangs on a NetBSD 5.0 Web server.  Initially I thought they were 
related to WAPBL -- so I disabled that.  Last night, however, the system was 
hung again.  I now believe my hangs may be the same ones that Brian are 
seeing.

Does this warrant a PR?  The information is pretty limited.  I've attached the 
output from ps.

See also comments below.

Thanks,
Sverre

On Wed July 15 2009 14:21:40 Brian Buhrow wrote:
>       Hello.  Under NetBSD-5, I've been seeing weird hangs on one of my
> production systems.
> Here's what I'm seeing:
>
>       What I see is that the machines kernel runs fine.  Processes which
> don't write to the filesystem run fine.  Processes which write to the
> filesystem, either directly or indirectly via something like syslog
> eventually hang.  I'm not sure where they hang, because I don't have ddb in
> this kernel and I can't let the machine sit around when it happens so I
> can't get it into ddb and look.  I thought it might be the lockf problem
> which was recently fixed up in the NetBSD-5 branch, but I've got those
> fixes, and the problem still occurs.

I can access (tested using ls) all file systems except one.  Processes that 
attempt access to the "hung" file system end up in state "D" with wchan 
"tstile".

> Here are some data points:
>
> 1.  My system is a NetBSD-5 system with netbsd-5 sources as of July 10,
> 2009.

My system is compiled from the netbsd-5-0 branch in May.  Are there changes 
since then that might fix this?

> 2.  I have a number of machines running this code in production without a
> problem.
>
> 3.  this particular machine is doing a lot of web serving.

Same here.

> 4.  It is a pair of hyperthreaded Intel cpus, so it shows up as 4 cpus, but
> it's actually only 2.

Same here.

> 5.  I'm not using WAPBL, in fact, I don't have it built in my kernel.  I
> am,however, using ffs with softdep.

No softdep and I've now seen this with and without WAPBL.

> 6.  Processes which don't write to the filesystem, either directly or
> indirectly such as via syslog, do not hang.  Only processes which touch the
> filesystem hang.

Looks like reads are hanging too -- but perhaps it hangs on the meta data 
update (atime?).  E.g., /bin/ls hangs with no output.

> 7. On my system, it takes about 18-21 hours of normal operation for the
> problem to show up.

It takes much longer here.  Last time it was OK for a month.

> 8.  I'm only seeing this behavior on one system.  One difference between
> it an the others running 5.x is that it has more logical CPUs than the
> others.

Did you attempt to turn off hyperthreading for this system before downgrading 
to NetBSD 4?

>       I've seen some reports that there may be some issues with WAPBL, but
> since I don't have that as an option in my kernel, I don't think the
> problem is simply a WAPBL issue.

I have seen it with and without WAPBL.

> Here is the dmesg for the afflicted machine, for what it's worth.
>
>       Unfortunately, this is a production machine, so I can't do a lot of
> testing with it.  In fact, I hav to drop it back to 4.x today because it's
> beginning to impact our operations.  Under 4.x it ran continuously without
> a hitch for over a year.
>
>       Any thoughts or ideas on how to get to the bottom of this would be
> greatly appreciated.

Same here.

<dmesg output deleted>
USER      PID %CPU %MEM    VSZ   RSS TTY    STAT STARTED      TIME COMMAND UID  
 PID  PPID CPU PRI NI    VSZ   RSS WCHAN   STAT TTY         TIME COMMAND
root        0  0.0  1.5      0 31828 ?      OKl  25Jun09 173:31.56 [system   0  
   0     0   0   0  0      0 31828 -       OKl  ?      173:31.56 [system
root        1  0.0  0.0   2932   784 ?      Is   25Jun09   0:00.02 init      0  
   1     0  43  85  0   2932   784 wait    Is   ?        0:00.02 init   
root      125  0.0  0.0   2932   852 ?      Ss   25Jun09   0:04.01 /usr/sb   0  
 125     1   0  85  0   2932   852 kqueue  Ss   ?        0:04.01 /usr/sb
root      238  0.0  0.3   5476  5672 ?      Ds   25Jun09   1:29.55 /usr/sb   0  
 238     1   0  85  0   5476  5672 vm_map  Ds   ?        1:29.55 /usr/sb
root      255  0.0  0.1   5936  1592 ?      Is   25Jun09   0:00.01 /usr/sb   0  
 255     1   0  85  0   5936  1592 select  Is   ?        0:00.01 /usr/sb
root      286  0.0  0.0   2900   916 ?      Is   25Jun09   0:04.78 /usr/sb   0  
 286     1   0  85  0   2900   916 nanoslp Is   ?        0:04.78 /usr/sb
root      341  0.0  0.1   4796  1524 ?      Ds   25Jun09   0:11.09 /usr/li   0  
 341     1   0  85  0   4796  1524 vm_map  Ds   ?        0:11.09 /usr/li
postfix   347  0.0  0.1   4796  1716 ?      I    25Jun09   0:04.43 qmgr -l  12  
 347   341   0  85  0   4796  1716 kqueue  I    ?        0:04.43 qmgr -l
root      370  0.0  0.0   2972   980 ?      Ss   25Jun09   0:00.04 /usr/sb   0  
 370     1   0  85  0   2972   980 kqueue  Ss   ?        0:00.04 /usr/sb
postfix  2910  0.0  0.0      0     0 ?      ZW         -   0:00.00 (pickup  12  
2910   341   0   0  0      0     0 -       ZW   ?        0:00.00 (pickup
www      5046  0.0  0.2  22936  3864 ?      D     3:18AM   0:00.00 /usr/pk  99  
5046 13735   0 117  0  22936  3864 tstile  D    ?        0:00.00 /usr/pk
www      5261  0.0  0.2  22936  3904 ?      D     3:18AM   0:00.00 /usr/pk  99  
5261 13735   0 117  0  22936  3904 tstile  D    ?        0:00.00 /usr/pk
www      5869  0.0  0.1  17476  2512 ?      I    Sun12AM   0:00.00 /usr/pk  99  
5869 13735   0  85  0  17476  2512 socket  I    ?        0:00.00 /usr/pk
www      7013  0.0  0.2  22936  4636 ?      D    10:05PM   0:02.02 /usr/pk  99  
7013 13735   0 117  0  22936  4636 tstile  D    ?        0:02.02 /usr/pk
www      7113  0.0  0.2  22936  3876 ?      D     3:16AM   0:00.00 /usr/pk  99  
7113 13735   0 117  0  22936  3876 tstile  D    ?        0:00.00 /usr/pk
root     7351  0.0  0.0   2824   672 ?      I     3:15AM   0:00.00 tee /va   0  
7351 25824   0  85  0   2824   672 piperd  I    ?        0:00.00 tee /va
www      7618  0.0  0.2  22936  4112 ?      D     3:15AM   0:00.01 /usr/pk  99  
7618 13735   0 117  0  22936  4112 tstile  D    ?        0:00.01 /usr/pk
www      7695  0.0  0.8  33176 16864 ?      D     2:09AM   0:00.65 /usr/pk  99  
7695 13735   0 117  0  33176 16864 tstile  D    ?        0:00.65 /usr/pk
www      8361  0.0  0.2  22936  3856 ?      D     3:16AM   0:00.00 /usr/pk  99  
8361 13735   0 117  0  22936  3856 tstile  D    ?        0:00.00 /usr/pk
root     8763  0.0  0.1   4796  1808 ?      I     3:17AM   0:00.01 postdro   0  
8763 21466   0  85  0   4796  1808 netio   I    ?        0:00.01 postdro
www     10348  0.0  0.2  22936  3896 ?      D     3:17AM   0:00.00 /usr/pk  99 
10348 13735   0 117  0  22936  3896 tstile  D    ?        0:00.00 /usr/pk
root    11255  0.0  0.5  20248 10388 ?      Ss   30Jun09   0:45.02 httpd:    0 
11255     1   0  85  0  20248 10388 select  Ss   ?        0:45.02 httpd: 
www     11295  0.0  0.2  22936  3900 ?      D     3:18AM   0:00.00 /usr/pk  99 
11295 13735   0 117  0  22936  3900 tstile  D    ?        0:00.00 /usr/pk
www     12468  0.0  0.2  22936  3880 ?      D     3:16AM   0:00.00 /usr/pk  99 
12468 13735   0 117  0  22936  3880 tstile  D    ?        0:00.00 /usr/pk
www     12944  0.0  0.2  22936  3876 ?      D     3:18AM   0:00.00 /usr/pk  99 
12944 13735   0 117  0  22936  3876 tstile  D    ?        0:00.00 /usr/pk
root    13735  0.0  0.5  22936 11280 ?      Ss   30Jun09   0:51.75 /usr/pk   0 
13735     1   0  85  0  22936 11280 select  Ss   ?        0:51.75 /usr/pk
www     14441  0.0  0.2  22936  3844 ?      D     3:18AM   0:00.00 /usr/pk  99 
14441 13735   0 117  0  22936  3844 tstile  D    ?        0:00.00 /usr/pk
root    15076  0.0  0.4  16152  8512 ?      Ss   30Jun09   0:43.58 httpd:    0 
15076     1   0  85  0  16152  8512 select  Ss   ?        0:43.58 httpd: 
www     15452  0.0  1.4  40444 30096 ?      I    12:04AM   0:16.53 httpd:   99 
15452 15076   0  85  0  40444 30096 socket  I    ?        0:16.53 httpd: 
www     16020  0.0  0.2  22936  3880 ?      D     3:17AM   0:00.00 /usr/pk  99 
16020 13735   0 117  0  22936  3880 tstile  D    ?        0:00.00 /usr/pk
www     16837  0.0  0.1  20248  2560 ?      I    Sun12AM   0:00.00 httpd:   99 
16837 11255   0  85  0  20248  2560 socket  I    ?        0:00.00 httpd: 
www     16847  0.0  0.2  22936  3864 ?      D     3:19AM   0:00.00 /usr/pk  99 
16847 13735   0 117  0  22936  3864 tstile  D    ?        0:00.00 /usr/pk
www     17100  0.0  0.2  22936  3876 ?      D     3:16AM   0:00.00 /usr/pk  99 
17100 13735   0 117  0  22936  3876 tstile  D    ?        0:00.00 /usr/pk
www     17472  0.0  0.2  22936  3844 ?      D     3:18AM   0:00.00 /usr/pk  99 
17472 13735   0 117  0  22936  3844 tstile  D    ?        0:00.00 /usr/pk
www     17557  0.0  1.2  36404 25148 ?      I     3:05AM   0:11.10 httpd:   99 
17557 11255   0  85  0  36404 25148 socket  I    ?        0:11.10 httpd: 
www     17633  0.0  0.1  16152  2532 ?      I    Sun12AM   0:00.00 httpd:   99 
17633 15076   0  85  0  16152  2532 socket  I    ?        0:00.00 httpd: 
www     17727  0.0  0.2  22936  3844 ?      D     3:17AM   0:00.00 /usr/pk  99 
17727 13735   0 117  0  22936  3844 tstile  D    ?        0:00.00 /usr/pk
root    18172  0.0  0.0   2948   992 ?      Is    3:15AM   0:00.00 /bin/sh   0 
18172 21446   0  85  0   2948   992 wait    Is   ?        0:00.00 /bin/sh
www     18578  0.0  0.2  22936  4116 ?      D     3:15AM   0:00.01 /usr/pk  99 
18578 13735   0 117  0  22936  4116 tstile  D    ?        0:00.01 /usr/pk
www     18715  0.0  0.2  22936  3892 ?      D     3:15AM   0:00.01 /usr/pk  99 
18715 13735   0 117  0  22936  3892 tstile  D    ?        0:00.01 /usr/pk
www     18717  0.0  0.2  22936  3856 ?      D     3:15AM   0:00.01 /usr/pk  99 
18717 13735   0 117  0  22936  3856 tstile  D    ?        0:00.01 /usr/pk
root    18937  0.0  0.0   2948   996 ?      Ss    5:47AM   0:00.00 sh -c p   0 
18937 22389   0  85  0   2948   996 wait    Ss   ?        0:00.00 sh -c p
www     19024  0.0  0.2  22936  3852 ?      D     3:16AM   0:00.00 /usr/pk  99 
19024 13735   0 117  0  22936  3852 tstile  D    ?        0:00.00 /usr/pk
root    19154  0.0  0.1   5556  2388 ?      Ds    5:00AM   0:00.02 ftpd: v   0 
19154   370   0 117  0   5556  2388 tstile  Ds   ?        0:00.02 ftpd: v
www     19208  0.0  0.2  22936  3844 ?      D     3:15AM   0:00.00 /usr/pk  99 
19208 13735   0 117  0  22936  3844 tstile  D    ?        0:00.00 /usr/pk
www     19389  0.0  0.2  22936  3840 ?      D     3:16AM   0:00.00 /usr/pk  99 
19389 13735   0 117  0  22936  3840 tstile  D    ?        0:00.00 /usr/pk
root    19481  0.0  0.4  28820  8244 ?      D     3:15AM   0:00.43 /usr/lo   0 
19481 24007   0 117  0  28820  8244 tstile  D    ?        0:00.43 /usr/lo
root    19606  0.0  0.1   6260  1700 ?      D     4:15AM   0:00.00 /usr/lo   0 
19606 25934   0 117  0   6260  1700 tstile  D    ?        0:00.00 /usr/lo
www     19777  0.0  1.1  36404 22408 ?      I     3:11AM   0:05.43 httpd:   99 
19777 11255   0  85  0  36404 22408 socket  I    ?        0:05.43 httpd: 
root    19818  0.0  0.0   2900  1036 ?      I     4:15AM   0:00.00 cron: r   0 
19818   286   0  85  0   2900  1036 piperd  I    ?        0:00.00 cron: r
www     19889  0.0  1.1  33260 23132 ?      D     2:45AM   0:06.37 httpd:   99 
19889 15076   0 117  0  33260 23132 tstile  D    ?        0:06.37 httpd: 
www     19897  0.0  1.8  46644 37232 ?      I     2:55AM   0:46.55 httpd:   99 
19897 11255   0  85  0  46644 37232 socket  I    ?        0:46.55 httpd: 
www     20584  0.0  0.2  22936  3844 ?      D     3:15AM   0:00.01 /usr/pk  99 
20584 13735   0 117  0  22936  3844 tstile  D    ?        0:00.01 /usr/pk
www     20646  0.0  0.2  22936  3844 ?      D     3:15AM   0:00.01 /usr/pk  99 
20646 13735   0 117  0  22936  3844 tstile  D    ?        0:00.01 /usr/pk
www     20661  0.0  0.2  22936  3888 ?      D     3:18AM   0:00.00 /usr/pk  99 
20661 13735   0 117  0  22936  3888 tstile  D    ?        0:00.00 /usr/pk
www     21336  0.0  0.8  32152 16184 ?      D     2:09AM   0:00.40 /usr/pk  99 
21336 13735   0 117  0  32152 16184 tstile  D    ?        0:00.40 /usr/pk
www     21379  0.0  0.2  22936  3884 ?      D     3:18AM   0:00.00 /usr/pk  99 
21379 13735   0 117  0  22936  3884 tstile  D    ?        0:00.00 /usr/pk
root    21446  0.0  0.1   2900  1072 ?      I     3:15AM   0:00.00 cron: r   0 
21446   286   0  85  0   2900  1072 piperd  I    ?        0:00.00 cron: r
root    21466  0.0  0.1   4796  1828 ?      I     3:17AM   0:00.01 sendmai   0 
21466 21446   0  85  0   4796  1828 piperd  I    ?        0:00.01 sendmai
www     21638  0.0  0.2  16152  3264 ?      I     3:15AM   0:00.00 httpd:   99 
21638 15076   0  85  0  16152  3264 socket  I    ?        0:00.00 httpd: 
www     21880  0.0  0.2  22936  3860 ?      D     3:18AM   0:00.00 /usr/pk  99 
21880 13735   0 117  0  22936  3860 tstile  D    ?        0:00.00 /usr/pk
www     22139  0.0  0.2  22936  3884 ?      D     3:15AM   0:00.01 /usr/pk  99 
22139 13735   0 117  0  22936  3884 tstile  D    ?        0:00.01 /usr/pk
www     22289  0.0  1.0  36248 20244 ?      D     5:51AM   0:01.32 /usr/pk  99 
22289 13735   0 117  0  36248 20244 vm_map  D    ?        0:01.32 /usr/pk
www     22295  0.0  0.2  22936  3864 ?      D     3:19AM   0:00.00 /usr/pk  99 
22295 13735   0 117  0  22936  3864 tstile  D    ?        0:00.00 /usr/pk
www     22318  0.0  0.2  22936  3880 ?      D     3:15AM   0:00.01 /usr/pk  99 
22318 13735   0 117  0  22936  3880 tstile  D    ?        0:00.01 /usr/pk
root    22389  0.0  0.1   3396  1320 ?      S     5:47AM   0:00.01 rshd -L   0 
22389   370   0  85  0   3396  1320 select  S    ?        0:00.01 rshd -L
www     22442  0.0  0.2  22936  3896 ?      D     3:15AM   0:00.01 /usr/pk  99 
22442 13735   0 117  0  22936  3896 tstile  D    ?        0:00.01 /usr/pk
root    22511  0.0  0.1   2948  1112 ?      I     3:15AM   0:00.01 /bin/sh   0 
22511 25824   0  85  0   2948  1112 wait    I    ?        0:00.01 /bin/sh
www     22995  0.0  0.2  22936  3856 ?      D     3:16AM   0:00.00 /usr/pk  99 
22995 13735   0 117  0  22936  3856 tstile  D    ?        0:00.00 /usr/pk
root    23143  0.0  0.1   4796  1840 ?      I     3:15AM   0:00.01 sendmai   0 
23143 25824   0  85  0   4796  1840 piperd  I    ?        0:00.01 sendmai
www     23158  0.0  0.2  22936  3844 ?      D     3:17AM   0:00.00 /usr/pk  99 
23158 13735   0 117  0  22936  3844 tstile  D    ?        0:00.00 /usr/pk
root    23412  0.0  0.0   2900  1036 ?      I     3:15AM   0:00.00 cron: r   0 
23412   286   0  85  0   2900  1036 piperd  I    ?        0:00.00 cron: r
www     23723  0.0  0.2  22936  3864 ?      D     3:18AM   0:00.00 /usr/pk  99 
23723 13735   0 117  0  22936  3864 tstile  D    ?        0:00.00 /usr/pk
root    23815  0.0  0.0   2952   848 ?      O     5:47AM   0:00.00 ps -axu   0 
23815 18937   0  43  0   2952   848 -       O    ?        0:00.00 ps -axu
www     23859  0.0  0.2  22936  3900 ?      D     3:17AM   0:00.00 /usr/pk  99 
23859 13735   0 117  0  22936  3900 tstile  D    ?        0:00.00 /usr/pk
root    24007  0.0  0.0   2948  1000 ?      I     3:15AM   0:00.00 /bin/sh   0 
24007 18172   0  85  0   2948  1000 wait    I    ?        0:00.00 /bin/sh
www     24092  0.0  0.2  22936  3860 ?      D     3:18AM   0:00.00 /usr/pk  99 
24092 13735   0 117  0  22936  3860 tstile  D    ?        0:00.00 /usr/pk
www     24211  0.0  0.2  22936  3848 ?      D     3:18AM   0:00.00 /usr/pk  99 
24211 13735   0 117  0  22936  3848 tstile  D    ?        0:00.00 /usr/pk
www     24285  0.0  0.2  22936  3840 ?      D     3:16AM   0:00.00 /usr/pk  99 
24285 13735   0 117  0  22936  3840 tstile  D    ?        0:00.00 /usr/pk
www     24290  0.0  0.2  22936  3872 ?      D     3:18AM   0:00.00 /usr/pk  99 
24290 13735   0 117  0  22936  3872 tstile  D    ?        0:00.00 /usr/pk
www     24609  0.0  0.2  22936  3848 ?      D     3:18AM   0:00.00 /usr/pk  99 
24609 13735   0 117  0  22936  3848 tstile  D    ?        0:00.00 /usr/pk
www     24648  0.0  0.2  22936  3864 ?      D     3:19AM   0:00.00 /usr/pk  99 
24648 13735   0 117  0  22936  3864 tstile  D    ?        0:00.00 /usr/pk
www     25610  0.0  0.2  22936  3856 ?      D     3:16AM   0:00.00 /usr/pk  99 
25610 13735   0 117  0  22936  3856 tstile  D    ?        0:00.00 /usr/pk
www     25624  0.0  0.2  22936  3884 ?      D     3:18AM   0:00.00 /usr/pk  99 
25624 13735   0 117  0  22936  3884 tstile  D    ?        0:00.00 /usr/pk
www     25803  0.0  0.2  22936  4508 ?      D     2:46AM   0:00.05 /usr/pk  99 
25803 13735   0 117  0  22936  4508 tstile  D    ?        0:00.05 /usr/pk
root    25824  0.0  0.0   2948  1028 ?      Is    3:15AM   0:00.00 /bin/sh   0 
25824 23412   0  85  0   2948  1028 wait    Is   ?        0:00.00 /bin/sh
root    25934  0.0  0.0   2948   992 ?      Is    4:15AM   0:00.00 /bin/sh   0 
25934 19818   0  85  0   2948   992 wait    Is   ?        0:00.00 /bin/sh
root    26352  0.0  0.0   2896   980 ?      D     3:15AM   0:00.75 find /    0 
26352 22511   0 117  0   2896   980 tstile  D    ?        0:00.75 find / 
www     26664  0.0  0.2  22936  3880 ?      D     3:17AM   0:00.00 /usr/pk  99 
26664 13735   0 117  0  22936  3880 tstile  D    ?        0:00.00 /usr/pk
root    26847  0.0  0.1   4796  1808 ?      I     3:15AM   0:00.01 postdro   0 
26847 23143   0  85  0   4796  1808 netio   I    ?        0:00.01 postdro
www     27657  0.0  0.2  22936  3860 ?      D     3:17AM   0:00.00 /usr/pk  99 
27657 13735   0 117  0  22936  3860 tstile  D    ?        0:00.00 /usr/pk
www     27802  0.0  1.0  38296 21644 ?      D     2:12AM   0:00.73 /usr/pk  99 
27802 13735   0 117  0  38296 21644 tstile  D    ?        0:00.73 /usr/pk
root      369  0.0  0.0   2948   876 ttyE0- I    25Jun09   0:00.01 /bin/sh   0  
 369     1   0  85  0   2948   876 wait    I    ttyE0-   0:00.01 /bin/sh
mysql     429  0.0  1.9 124556 38976 ttyE0- Il   25Jun09 631:30.41 /usr/pk  98  
 429   369   0  85  0 124556 38976 select  Il   ttyE0- 631:30.41 /usr/pk
root      524  0.0  0.0   2912   868 ttyE0  Is+  25Jun09   0:00.00 /usr/li   0  
 524     1   0  85  0   2912   868 ttyraw  Is+  ttyE0    0:00.00 /usr/li
root      419  0.0  0.0   2912   868 ttyE1  Is+  25Jun09   0:00.00 /usr/li   0  
 419     1   0  85  0   2912   868 ttyraw  Is+  ttyE1    0:00.00 /usr/li
root      470  0.0  0.0   2912   868 ttyE2  Is+  25Jun09   0:00.00 /usr/li   0  
 470     1   0  85  0   2912   868 ttyraw  Is+  ttyE2    0:00.00 /usr/li
root      428  0.0  0.0   2912   868 ttyE3  Is+  25Jun09   0:00.01 /usr/li   0  
 428     1   0  85  0   2912   868 ttyraw  Is+  ttyE3    0:00.01 /usr/li


Home | Main Index | Thread Index | Old Index