Current-Users archive


problem with httpd hang



Hello,
for some time I have had an issue with apache (2.2) hanging on
ftp.fr.netbsd.org (running a recent 7.0_BETA). When this happens,
port 80 is still open and accepts connections, but requests are not handled.
This seems to be because the main httpd process no longer does anything,
and its exited children are not reaped (they stay around as zombies):
antioche:/home/bouyer>ps axuww |grep http
www       1428  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
root      1649  0.0  0.1  74892  5368 ?      Is    1:23PM  0:17.99 /usr/pkg/sbin/httpd -k start 
www       2374  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www       4303  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www       5076  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www       6364  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www       6642  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www       9104  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www       9956  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      19196  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      21177  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      21193  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      22053  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      23249  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      29014  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)
www      29271  0.0  0.0      0     0 ?      Z          -  0:00.00 (httpd)

antioche:/home/bouyer>ps axlww | grep http
  98  1428  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
   0  1649     1    0  85  0  74892  5368 pipe_wr Is   ?       0:17.99 /usr/pkg/sbin/httpd -k start 
  98  2374  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98  4303  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98  5076  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98  6364  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98  6642  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98  9104  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98  9956  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 19196  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 21177  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 21193  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 22053  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 23249  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 29014  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)
  98 29271  1649    0   0  0      0     0 -       Z    ?       0:00.00 (httpd)


The main process (1649) seems to be stuck on a pipe write (its WCHAN is pipe_wr). fstat tells me:
antioche:/home/bouyer>fstat -p 1649
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
root     httpd       1649   wd /              2 drwxr-xr-x     512 r 
root     httpd       1649    0 /          51846 crw-rw-rw-    null r 
root     httpd       1649    1 /          51846 crw-rw-rw-    null w 
root     httpd       1649    2 /var       42803 -rw-r--r--  131768080 w 
root     httpd       1649    3* crypto 0xfffffe810efad7e0
root     httpd       1649    4* internet stream tcp *:http
root     httpd       1649    5* internet6 stream tcp *:http
root     httpd       1649    6* crypto 0xfffffe810efad888
root     httpd       1649    7 flags 0x80034<ISTTY,MPSAFE,LOCKSWORK,CLEAN>
root     httpd       1649    8 flags 0x80034<ISTTY,MPSAFE,LOCKSWORK,CLEAN>
root     httpd       1649    9* pipe 0xfffffe810f783dc0 -> 0x0 w
root     httpd       1649   10 /          41817 -rw-r--r--      53 r 
root     httpd       1649   11* pipe 0xfffffe81da43aa28 <- 0xfffffe81d83a03e8 rn
root     httpd       1649   12* pipe 0xfffffe81d83a03e8 -> 0xfffffe81da43aa28 w
root     httpd       1649   13 /var       42823 -rw-r--r--  2358066 w 
root     httpd       1649   14 /var       42807 -rw-r--r--  3449145 w 
root     httpd       1649   15 /var       42807 -rw-r--r--  3449145 w 
root     httpd       1649   16 /var       42901 -rw-r--r--  121307 w 

So there are 2 pipes open in write mode: one with no more readers (fd 9), and one
with only a single reader left, process 1649 itself (fd 12, whose read end is fd 11).
I don't know if it's writing to 0xfffffe810f783dc0 and failed to get
a SIGPIPE, or if it's writing to 0xfffffe81d83a03e8 (in which case
it's a real deadlock; I guess httpd failed to properly close the read
end of its pipe after fork). This second pipe seems to be inherited by all
child httpd processes, and maybe it's used for communication between master
and slaves.
The pipe on fd 9 is common to all processes started from /etc/rc.


gdb tells me about process 1649:
[Switching to LWP 1]
0x00007f7ff583c02a in write () from /usr/lib/libc.so.12
(gdb) where
#0  0x00007f7ff583c02a in write () from /usr/lib/libc.so.12
#1  0x00007f7ff6007388 in write () from /usr/lib/libpthread.so.1
#2  0x00007f7ff6c1832f in apr_file_write () from /usr/pkg/lib/libapr-1.so.0
#3  0x000000000044b7ce in pod_signal_internal ()
#4  0x000000000044ba75 in ap_mpm_pod_signal ()
#5  0x0000000000458ac6 in perform_idle_server_maintenance ()
#6  0x00000000004590b8 in ap_mpm_run ()
#7  0x0000000000425d7d in main ()

Attaching to the process with gdb and then exiting from gdb un-hung the
process; it's working again. The pipes on fd 9, 11 and 12 are still there.

Any idea how to debug this further?
This has been running fine for months, and started happening only a few days
ago. It's not related to an upgrade (I upgraded only to see if this would
fix the problem), so I guess something in the load pattern has changed.


-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

