NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/46955: process deadlock (tstile) runing amanda sendsize - rename bug in tmpfs ???



>Number:         46955
>Category:       kern
>Synopsis:       process deadlock (tstile) runing amanda sendsize - rename bug 
>in tmpfs ???
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Sep 14 08:10:00 +0000 2012
>Originator:     Dr. Wolfgang Stukenbrock
>Release:        NetBSD 5.1.2
>Organization:
Dr. Nagler & Company GmbH
>Environment:
        
        
System: NetBSD e003 5.1.2 NetBSD 5.1.2 (NSW-E003) #0: Mon Aug 27 10:33:31 CEST 
2012  wgstuken@e003:/usr/src/sys/arch/amd64/compile/NSW-E003 amd64
Architecture: x86_64
Machine: amd64
>Description:
        The system freezes over nigth "sometimes" - no login possible, reset 
required.
        I've added some cron jobs that will report vmstat, ps and df output to 
syslog that run every30 minutes.
        They report the following:
Sep 11 03:00:00 e003 vmstat:  procs    memory      page                       
disks   faults      cpu 
Sep 11 03:00:00 e003 vmstat:  r b w    avm    fre  flt  re  pi   po   fr   sr 
c0 w0   in   sy  cs us sy id 
Sep 11 03:00:00 e003 vmstat:  0 0 0 2552356 198788  69   0   0    0    0    0  
0  3    6  300  60  0  0 100 
Sep 11 03:00:01 e003 vmstat:  0 0 0 2552384 198760  11   0   0    0    0    0  
0  1    3 1081  91  0 0 100 
Sep 11 03:00:02 e003 vmstat:  0 0 0 2552384 198760   0   0   0    0    0    0  
0  2    4   36  45  0 0 100 
Sep 11 03:01:01 e003 ps:  UID   PID  PPID NLWP WCHAN   STAT     TIME   RSZ    
VSZ COMMAND 
Sep 11 03:01:01 e003 ps:    0     0     0   44 -       OKl  37:52.18 19156      
0 [system] 
Sep 11 03:01:01 e003 ps:    0     1     0    1 wait    Is    0:00.01  1044   
6520 init  
Sep 11 03:01:01 e003 ps:    0   117     1    5 sigwait Isl   1:40.33 77952 
109276 /usr/sbin/named -c /var/named/named.conf -4  
Sep 11 03:01:01 e003 ps:    0   165     1    1 kqueue  Ss    0:02.33  1068   
6544 /usr/sbin/syslogd -s  
Sep 11 03:01:01 e003 ps:    0   172     1    1 nanoslp Ss    0:06.33  1004   
6504 /usr/sbin/ipmon -Dps  
Sep 11 03:01:01 e003 ps:    0   188     1    1 select  Ss    0:00.59  1228   
6532 /usr/sbin/rpcbind -l  
Sep 11 03:01:01 e003 ps:    0   201     1    1 select  Ss    0:08.96  1036   
5464 /usr/sbin/ypbind  
Sep 11 03:01:01 e003 ps: 1002   297   775    1 select  I     0:00.01  3372  
35760 sshd: wgstuken@ttyp0  
Sep 11 03:01:01 e003 ps:    0   311     1    1 pause   Ss    0:13.74  5420  
11320 /usr/sbin/ntpd  
Sep 11 03:01:01 e003 ps:    0   313     1    1 select  Is    0:00.00  2080  
24808 /usr/sbin/sshd  
Sep 11 03:01:01 e003 ps:    0   380   313    1 netio   Is    0:00.01  4564  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps:    0   407   313    1 netio   Is    0:00.01  4564  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps:   12   459   462    1 kqueue  I     0:01.46  2540  
12144 qmgr -l -t unix -u  
Sep 11 03:01:01 e003 ps: 1002   460   407    1 select  I     0:00.00  3372  
35760 sshd: wgstuken@ttyp2  
Sep 11 03:01:01 e003 ps:    0   462     1    1 kqueue  Ss    0:02.53  2368  
12028 /usr/libexec/postfix/master  
Sep 11 03:01:01 e003 ps:  500   509     1    1 wait    Is    0:00.00   796   
4416 /etc/pkg/sbin/watch_server -b -p /var/run/nswSW.pid -c 
/etc/pkg/watch-server.cfg  
Sep 11 03:01:01 e003 ps:  500   518   509    1 select  I     0:00.06   956   
4416 /etc/pkg/sbin/watch_server -b -p /var/run/nswSW.pid -c 
/etc/pkg/watch-server.cfg  
Sep 11 03:01:01 e003 ps:    0   542     1    1 kqueue  Is    0:00.01  1544   
7616 /usr/sbin/inetd -l  
Sep 11 03:01:01 e003 ps:    0   575     1    1 nanoslp Ss    0:01.64  1056   
5468 /usr/sbin/cron  
Sep 11 03:01:01 e003 ps:    0   601   313    1 netio   Is    0:00.01  4564  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps: 1002   666   792    1 select  I     0:00.00  3372  
35760 sshd: wgstuken@ttyp1  
Sep 11 03:01:01 e003 ps:    0   712     1    1 select  Ss    0:06.41 17604 
176944 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:    0   743     1    3 select  Ssl   0:02.15  5360  
35056 /usr/pkg/sbin/bacula-fd -c /etc/pkg/bacula/bacula-fd.conf -g bacula  
Sep 11 03:01:01 e003 ps:    0   775   313    1 netio   Is    0:00.02  4508  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps:    0   792   313    1 netio   Is    0:00.01  4564  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps:    0   875   313    1 netio   Is    0:00.01  4564  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps: 1002   894   380    1 select  I     0:00.00  3372  
35760 sshd: wgstuken@ttyp5  
Sep 11 03:01:01 e003 ps: 1002  1005   875    1 select  I     0:00.00  3372  
35760 sshd: wgstuken@ttyp4  
Sep 11 03:01:01 e003 ps: 1002  1115   601    1 select  I     0:00.00  3372  
35760 sshd: wgstuken@ttyp3  
Sep 11 03:01:01 e003 ps:    0  1214   313    1 netio   Is    0:00.01  4588  
35760 sshd: wgstuken [priv]  
Sep 11 03:01:01 e003 ps: 1002  1360  1214    1 select  I     0:00.01  3372  
35760 sshd: wgstuken@ttyp6  
Sep 11 03:01:01 e003 ps:  802  1440   712    1 semwait I     0:00.73 10000 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  2426   712    1 semwait I     0:01.13  9944 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  2585   712    1 semwait I     0:00.34  9804 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  2607   712    1 socket  I     0:00.00  4440 
115164 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  3683   712    1 semwait I     0:01.00  9920 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  4193   712    1 semwait I     0:00.88 10048 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  6004   712    1 semwait I     0:00.90  9968 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  7014   712    1 semwait I     0:01.02 10004 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  7061   712    1 semwait I     0:01.00  9948 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  7202   712    1 semwait I     0:01.11  9940 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:    0  7543 15040    1 tstile  D     0:00.03  1424  
10132 runtar NSW-backup /usr/pkg/bin/gtar --create --file /dev/null --directory 
/var/log --one-file-system --listed-incremental 
/var/amanda/gnutar-lists/e003_var_log_0.new --sparse --ignore-failed-read 
--totals --exclude-from /tmp/amanda/sendsize._var_log.20120910233003.exclude .  
Sep 11 03:01:01 e003 ps:  802  7768   712    1 semwait I     0:00.98 10024 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802  7929   712    1 semwait I     0:00.41  9960 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:    0 10744  7543    1 -       ZW    0:00.00     0      
0 (sh) 
Sep 11 03:01:01 e003 ps:  802 11307   712    1 semwait I     0:00.41  9924 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  800 11876 26376    1 wait    I     0:00.00  1576  
10160 /usr/pkg/libexec/sendsize amandad bsdtcp  
Sep 11 03:01:01 e003 ps:  802 11955   712    1 kqueue  S     0:00.28  9936 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:    0 12002   575    1 piperd  S     0:00.00  1200   
5468 cron: running job  
Sep 11 03:01:01 e003 ps:  802 13211   712    1 semwait I     0:00.95  9944 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  800 13326 11876    1 piperd  I     0:00.00  1364  
10192 /usr/pkg/libexec/sendsize amandad bsdtcp  
Sep 11 03:01:01 e003 ps:    0 13686 15453    1 -       ZW    0:00.00     0      
0 (sh) 
Sep 11 03:01:01 e003 ps:  802 14111   712    1 semwait I     0:01.02 10000 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:    0 15029 25894    1 -       ZW    0:00.00     0      
0 (sh) 
Sep 11 03:01:01 e003 ps:  800 15040 11876    1 piperd  I     0:00.00  1364  
10192 /usr/pkg/libexec/sendsize amandad bsdtcp  
Sep 11 03:01:01 e003 ps:  802 15174   712    1 semwait I     0:01.07  9992 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:    0 15453 19855    1 tstile  D     0:00.03  1424  
10132 runtar NSW-backup /usr/pkg/bin/gtar --create --file /dev/null --directory 
/ --one-file-system --listed-incremental /var/amanda/gnutar-lists/e003__0.new 
--sparse --ignore-failed-read --totals --exclude-from 
/tmp/amanda/sendsize._.20120910233003.exclude .  
Sep 11 03:01:01 e003 ps:  802 15908   712    1 semwait I     0:00.89  9920 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  500 16327 12002    1 wait    Ss    0:00.00  1140   
6664 /bin/sh -c /bin/ps -axwwo "uid pid ppid nlwp wchan state time rsz vsz 
command" | logger -t ps -p local1.info  
Sep 11 03:01:01 e003 ps:  802 16924   712    1 semwait I     0:00.97  9972 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  802 19410   712    1 semwait I     0:00.36  9928 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  800 19855 11876    1 piperd  I     0:00.00  1352  
10192 /usr/pkg/libexec/sendsize amandad bsdtcp  
Sep 11 03:01:01 e003 ps:   12 21041   462    1 kqueue  S     0:00.00  2432  
12084 pickup -l -t fifo -u  
Sep 11 03:01:01 e003 ps:  500 25050 16327    1 piperd  S     0:00.00   804   
4328 logger -t ps -p local1.info  
Sep 11 03:01:01 e003 ps:    0 25894 13326    1 tstile  D     0:00.00  1420  
10132 runtar NSW-backup /usr/pkg/bin/gtar --create --file /dev/null --directory 
/home.stand --one-file-system --listed-incremental 
/var/amanda/gnutar-lists/e003_home.stand_0.new --sparse --ignore-failed-read 
--totals --exclude-from /tmp/amanda/sendsize._home.stand.20120910233003.exclude 
.  
Sep 11 03:01:01 e003 ps:  800 26376   542    1 select  I     0:00.00  1588   
9156 amandad bsdtcp amdump amindexd amidxtaped  
Sep 11 03:01:01 e003 ps:  802 28274   712    1 semwait I     0:00.91 10068 
177968 /usr/pkg/sbin/httpd -k start  
Sep 11 03:01:01 e003 ps:  500 28676 16327    1 -       O     0:00.00   932   
6552 /bin/ps -axwwo uid pid ppid nlwp wchan state time rsz vsz command  
Sep 11 03:01:01 e003 ps:    0   404   530    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   530   656    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   558   874    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0   656   558    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   662   404    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002   874   297    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0   268   872    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   591   268    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   607   891    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0   851   591    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   872   607    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002   891   666    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0   469   915    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   599   469    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   849  1037    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0   915   849    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002  1037   460    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0   161   476    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0   357   803    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002   476  1115    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0   489   807    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   803   161    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   807   357    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   436   562    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   562   750    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0   694   436    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   698   824    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002   750  1005    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0   824   694    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   321   767    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0   393  1101    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002   767   894    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0   785   321    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   836   393    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0  1101   785    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps: 1002   706  1360    1 pause   Is    0:00.00  1232   
4508 -csh  
Sep 11 03:01:01 e003 ps:    0  1181   706    1 pause   I     0:00.01  1204   
4508 -csh -m  
Sep 11 03:01:01 e003 ps:    0  1200  1181    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0  1448  1450    1 ttyraw  I+    0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0  1450  1200    1 pause   I     0:00.00  1204   
4508 -sh (csh) 
Sep 11 03:01:01 e003 ps:    0   497     1    1 wait    I     0:00.00  1180   
6664 /bin/sh /usr/pkg/bin/mysqld_safe --user=mysql --datadir=/var/mysql 
--pid-file=/var/mysql/e003.pid --skip-thread-priority  
Sep 11 03:01:01 e003 ps:  801   755   497   10 select  Il    2:57.45 20124  
82068 /usr/pkg/libexec/mysqld --basedir=/usr/pkg --datadir=/var/mysql 
--user=mysql --skip-thread-priority --log-error=/var/mysql/e003.err 
--pid-file=/var/mysql/e003.pid  
Sep 11 03:01:01 e003 ps:    0   778     1    1 ttyraw  Is+   0:00.00  1136   
6500 /usr/libexec/getty std.38400 console  
Sep 11 03:02:00 e003 df: Filesystem   1K-blocks       Used      Avail %Cap 
Mounted on 
Sep 11 03:02:00 e003 df: /dev/raid0a       96247      46238      45197  50% / 
Sep 11 03:02:00 e003 df: /dev/raid0f    16264494      59196   15392074   0% 
/var 
Sep 11 03:02:00 e003 df: /dev/raid0e     8132206    5899618    1825978  76% 
/usr 
Sep 11 03:02:00 e003 df: /dev/raid0g    30497948        164   28972888   0% 
/nc-www 
Sep 11 03:02:00 e003 df: /dev/raid0h    18737934         58   17800980   0% 
/home.stand 
Sep 11 03:02:00 e003 df: procfs                4          4          0 100% 
/proc 
Sep 11 03:02:00 e003 df: kernfs                1          1          0 100% 
/kern 
Sep 11 03:02:00 e003 df: tmpfs           1048576        736    1047840   0% 
/tmp 

        Next output is at Sep 12 09:00:01 from vmstat.
        The backup is started between 23:00 and 23:30 - the listing from 23:00 
does not show 
        the amanda process, At 23:30 the state shown above is already reached.
        At 03:15 the daily cronjob is started and I think that this one will 
finaly lock up everything.

        The runtar command is a binary from amanda software.
        It will finaly execve to the gnutar binary and this has still not 
happen.
        In front of that it uses popen("<gnutar> --version 2>&1", "r") to 
determine
        the gnutar binary version.
        It does a fgets from the stream and than simply forget FILE* - no 
pclose() done on it - this looks
        like a bug in amanda-client software ... (I've send a bug report for 
too.)
        It then does a rename() of the debug-output file - so the rename-bug 
may also be relevant.
        The rename() happens in /tmp - this is tmpfs.

        The rename-bug-fix from 5.1_STABLE has been integrated into this kernel.
        updated files:
          src/sys/ufs/lfs/lfs_vnops.c
          src/usr.bin/pmap/pmap.h
          src/sys/ufs/ufs/inode.h
          src/sys/ufs/ufs/ufs_extern.h
          src/sys/ufs/ufs/ufs_lookup.c
          src/sys/ufs/ufs/ufs_vnops.c
          src/sys/ufs/ufs/ufs_wapbl.c

        So either it does not fix the problem for 5.1.2 or there is another bug 
related popen() if
        there is no pclose().
        The zombi processes are there due to the fact that the call to pclose() 
is missing.
        I tend to the assumption, that the rename fix for 5.1_STABLE does not 
fix the problem in 5.1.2.
        (or perhaps does not fix the problem at all ...)
        Is it possible that the rename bug is still present in tmpfs?
        No patch has been applied to any tmpfs-related files.

        Accedently the debug files of amanda are in /tmp and are lost after 
reboot, so I don't know the
        famaous last debug-words of runtar.

>How-To-Repeat:
        Setup amanda client as backup system, try to backup some filesystems 
and wait.
        It will happen not all the time, but - at least for us - once a month.
>Fix:
        Not known till now. Event not realy known who to go on with further 
debugging.

>Unformatted:
        
        


Home | Main Index | Thread Index | Old Index