NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/46955: process deadlock (tstile) runing amanda sendsize - rename bug in tmpfs ???
>Number: 46955
>Category: kern
>Synopsis: process deadlock (tstile) runing amanda sendsize - rename bug
>in tmpfs ???
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Sep 14 08:10:00 +0000 2012
>Originator: Dr. Wolfgang Stukenbrock
>Release: NetBSD 5.1.2
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD e003 5.1.2 NetBSD 5.1.2 (NSW-E003) #0: Mon Aug 27 10:33:31 CEST
2012 wgstuken@e003:/usr/src/sys/arch/amd64/compile/NSW-E003 amd64
Architecture: x86_64
Machine: amd64
>Description:
The system freezes over nigth "sometimes" - no login possible, reset
required.
I've added some cron jobs that will report vmstat, ps and df output to
syslog that run every30 minutes.
They report the following:
Sep 11 03:00:00 e003 vmstat: procs memory page
disks faults cpu
Sep 11 03:00:00 e003 vmstat: r b w avm fre flt re pi po fr sr
c0 w0 in sy cs us sy id
Sep 11 03:00:00 e003 vmstat: 0 0 0 2552356 198788 69 0 0 0 0 0
0 3 6 300 60 0 0 100
Sep 11 03:00:01 e003 vmstat: 0 0 0 2552384 198760 11 0 0 0 0 0
0 1 3 1081 91 0 0 100
Sep 11 03:00:02 e003 vmstat: 0 0 0 2552384 198760 0 0 0 0 0 0
0 2 4 36 45 0 0 100
Sep 11 03:01:01 e003 ps: UID PID PPID NLWP WCHAN STAT TIME RSZ
VSZ COMMAND
Sep 11 03:01:01 e003 ps: 0 0 0 44 - OKl 37:52.18 19156
0 [system]
Sep 11 03:01:01 e003 ps: 0 1 0 1 wait Is 0:00.01 1044
6520 init
Sep 11 03:01:01 e003 ps: 0 117 1 5 sigwait Isl 1:40.33 77952
109276 /usr/sbin/named -c /var/named/named.conf -4
Sep 11 03:01:01 e003 ps: 0 165 1 1 kqueue Ss 0:02.33 1068
6544 /usr/sbin/syslogd -s
Sep 11 03:01:01 e003 ps: 0 172 1 1 nanoslp Ss 0:06.33 1004
6504 /usr/sbin/ipmon -Dps
Sep 11 03:01:01 e003 ps: 0 188 1 1 select Ss 0:00.59 1228
6532 /usr/sbin/rpcbind -l
Sep 11 03:01:01 e003 ps: 0 201 1 1 select Ss 0:08.96 1036
5464 /usr/sbin/ypbind
Sep 11 03:01:01 e003 ps: 1002 297 775 1 select I 0:00.01 3372
35760 sshd: wgstuken@ttyp0
Sep 11 03:01:01 e003 ps: 0 311 1 1 pause Ss 0:13.74 5420
11320 /usr/sbin/ntpd
Sep 11 03:01:01 e003 ps: 0 313 1 1 select Is 0:00.00 2080
24808 /usr/sbin/sshd
Sep 11 03:01:01 e003 ps: 0 380 313 1 netio Is 0:00.01 4564
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 0 407 313 1 netio Is 0:00.01 4564
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 12 459 462 1 kqueue I 0:01.46 2540
12144 qmgr -l -t unix -u
Sep 11 03:01:01 e003 ps: 1002 460 407 1 select I 0:00.00 3372
35760 sshd: wgstuken@ttyp2
Sep 11 03:01:01 e003 ps: 0 462 1 1 kqueue Ss 0:02.53 2368
12028 /usr/libexec/postfix/master
Sep 11 03:01:01 e003 ps: 500 509 1 1 wait Is 0:00.00 796
4416 /etc/pkg/sbin/watch_server -b -p /var/run/nswSW.pid -c
/etc/pkg/watch-server.cfg
Sep 11 03:01:01 e003 ps: 500 518 509 1 select I 0:00.06 956
4416 /etc/pkg/sbin/watch_server -b -p /var/run/nswSW.pid -c
/etc/pkg/watch-server.cfg
Sep 11 03:01:01 e003 ps: 0 542 1 1 kqueue Is 0:00.01 1544
7616 /usr/sbin/inetd -l
Sep 11 03:01:01 e003 ps: 0 575 1 1 nanoslp Ss 0:01.64 1056
5468 /usr/sbin/cron
Sep 11 03:01:01 e003 ps: 0 601 313 1 netio Is 0:00.01 4564
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 1002 666 792 1 select I 0:00.00 3372
35760 sshd: wgstuken@ttyp1
Sep 11 03:01:01 e003 ps: 0 712 1 1 select Ss 0:06.41 17604
176944 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 0 743 1 3 select Ssl 0:02.15 5360
35056 /usr/pkg/sbin/bacula-fd -c /etc/pkg/bacula/bacula-fd.conf -g bacula
Sep 11 03:01:01 e003 ps: 0 775 313 1 netio Is 0:00.02 4508
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 0 792 313 1 netio Is 0:00.01 4564
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 0 875 313 1 netio Is 0:00.01 4564
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 1002 894 380 1 select I 0:00.00 3372
35760 sshd: wgstuken@ttyp5
Sep 11 03:01:01 e003 ps: 1002 1005 875 1 select I 0:00.00 3372
35760 sshd: wgstuken@ttyp4
Sep 11 03:01:01 e003 ps: 1002 1115 601 1 select I 0:00.00 3372
35760 sshd: wgstuken@ttyp3
Sep 11 03:01:01 e003 ps: 0 1214 313 1 netio Is 0:00.01 4588
35760 sshd: wgstuken [priv]
Sep 11 03:01:01 e003 ps: 1002 1360 1214 1 select I 0:00.01 3372
35760 sshd: wgstuken@ttyp6
Sep 11 03:01:01 e003 ps: 802 1440 712 1 semwait I 0:00.73 10000
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 2426 712 1 semwait I 0:01.13 9944
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 2585 712 1 semwait I 0:00.34 9804
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 2607 712 1 socket I 0:00.00 4440
115164 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 3683 712 1 semwait I 0:01.00 9920
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 4193 712 1 semwait I 0:00.88 10048
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 6004 712 1 semwait I 0:00.90 9968
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 7014 712 1 semwait I 0:01.02 10004
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 7061 712 1 semwait I 0:01.00 9948
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 7202 712 1 semwait I 0:01.11 9940
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 0 7543 15040 1 tstile D 0:00.03 1424
10132 runtar NSW-backup /usr/pkg/bin/gtar --create --file /dev/null --directory
/var/log --one-file-system --listed-incremental
/var/amanda/gnutar-lists/e003_var_log_0.new --sparse --ignore-failed-read
--totals --exclude-from /tmp/amanda/sendsize._var_log.20120910233003.exclude .
Sep 11 03:01:01 e003 ps: 802 7768 712 1 semwait I 0:00.98 10024
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 7929 712 1 semwait I 0:00.41 9960
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 0 10744 7543 1 - ZW 0:00.00 0
0 (sh)
Sep 11 03:01:01 e003 ps: 802 11307 712 1 semwait I 0:00.41 9924
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 800 11876 26376 1 wait I 0:00.00 1576
10160 /usr/pkg/libexec/sendsize amandad bsdtcp
Sep 11 03:01:01 e003 ps: 802 11955 712 1 kqueue S 0:00.28 9936
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 0 12002 575 1 piperd S 0:00.00 1200
5468 cron: running job
Sep 11 03:01:01 e003 ps: 802 13211 712 1 semwait I 0:00.95 9944
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 800 13326 11876 1 piperd I 0:00.00 1364
10192 /usr/pkg/libexec/sendsize amandad bsdtcp
Sep 11 03:01:01 e003 ps: 0 13686 15453 1 - ZW 0:00.00 0
0 (sh)
Sep 11 03:01:01 e003 ps: 802 14111 712 1 semwait I 0:01.02 10000
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 0 15029 25894 1 - ZW 0:00.00 0
0 (sh)
Sep 11 03:01:01 e003 ps: 800 15040 11876 1 piperd I 0:00.00 1364
10192 /usr/pkg/libexec/sendsize amandad bsdtcp
Sep 11 03:01:01 e003 ps: 802 15174 712 1 semwait I 0:01.07 9992
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 0 15453 19855 1 tstile D 0:00.03 1424
10132 runtar NSW-backup /usr/pkg/bin/gtar --create --file /dev/null --directory
/ --one-file-system --listed-incremental /var/amanda/gnutar-lists/e003__0.new
--sparse --ignore-failed-read --totals --exclude-from
/tmp/amanda/sendsize._.20120910233003.exclude .
Sep 11 03:01:01 e003 ps: 802 15908 712 1 semwait I 0:00.89 9920
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 500 16327 12002 1 wait Ss 0:00.00 1140
6664 /bin/sh -c /bin/ps -axwwo "uid pid ppid nlwp wchan state time rsz vsz
command" | logger -t ps -p local1.info
Sep 11 03:01:01 e003 ps: 802 16924 712 1 semwait I 0:00.97 9972
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 802 19410 712 1 semwait I 0:00.36 9928
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 800 19855 11876 1 piperd I 0:00.00 1352
10192 /usr/pkg/libexec/sendsize amandad bsdtcp
Sep 11 03:01:01 e003 ps: 12 21041 462 1 kqueue S 0:00.00 2432
12084 pickup -l -t fifo -u
Sep 11 03:01:01 e003 ps: 500 25050 16327 1 piperd S 0:00.00 804
4328 logger -t ps -p local1.info
Sep 11 03:01:01 e003 ps: 0 25894 13326 1 tstile D 0:00.00 1420
10132 runtar NSW-backup /usr/pkg/bin/gtar --create --file /dev/null --directory
/home.stand --one-file-system --listed-incremental
/var/amanda/gnutar-lists/e003_home.stand_0.new --sparse --ignore-failed-read
--totals --exclude-from /tmp/amanda/sendsize._home.stand.20120910233003.exclude
.
Sep 11 03:01:01 e003 ps: 800 26376 542 1 select I 0:00.00 1588
9156 amandad bsdtcp amdump amindexd amidxtaped
Sep 11 03:01:01 e003 ps: 802 28274 712 1 semwait I 0:00.91 10068
177968 /usr/pkg/sbin/httpd -k start
Sep 11 03:01:01 e003 ps: 500 28676 16327 1 - O 0:00.00 932
6552 /bin/ps -axwwo uid pid ppid nlwp wchan state time rsz vsz command
Sep 11 03:01:01 e003 ps: 0 404 530 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 530 656 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 558 874 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 656 558 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 662 404 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 874 297 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 268 872 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 591 268 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 607 891 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 851 591 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 872 607 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 891 666 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 469 915 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 599 469 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 849 1037 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 915 849 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 1037 460 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 161 476 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 357 803 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 476 1115 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 489 807 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 803 161 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 807 357 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 436 562 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 562 750 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 694 436 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 698 824 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 750 1005 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 824 694 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 321 767 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 393 1101 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 767 894 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 785 321 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 836 393 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 1101 785 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 1002 706 1360 1 pause Is 0:00.00 1232
4508 -csh
Sep 11 03:01:01 e003 ps: 0 1181 706 1 pause I 0:00.01 1204
4508 -csh -m
Sep 11 03:01:01 e003 ps: 0 1200 1181 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 1448 1450 1 ttyraw I+ 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 1450 1200 1 pause I 0:00.00 1204
4508 -sh (csh)
Sep 11 03:01:01 e003 ps: 0 497 1 1 wait I 0:00.00 1180
6664 /bin/sh /usr/pkg/bin/mysqld_safe --user=mysql --datadir=/var/mysql
--pid-file=/var/mysql/e003.pid --skip-thread-priority
Sep 11 03:01:01 e003 ps: 801 755 497 10 select Il 2:57.45 20124
82068 /usr/pkg/libexec/mysqld --basedir=/usr/pkg --datadir=/var/mysql
--user=mysql --skip-thread-priority --log-error=/var/mysql/e003.err
--pid-file=/var/mysql/e003.pid
Sep 11 03:01:01 e003 ps: 0 778 1 1 ttyraw Is+ 0:00.00 1136
6500 /usr/libexec/getty std.38400 console
Sep 11 03:02:00 e003 df: Filesystem 1K-blocks Used Avail %Cap
Mounted on
Sep 11 03:02:00 e003 df: /dev/raid0a 96247 46238 45197 50% /
Sep 11 03:02:00 e003 df: /dev/raid0f 16264494 59196 15392074 0%
/var
Sep 11 03:02:00 e003 df: /dev/raid0e 8132206 5899618 1825978 76%
/usr
Sep 11 03:02:00 e003 df: /dev/raid0g 30497948 164 28972888 0%
/nc-www
Sep 11 03:02:00 e003 df: /dev/raid0h 18737934 58 17800980 0%
/home.stand
Sep 11 03:02:00 e003 df: procfs 4 4 0 100%
/proc
Sep 11 03:02:00 e003 df: kernfs 1 1 0 100%
/kern
Sep 11 03:02:00 e003 df: tmpfs 1048576 736 1047840 0%
/tmp
Next output is at Sep 12 09:00:01 from vmstat.
The backup is started between 23:00 and 23:30 - the listing from 23:00
does not show
the amanda process, At 23:30 the state shown above is already reached.
At 03:15 the daily cronjob is started and I think that this one will
finaly lock up everything.
The runtar command is a binary from amanda software.
It will finaly execve to the gnutar binary and this has still not
happen.
In front of that it uses popen("<gnutar> --version 2>&1", "r") to
determine
the gnutar binary version.
It does a fgets from the stream and than simply forget FILE* - no
pclose() done on it - this looks
like a bug in amanda-client software ... (I've send a bug report for
too.)
It then does a rename() of the debug-output file - so the rename-bug
may also be relevant.
The rename() happens in /tmp - this is tmpfs.
The rename-bug-fix from 5.1_STABLE has been integrated into this kernel.
updated files:
src/sys/ufs/lfs/lfs_vnops.c
src/usr.bin/pmap/pmap.h
src/sys/ufs/ufs/inode.h
src/sys/ufs/ufs/ufs_extern.h
src/sys/ufs/ufs/ufs_lookup.c
src/sys/ufs/ufs/ufs_vnops.c
src/sys/ufs/ufs/ufs_wapbl.c
So either it does not fix the problem for 5.1.2 or there is another bug
related popen() if
there is no pclose().
The zombi processes are there due to the fact that the call to pclose()
is missing.
I tend to the assumption, that the rename fix for 5.1_STABLE does not
fix the problem in 5.1.2.
(or perhaps does not fix the problem at all ...)
Is it possible that the rename bug is still present in tmpfs?
No patch has been applied to any tmpfs-related files.
Accedently the debug files of amanda are in /tmp and are lost after
reboot, so I don't know the
famaous last debug-words of runtar.
>How-To-Repeat:
Setup amanda client as backup system, try to backup some filesystems
and wait.
It will happen not all the time, but - at least for us - once a month.
>Fix:
Not known till now. Event not realy known who to go on with further
debugging.
>Unformatted:
Home |
Main Index |
Thread Index |
Old Index