Re: Random lockups on an email server - possibly kern/50168
On Mon, 28 Mar 2016 12:02:27 -0400
"D'Arcy J.M. Cain" <darcy%NetBSD.org@localhost> wrote:
> As far as I can tell I am seeing a total of 2GB of memory used by all
> processes and resident in memory, but the system (top
> and /proc/meminfo) is telling me that 17GB of memory is in use.
> What's using the other 15GB?
Meanwhile, my system crashed again. I have taken to rebooting every
morning (better a controlled five-minute downtime than a crash that
costs at least half an hour). Here is what was on the screen when it
locked up.
load averages: 1.74, 2.53, 2.39; up 4+16:45:53 04:42:39
491 processes: 446 sleeping, 43 zombie, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 18G Act, 9227M Inact, 11M Wired, 86M Exec, 26G File, 12M Free
Swap: 32G Total, 32G Free
  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
25248 root      85    0   148M   72M select/1   0:02  0.00%  0.00% perl
  212 root     117    0    87M   52M tstile/2   1:02  0.00%  0.00% auth
 8125 kontakt  117    0    64M   51M tstile/4   0:00  0.00%  0.00% imap
17669 ennis    117    0    59M   47M tstile/8   0:02  0.00%  0.00% imap
 2550 mailman  117    0   134M   46M tstile/0   0:34  0.00%  0.00% python2.7
    0 root       0    0     0K   45M CPU/15    51:28  0.00%  0.00% [system]
 1305 mailman  117    0   136M   44M tstile/0   0:27  0.00%  0.00% python2.7
 1691 mailman  117    0   134M   44M tstile/4   0:28  0.00%  0.00% python2.7
29932 www       85    0   362M   37M semwai/1   0:07  0.00%  0.00% httpd
17758 www       85    0   365M   34M semwai/5   0:11  0.00%  0.00% httpd
 2143 www       85    0   362M   33M semwai/0   0:10  0.00%  0.00% httpd
 2908 mailman  117    0   123M   32M tstile/8   0:25  0.00%  0.00% python2.7
 1434 root      85    0   347M   30M select/1   0:05  0.00%  0.00% httpd
 2718 sgh       85    0    43M   29M kqueue/0   0:14  0.00%  0.00% imap
12296 www       85    0   359M   29M semwai/7   0:06  0.00%  0.00% httpd
27886 www       85    0   356M   28M kqueue/1   0:03  0.00%  0.00% httpd
 5943 www       85    0   357M   27M semwai/1   0:02  0.00%  0.00% httpd
25826 www       85    0   356M   26M semwai/1   0:02  0.00%  0.00% httpd
14331 www       85    0   352M   23M semwai/4   0:01  0.00%  0.00% httpd
 2039 mailman  117    0   118M   23M tstile/1   0:25  0.00%  0.00% python2.7
 1863 postgrey  85    0    82M   21M select/1   0:41  0.00%  0.00% perl
27179 moegross 117    0    32M   20M tstile/1   0:04  0.00%  0.00% imap
27262 root     117    0    96M   18M tstile/9   0:00  0.00%  0.00% python3.4
15158 root      85    0    95M   18M flt_no/8   0:00  0.00%  0.00% python3.4
 1594 mailman  117    0   115M   16M tstile/8   0:24  0.00%  0.00% python2.7
 1720 mailman  117    0   115M   16M tstile/1   0:23  0.00%  0.00% python2.7
 2238 mailman   85    0   101M   15M select/1   0:00  0.00%  0.00% python2.7
26659 eref3     85    0    97M   15M flt_no/1   0:00  0.00%  0.00% python3.4
 7355 root      85    0   148M   15M select/9   0:00  0.00%  0.00% perl
And my memory test:
Fri Apr 1 04:39:12 EDT 2016
PS: 2085092
PROC: 32033408
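(For the record, "PS" is the resident sizes of all processes added up
and "PROC" is the system-wide figure; the sketch below is only a
paraphrase of the test, not the script verbatim.)

#!/bin/sh
# Paraphrased memory test: sum the resident set size (KiB) of every
# process as reported by ps(1), for comparison with the system-wide
# figure the kernel reports.
date
ps -ax -o rss= | awk '{ kb += $1 } END { printf "PS: %d\n", kb }'
# The "PROC" number is read from /proc/meminfo; the exact field is
# not reproduced here.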
It was pointed out to me that the interesting bit was that so many
processes were waiting on tstile. This may not be a swap issue after
all, at least not directly.
"Those indicate a kernel lock problem of some kind. tstile is the wchan
used for a process sleeping on an internal lock - to debug this, you
need to find out which lock it is - and probably which locks they all
are. Some of that (in fact, probably most of it) is probably
legitimate, most likely there is one process there that is locking
something (trying to) which is never going to be unlocked - either
because of a deadlock with another of them, or because something in the
code simply missed an exit path and forgot to unlock. If that process
has other locks held, then eventually some other process is going to
want one of those locks, and it hangs, perhaps while holding more locks
- then some other process is going to need a lock that's already
locked, ...
"Eventually something that is important gets locked, and everything
stops working when processes try to get that important lock that some
process that is being blocked by one of the other less important locks
has held, and the system seems to freeze - actually it is probably
still working "correctly" - if only that one, original lock, was
released..."
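To make that concrete, here is a toy userland sketch (an illustration
only, nothing to do with the actual kernel code) of two jobs taking the
same two locks in opposite order and wedging each other exactly as
described:

#!/bin/sh
# Each job grabs lock A and lock B, but in opposite order; each ends
# up holding the lock the other needs, so neither ever finishes.  It
# is the same shape as the tstile pile-up, with mkdir(1) as the lock.
lock()   { while ! mkdir "/tmp/$1.lock" 2>/dev/null; do sleep 1; done; }
unlock() { rmdir "/tmp/$1.lock"; }

( lock A; sleep 2; lock B; echo "job 1 done"; unlock B; unlock A ) &
( lock B; sleep 2; lock A; echo "job 2 done"; unlock A; unlock B ) &
wait    # never returns once both jobs have taken their first lock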
This led me to the following PR.
http://gnats.netbsd.org/39016
There was a bit of discussion and then the PR was closed with: "This
particular problem has been fixed. Other problems that lead to 'tstile
syndrome' still exist, because 'tstile syndrome' is any generic
deadlock." It doesn't say what the fix was. Could this be some sort
of code regression?
I am copying tech-kern as we seem to be getting deeper into the
kernel. Replies are directed there as well.
Meanwhile, I am running the following script, a modification of one
suggested by Robert Elz.
while true
do
    # check once a second for processes whose wait channel is tstile
    ps -ax -o pid= -o wchan= | while read pid wchan
    do
        case "${wchan}" in
        tstile*)
            # confirm it is still in tstile and log the full ps line
            x="`ps -p "${pid}" | grep tstile`"
            if [ "X$x" = "X" ]; then continue; fi
            dt=`date`
            echo "TSTILE: ${dt} $x"
            ;;
        esac
    done
    sleep 1
done
--
D'Arcy J.M. Cain <darcy%NetBSD.org@localhost>
http://www.NetBSD.org/ IM:darcy%Vex.Net@localhost