tech-kern archive
Re: Random lockups on an email server - possibly kern/50168
Date: Sun, 3 Apr 2016 09:51:08 -0400
From: "D'Arcy J.M. Cain" <darcy%NetBSD.org@localhost>
This led me to the following PR.
http://gnats.netbsd.org/39016
There is a bit of discussion and then it was closed with "This
particular problem has been fixed. Other problems that lead to "tstile
syndrome" still exist, because "tstile syndrome" is any generic
deadlock." It doesn't say what the fix was. Could this be some sort
of code regression?
Every mutex in the kernel is supposed to be held for at most some
constant duration. When someone tries to acquire a mutex that is
already held, it will wait with wchan `tstile'. There are hundreds or
thousands of different mutexes in any given system -- a bug with any
one of them could manifest that way.
Was your system completely locked up and unresponsive, or just the
services that mattered? Can you get a stack trace from crash(8) for
the processes that are wedged? If not, can you enter ddb, e.g. by
typing C-A-ESC, and do it there?
From either crash(8) or ddb, you can list the processes with `show
proc' and get a stack trace for any individual one with `bt 0t<pid>'.
(`0t' is the notation for decimal; ddb reads input as hexadecimal by
default, for whatever reason.)
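To make the radix point concrete, here is a small sh sketch (the function name as_ddb_reads is made up for illustration) that shows which number you would get if you typed a pid's decimal digits without the `0t' prefix and they were parsed as hexadecimal:

```shell
#!/bin/sh
# Hypothetical helper: interpret a string of digits as hexadecimal, the way
# an unprefixed number would be read, and print the resulting decimal value.
as_ddb_reads() {
    printf '%d\n' "0x$1"
}

# Pid 1234 typed without `0t' would be taken as 0x1234.
as_ddb_reads 1234
```

So `bt 1234' would look for pid 4660, not 1234, which is why the `0t' prefix matters.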
I am copying tech-kern as we seem to be getting deeper into the
kernel. Replies set there as well.
Meanwhile I am running the following script, a modification of one
suggested by Robert Elz.
...
case "${wchan}" in
tstile*)
    x=$(ps -p "${pid}" | grep tstile)
    if [ -z "$x" ]; then continue; fi
    dt=$(date)
    echo "TSTILE: ${dt} $x"
    ;;
If you can get a stack trace out of crash(8), that would be more
helpful. Maybe something like:
printf 'bt 0t%d\n' "${pid}" | crash
Usually the culprit is *not* the process or thread that is stuck in
tstile, but that stack trace will help to find what mutex is at issue.
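If several processes are wedged at once, the one-liner above could be wrapped up roughly as follows. This is only a sketch: the function name bt_commands is invented here, and it assumes you already have a list of pids, one per line, on standard input.

```shell
#!/bin/sh
# Hypothetical helper: read pids, one per line, from standard input and
# emit one `bt 0t<pid>' command per pid, suitable for feeding to crash(8).
bt_commands() {
    while read -r pid; do
        case "${pid}" in
            ''|*[!0-9]*) continue ;;    # skip blank or non-numeric lines
        esac
        printf 'bt 0t%d\n' "${pid}"
    done
}
```

One possible way to use it, assuming ps(1) on your system accepts `-o pid= -o wchan=' and prints the wait channel in the second column:

    ps -ax -o pid= -o wchan= | awk '$2 ~ /^tstile/ { print $1 }' | bt_commands | crash

The function only generates the command text, so you can inspect what it would send before piping it into crash.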