Re: Random lockups on an email server - possibly kern/50168

To: tech-kern%NetBSD.org@localhost
Subject: Re: Random lockups on an email server - possibly kern/50168
From: Taylor R Campbell <campbell+netbsd-tech-kern%mumble.net@localhost>
Date: Sun, 3 Apr 2016 14:36:11 +0000

   Date: Sun, 3 Apr 2016 09:51:08 -0400
   From: "D'Arcy J.M. Cain" <darcy%NetBSD.org@localhost>

   This led me to the following PR.

   http://gnats.netbsd.org/39016

   There is a bit of discussion and then it was closed with "This
   particular problem has been fixed. Other problems that lead to "tstile
   syndrome" still exist, because "tstile syndrome" is any generic
   deadlock."  It doesn't say what the fix was.  Could this be some sort
   of code regression?

Every mutex in the kernel is supposed to be held for at most some
constant duration.  When a thread tries to acquire a mutex that is
already held, it waits with wchan `tstile'.  There are hundreds or
thousands of different mutexes in any given system -- a bug with any
one of them could manifest that way.

Was your system completely locked up and unresponsive, or just the
services that mattered?  Can you get a stack trace from crash(8) for
the processes that are wedged?  If not, can you enter ddb, e.g. by
typing C-A-ESC, and do it there?

From either crash(8) or ddb, you can list the processes with `show
proc' and get a stack trace for any individual one with `bt 0t<pid>'.
(`0t' is the notation for decimal; ddb reads input as hexadecimal by
default, for whatever reason.)

   I am copying tech-kern as we seem to be getting deeper into the
   kernel.  Replies set there as well.

   Meanwhile I am running the following script, a modification of one
   suggested by Robert Elz.

   ...
       case "${wchan}" in
         tstile*)  x="`ps -p "${pid}" | grep tstile`"
                   if [ "X$x" = "X" ]; then continue; fi
                   dt=`date`
                   echo "TSTILE: ${dt} $x"
                   ;;

If you can get a stack trace out of crash(8), that would be more
helpful.  Maybe something like:

printf 'bt 0t%d\n' "${pid}" | crash

Usually the culprit is *not* the process or thread that is stuck in
tstile, but that stack trace will help to find what mutex is at issue.
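
A minimal sketch of how that could fit into a loop like yours, so that
every process waiting in tstile gets a backtrace in a single crash(8)
run.  The function name tstile_bt_commands is made up for
illustration, and I am assuming `ps -ax -o pid,wchan' as the source of
pid/wchan pairs; adjust to whatever your script already collects.

```shell
#!/bin/sh
# Read "pid wchan" pairs on stdin and print one crash(8)/ddb backtrace
# command for each process whose wait channel starts with "tstile".
# The ps header line ("PID WCHAN") is skipped because it does not match.
tstile_bt_commands() {
    while read -r pid wchan; do
        case "${wchan}" in
            tstile*) printf 'bt 0t%d\n' "${pid}" ;;
        esac
    done
}

# Assumed usage on the affected machine (crash(8) normally needs root):
#   ps -ax -o pid,wchan | tstile_bt_commands | crash
```

If nothing is in tstile, the function prints nothing, so crash simply
reads an empty command stream.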
