tech-kern archive


Re: Anyone recall the dreaded tstile issue?



    Date:        Sat, 16 Jul 2022 00:48:59 -0400 (EDT)
    From:        Mouse <mouse%Rodents-Montreal.ORG@localhost>
    Message-ID:  <202207160448.AAA09997%Stone.Rodents-Montreal.ORG@localhost>

  | That's what I was trying to do with my looking at "X is tstiled waiting
  | for Y, who is tstiled waiting for Z, who is..." and looking at the
  | non-tstiled process(es) at the ends of those chains.

That can sometimes help, but this is a difficult issue to debug, as
often the offender is long gone before anyone notices.

  | My best guess at the moment is that there is a deadlock loop where
  | something tries to touch the puffs filesystem, the user process forks a
  | child as part of that operation, and the child gets locked up trying to
  | access the puffs filesystem.

That is possible, as is the case where locking is carried out
improperly (I lock a then try to lock b, you lock b then try to
lock a) - but those are the easier cases to find.
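
For the a-then-b versus b-then-a case, the usual cure is to make every
caller take the two locks in the same order.  A minimal userland sketch,
assuming POSIX threads as a stand-in for kernel mutexes; lock_a, lock_b,
lock_pair() and unlock_pair() are hypothetical names made up for this
illustration, not anything from the kernel:

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical locks standing in for "a" and "b". */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/*
 * Take two locks in a fixed global order (lowest address first),
 * whatever order the caller names them in.  Two threads calling
 * lock_pair(a, b) and lock_pair(b, a) then cannot each end up
 * holding one lock while waiting for the other.
 */
void
lock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
	if ((uintptr_t)x > (uintptr_t)y) {
		pthread_mutex_t *tmp = x;
		x = y;
		y = tmp;
	}
	pthread_mutex_lock(x);
	if (y != x)
		pthread_mutex_lock(y);
}

/* Release both locks; the release order does not matter for deadlock. */
void
unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
	pthread_mutex_unlock(x);
	if (y != x)
		pthread_mutex_unlock(y);
}
```

Ordering by address is only one possible convention; a documented
hierarchy ("always a before b") does the same job.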

A more common case, I believe, is

		func()
		{
			lock(something);
			/*
			 * do some work
			 */
			if (test for something strange) {
				/*
				 * this should not happen
				 */
				return EINVAL;
			}
			/*
			 * more stuff
			 */
			unlock(something);
			return answer;
		}

where I am sure you can see what's missing in this short segment ...  real
code is typically much messier, and the locks are not always that explicit;
they can be acquired/released as side effects of other function calls.
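
What is missing is the unlock on the error return.  One common way to
make that hard to get wrong is to give the function a single exit that
always drops the lock.  A minimal sketch, assuming POSIX threads in
userland rather than the kernel's own mutex primitives; func_fixed(),
lock_is_free(), shared_value and the "strange" argument are hypothetical,
invented here for illustration only:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical lock and data, standing in for "something". */
static pthread_mutex_t something = PTHREAD_MUTEX_INITIALIZER;
static int shared_value = 42;

/*
 * Same shape as func(), but every path taken after the lock is
 * acquired leaves through the "out" label, so the unlock cannot
 * be skipped by an early error return.
 */
int
func_fixed(int strange, int *answer)
{
	int error = 0;

	assert(answer != NULL);

	pthread_mutex_lock(&something);
	/* do some work */
	if (strange) {
		/* this should not happen */
		error = EINVAL;
		goto out;
	}
	/* more stuff */
	*answer = shared_value;
out:
	pthread_mutex_unlock(&something);
	return error;
}

/* Returns 1 if the lock is free right now, 0 if somebody leaked it. */
int
lock_is_free(void)
{
	if (pthread_mutex_trylock(&something) != 0)
		return 0;
	pthread_mutex_unlock(&something);
	return 1;
}
```

With the original shape, a call that hits the "strange" case would
return EINVAL and leave the lock held forever; here lock_is_free()
should report the lock as free after either outcome.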

The function (func here) which causes the problem is no longer
active, so no amount of stack tracing will find it.  The process
which called it might not even still exist; it might have
received the error return and exited.

Finding this kind of thing requires very careful and thorough
code reading, analysing every lock, and making sure that lock
gets released, somewhere, on every possible path after it is taken.
The best you can really hope for from examining the wedged system
is to find which lock is (usually "might be") the instigator of it all.
That can help narrow the focus of code investigation.

Mouse, start with the code you added ... make sure there are
no problems like this buried in it somewhere (your own code, and
everything it calls).   If that ends up finding nothing, then
the best course of action might be to use a fairly new kernel.
Some very good people (none of whom is me, so I can lather praise)
have done some very good work in fixing most of the issues we
used to have.  I haven't seen a tstile lockup in ages, and I used
to see them quite often (fortunately mostly ones that affected
comparatively little, but over time, things get more and more clogged,
until a reboot - which can rarely be clean in this state - is required).

kre

