tech-kern archive


Anyone recall the dreaded tstile issue?



Some time back, I recall seeing people talking on the lists about some
sort of discouragingly common issue with processes getting stuck in
tstile waits.  (I've tried to scare up the relevant list mail on
mail-archive.netbsd.org, so far to no avail.)

I can't recall how long ago that was, nor what version(s) it was under.
But I recently started having one of my machines do something very
similar: occasionally (under circumstances largely uncharacterized so
far) the machine wedges, with most of userland stuck in apparently
permanent tstile waits.  ("Apparently permanent" = I've left the
machine alone, sometimes for multiple hours, without effect.)

The machine in question is still running my 5.2 derivative.

Anything that doesn't leave the kernel still works.  Or, let me be
precise: one of the machine's primary functions is packet routing, and
that still works, provided it's forwarding between real-hardware
interfaces; it is inference on my part that other purely-in-kernel
things would work.  But, because the problem seems to be userland
processes getting stuck waiting for locks, it would not surprise me for
pure-kernel things like packet forwarding to keep working.

Debugging this has been slow, because it wedges only occasionally.  I
once had it happen twice in the same day, but, based on unscientific
feel, I would say the expected MTBF is about a week.  That really slows
down the edit-build-test-debug cycle.

I set the machine up with a serial console; breaking into ddb works
(that's how I could tell what the processes were blocked on).  I added
debugging code, suitable for calling from ddb.  It just recently wedged
again, and the results are puzzling enough that I wanted to run them
past anyone
here with the leisure and inclination to offer suggestions, whether
simply based on what I've found or based on memories of the "dreaded"
tstile issue from the past.

My debugging code simply dumps out the state of the turnstiles.  I
captured that output on the machine at the other end of the console
serial line.
Between ps output and my debugging output, I can then track down what
process is blocked waiting on what other process.
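
For concreteness, the dump routine is roughly along these lines.  This
is a simplified sketch, not the exact code I'm running, and the
function name is made up for the example.  It walks every LWP and
prints the ones sleeping on "tstile", together with the wait channel,
which for a turnstile sleep is the address of the lock being waited
for.  The symbol and field names (alllwp, l_list, l_stat, l_wmesg,
l_wchan) are the usual 5.x-era ones, but check them against your own
tree before trying this.  No locking is taken: it is meant to be run
only from ddb, with the rest of the system already stopped.

/*
 * Simplified sketch of a ddb-callable dump of LWPs stuck in tstile
 * waits (not my exact code; names may not match your tree).
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/lwp.h>

void tstile_dump(void);		/* from ddb: call tstile_dump */

void
tstile_dump(void)
{
	struct lwp *l;

	LIST_FOREACH(l, &alllwp, l_list) {
		if (l->l_stat != LSSLEEP || l->l_wmesg == NULL ||
		    strcmp(l->l_wmesg, "tstile") != 0)
			continue;
		printf("pid %d lid %d (%s): lwp %p waiting on lock %p\n",
		    l->l_proc->p_pid, l->l_lid, l->l_proc->p_comm,
		    (void *)l, (void *)(uintptr_t)l->l_wchan);
	}
}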

In today's hang, I found:

Many processes
	blocked on
7599     1 3   3         4           d3e682a0                 nc tstile
	blocked on
13479    1 3   3         4           d1383a00              multi tstile
	blocked on
557      1 3   0        84           d2251500        mailwrapper puffsrpl

Many processes
	blocked on
17440    1 3   1         4           d1e38a20          bozohttpd tstile
	blocked on
10493    1 3   1         4           d68ef800          bozohttpd tstile
	blocked on
17512    1 3   2         4           d1f3bce0          xferwatch tstile
	blocked on
3985     1 3   1        84           d1c3ba80          bozohttpd puffsrpl

This "explains" why this is a relatively new thing; this machine has
been using a puffs filesystem for only a month and a half or so
(since about May 9th).

So I went looking for the puffs backing process in the ps listing,
fully expecting to find it stuck in tstile.  I didn't:

8169     1 3   3        84           d286e840              gitfs puffsget
23930    1 3   1        84           d5a4f0e0              gitfs piperd
21881    1 3   1        84           d1c3b800              gitfs select

gitfs is something I wrote that uses puffs to provide a filesystem view
of git repos.

There was also

23176    1 3   1        84           d1f3b7e0                git puffsrpl

which surprised me; I would not expect a git process to have anything
to do with anything under the puffs mount point, and thus no reason to
wait on puffsrpl.  But then, I wouldn't expect mailwrapper
to, either - the puffs filesystem forms part of my /export/ftp area,
which usually is not touched by anything but bozohttpd and ftpd.  It's
not a question of trying to page out a dirty page, either (that being
the only plausible reason that comes to mind for arm's-length processes
to be accessing the puffs filesystem); the puffs mount point is
read-only (and synchronous, noexec, nodev, union, though I wouldn't
expect those to matter).

The gitfs process does fork git subprocesses under some circumstances;
a filesystem access that does that normally will hang until the git
subprocess finishes.  I don't know whether process 23176 was forked by
gitfs or not; if it was, that could in theory have produced the
deadlock.
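
To make that potential deadlock concrete, the forking path amounts to
something like the following.  This is a deliberately simplified,
hypothetical sketch, not the real gitfs code: the server handles
requests one at a time, and while it sits in waitpid() no further
puffs requests get answered, so a child that touches anything under
the mount point ends up waiting on the server, which is waiting on the
child.

/*
 * Hypothetical, simplified sketch of a forking request path in a
 * single-threaded puffs-style server (NOT the real gitfs code).
 * While the parent sits in waitpid(), no filesystem requests are
 * serviced; if the child, or anything the child in turn waits on,
 * touches a path under the mount point, that access blocks in the
 * kernel waiting for this server, and the two deadlock.
 */
#include <sys/types.h>
#include <sys/wait.h>

#include <unistd.h>

static int
run_git(char *const argv[])
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* child: run git; it must not touch the mount point */
		execvp("git", argv);
		_exit(127);
	}

	/* parent (the filesystem server): blocked until git exits */
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return status;
}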

However, there is a second data point.  I have another machine, not
exposed to the world, which I was doing some gitfs development on.  I
also ran, on that machine, some small games which I then displayed over
X connections to my desktop machine.  On a few occasions, when I was
doing nothing but playing one of those games, the game process would
wedge in a tstile wait partway through a simple motion animation.
While I wasn't keeping careful records (it was only today that I had
any reason to think puffs was involved), I think there was no puffs
mountpoint active at least some of those times, and even if there was,
it most certainly wasn't being actively accessed, and thus wasn't
forking git subprocesses.

gitfs uses puffs, but not libpuffs - I can talk about why, if anyone
cares, but it would be difficult to keep it from veering into a rant,
and I see little point.

I'm going to be trying to come up with a way to capture additional
useful information, but I would welcome any thoughts anyone may have,
even if only of the "I suspect you might want to look at the $THING"
sort.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

