tech-kern archive


Re: Anyone recall the dreaded tstile issue?



>> My best guess at the moment is that there is a deadlock loop where
>> something tries to touch the puffs filesystem, the user process
>> forks a child as part of that operation, and the child gets locked
>> up trying to access the puffs filesystem.
> That is possible, [...]

> A more common case, I believe, is [...failing to unlock in an error
> path...]

> The function [...] which causes the problem is no longer active, no
> amount of stack tracing will find it.  The process which called it
> might not even still exist, it might have received the error return,
> and exited.

I find the notion of a nonexistent process holding a lock disturbing,
but of course that's just a human-layer issue.

> Finding this kind of thing requires very careful and thorough code
> reading, analysing every lock, and making sure that lock gets
> released, somewhere, on every possible path after it is taken.

Well...if I wanted to debug that, I would probably grow each lock by
some kind of indication (a PC value, or __FILE__ and __LINE__) of where
it was last locked.  Then, once the culprit lock is found....
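
A minimal sketch of that idea, written as a userland stand-in: it wraps a pthread mutex rather than the kernel's own lock type, and the names traced_lock, traced_lock_at and traced_unlock are hypothetical, not anything from the NetBSD tree.  The point is only that the lock carries the __FILE__ and __LINE__ of its most recent locker, and that unlocking leaves the record alone so it can still be read from a wedged system.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wrapper: a mutex plus a record of where it was last taken. */
struct traced_lock {
	pthread_mutex_t mtx;
	const char *file;	/* __FILE__ of the most recent successful lock */
	int line;		/* __LINE__ of the most recent successful lock */
};

#define TRACED_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, 0 }

static int
traced_lock_at(struct traced_lock *lk, const char *file, int line)
{
	int rv;

	assert(lk != NULL);
	rv = pthread_mutex_lock(&lk->mtx);
	if (rv == 0) {
		/* Written while holding the lock, so the pair stays consistent. */
		lk->file = file;
		lk->line = line;
	}
	return rv;
}

static int
traced_unlock(struct traced_lock *lk)
{
	assert(lk != NULL);
	/* The record is deliberately not cleared: it names the last locker. */
	return pthread_mutex_unlock(&lk->mtx);
}

/* Callers use this macro so the call site is captured automatically. */
#define traced_lock(lk) traced_lock_at((lk), __FILE__, __LINE__)
```

With something like this in place, a lock that was taken and never released on some error path would still point at the line that took it, even after the function (or the whole process) that took it is gone.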

> The best you can really hope for from examining the wedged system is
> to find which lock is (usually "might be") the instigator of it all.
> That can help narrow the focus of code investigation.

At the moment, I'd be happy with that much.  But the system has only
the one puffs mount, which is off under /export/ftp, not anywhere that
is, for example, expected to be on anyone's $PATH, and all the
X-waiting-on-Y-waiting-on-Z chains end up with someone waiting on
puffsrpl.  And the puffs userland processes show no indication of being
stuck in the "holding a lock that's not getting released" sense.  So
there probably is nothing here that could be caught by, for example,
in-kernel deadlock detection.

I am basically certain it has _something_ to do with the puffs
filesystem, because of the puffsrpl waits and because it started
happening shortly after I added the puffs mount.  The real puzzle, for
me, in this latest hang is why/how the mailwrapper and git processes
ended up waiting for puffsrpl.  I will allocate a piece of disk for a
kernel coredump, so I can do detailed post-mortem on a wedged system.
(The machine's main function is to forward packets; I can't really keep
it in ddb for hours while I pore over details of a lockup.)

I will also add timeouts in the puffs userland code, so that if a
forked git process takes too long, it is nuked, with the access that
led to it returning an error - and, of course, logging all over the
place.
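
A rough sketch of what such a timeout could look like in plain POSIX C.  The helper name run_with_timeout and the 50 ms poll interval are assumptions for illustration, not the actual gitfs code; logging is left out.  It forks the child, polls for it with waitpid(WNOHANG), and kills it with SIGKILL once the deadline passes, returning ETIMEDOUT so the caller can fail the access that triggered the fork.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] in a child and wait roughly timeout_sec
 * seconds for it.  Returns 0 (with *status filled in) if the child finished
 * in time, ETIMEDOUT if it had to be killed, or another errno on failure.
 */
static int
run_with_timeout(char *const argv[], int timeout_sec, int *status)
{
	struct timespec tick = { 0, 50 * 1000 * 1000 };	/* 50 ms between polls */
	time_t deadline;
	pid_t pid, r;

	assert(argv != NULL && argv[0] != NULL);
	pid = fork();
	if (pid < 0)
		return errno;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* exec failed */
	}
	/* time() has one-second granularity, so the limit is approximate. */
	deadline = time(NULL) + timeout_sec;
	for (;;) {
		r = waitpid(pid, status, WNOHANG);
		if (r == pid)
			return 0;
		if (r < 0 && errno != EINTR)
			return errno;
		if (time(NULL) >= deadline)
			break;
		nanosleep(&tick, NULL);
	}
	/* Too slow: kill the child and reap it so no zombie is left behind. */
	kill(pid, SIGKILL);
	while (waitpid(pid, status, 0) < 0 && errno == EINTR)
		continue;
	return ETIMEDOUT;
}
```

One caveat with this shape: SIGKILL does not take effect while the child sits in an uninterruptible kernel sleep, so the final blocking waitpid() is a simplification; a version meant for this situation would probably want to give up on the reap after a while and just log the fact.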

I'm also going to change ddb's ps listing to include PPID; in this last
hang, I would have liked to have known whether the git process were a
child of the gitfs process.

I will also take that "other system" I mentioned, make a puffs mount,
and then start playing that game.  If I can get it to tstile in a
reasonable time frame, it will greatly accelerate debugging this.

> Mouse, start with the code you added ... make sure there are no
> problems like this buried in it somewhere (your own code, and
> everything it calls).

I haven't touched the puffs kernel code.  Of course, that doesn't mean
it doesn't have any such issues, but it makes it seem less likely to
me.  While it doesn't rule out such problems in any of my other
changes, it makes that too less likely; it would have to be a bug that
remained latent until I started using puffs....

> If that ends up finding nothing, then the best course of action might
> be to use a fairly new kernel.

Possibly, but unless a new kernel can be built with 5.2's compiler, I
run right back into the licensing issue that's the reason I froze at
5.2 to begin with.  I'd also have to port at least a few of my kernel
changes to the new kernel.

I may have to resort to that, but I'd much rather avoid it; even if the
licensing turns out to not be an issue, it would be a lot of work.

> I haven't seen a tstile lockup in ages, [...]

I never saw them at all until I started playing with puffs.  The major
reason I'm reluctant to suspect your "lock held by a nonexistent
process" theory (presumably with the culprit somewhere puffs-related)
here is those two processes waiting on puffsrpl which I would not
expect to be touching the puffs mountpoint at all.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

