Subject: deceased session cows
To: None <tech-kern@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 03/29/2002 20:24:08
I was fiddling with some console redirection stuff and my machine just
fell over hard.  Before sending-pr, though, I'd thought I'd ask if this
is known-and-fixed, because this _is_ a somewhat old kernel.  I did
poke at things with google and with the PR search interface and didn't
find anything that looked promising, but I admittedly could have just
missed it.

I got a kernel core, and I built the kernel with full debugging (I've
been doing that routinely, and now I'm glad).  The relevant piece of
the call stack is sigexit -> exit1 -> fdfree -> closef -> vn_closefile
-> vn_close -> ufsspec_close -> spec_close -> ptcclose -> ttymodem ->
psignal -> BOOM.  The relevant call in ttymodem is

	psignal(tp->t_session->s_leader, SIGHUP);

and tp->t_session->s_leader is 0x52beef - highly suspicious.  However,
because fdfree() is called well before the session is torn down in
exit1(), the session should still be valid at this point - if the
terminal's session's leader is the process exiting.

I suspect this has something to do with revoke().  To explain why, I
need to describe in a bit more detail how I provoked it.  Fortunately,
it appears very repeatible.

The machine is a SPARCstation 10.  I log in on the console, using kbd0.
I start an X session.  Once it's up, I run a daemon I just wrote, which
accepts IPC connections and feeds console output to them; it grabs a
pty and makes it the console when it has clients, releasing it when it
doesn't.  Then I run a window that contains a connection to the daemon;
the daemon takes the console.  At this point I can "echo foo >
/dev/console" and get a "foo" in the window; I can su and see the su
log message in that window.  It seems to handle output correctly.

Then I switch to kbd1 and type RETURN.  If the console daemon is not
running, this just generates a newline as input on /dev/console, which
produces another shell prompt (the X startup script I use returns after
starting the session - the machine is dual-headed).  In this case, the
shell promptly falls over; I need to investigate why, but that's not
directly relevant.  In the console window, I see output from the shell
dying, then I see another console login prompt, and then I see init log
a message that getty is repeating too quickly on /dev/console.

At this point I kill the console daemon (with kill from a shell) and
the machine instantly drops into ddb.  ps on the coredump reveals that
the console daemon is in the process of exiting, and its stack trace
from postmortem gdb matches the one I see with ddb's trace on the live
system.  The crash is an alignment fault in kernel mode, coming from
tp->t_session->s_leader being junk; psignal does

        if (p->p_flag & P_TRACED)

which (quite correctly) traps if p is 0x52beef.  (I've also seen
0x12beef.)

I don't yet know more.  I'll be investigating; I'm sending this in the
hope that someone who already knows the relevant codepaths can identify
the problem quicker than I can grovel through and figure it out myself.
I'll send it pr once I have a less vague idea what's going on, unless
it's a "fixed in -current".

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B