netbsd-bugs: kern/34101: ltsleep during panic hangs system

Subject: kern/34101: ltsleep during panic hangs system
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <jld@panix.com>
List: netbsd-bugs
Date: 07/28/2006 03:15:00

>Number:         34101
>Category:       kern
>Synopsis:       ltsleep during panic hangs system
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 28 03:15:00 +0000 2006
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD panix3.panix.com 3.0 NetBSD 3.0 (PANIX-FIVE) #0: Fri Apr 14 21:05:29 EDT 2006  root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-FIVE i386
Architecture: i386
Machine: i386
>Description:

The top of ltsleep() contains this:

        /*
         * XXXSMP
         * This is probably bogus.  Figure out what the right
         * thing to do here really is.
         * Note that not sleeping if ltsleep is called with curlwp == NULL
         * in the shutdown case is disgusting but partly necessary given
         * how shutdown (barely) works.
         */
        if (cold || (doing_shutdown && (panicstr || (l == NULL)))) {
                /*
                 * After a panic, or during autoconfiguration,
                 * just give interrupts a chance, then just return;
                 * don't run any other procs or panic below,
                 * in case this is the idle process and already asleep.
                 */

The problem is that, if the system is panicking and trying to reboot
(which may include an attempt to sync disks), and a kernel thread that
loops calling ltsleep to wait for work (e.g., aiodoned, or i386's MD
apm_thread) gets woken up, every later ltsleep call returns
immediately, so the thread spins forever and the system never succeeds
in rebooting.
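
To illustrate the spin, here is a minimal userland sketch, not kernel
code.  Only the condition in model_ltsleep() is copied from the excerpt
above; model_ltsleep(), model_worker(), the two counters and the
iteration cap are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel globals tested at the top of ltsleep(). */
static bool cold = false;
static bool doing_shutdown = false;
static const char *panicstr = NULL;

static unsigned long real_sleeps;	/* calls that would have blocked */
static unsigned long early_returns;	/* calls that returned at once */

/*
 * Hypothetical model of ltsleep(): only the early-return test is taken
 * from the real function; everything else is simplified away.
 */
static int
model_ltsleep(const void *l)
{
	if (cold || (doing_shutdown && (panicstr || (l == NULL)))) {
		early_returns++;	/* returns without sleeping */
		return 0;
	}
	real_sleeps++;			/* a real kernel would block here */
	return 0;
}

/*
 * Hypothetical worker loop in the style of a "wait for work" kernel
 * thread.  The real loop is unbounded; max_iter caps it so that the
 * simulation terminates.  Returns how many iterations did not sleep.
 */
static unsigned long
model_worker(unsigned long max_iter)
{
	unsigned long before = early_returns;
	unsigned long i;
	int self;			/* stands in for a non-NULL curlwp */

	assert(max_iter > 0);
	for (i = 0; i < max_iter; i++) {
		/* No work is queued, so the thread tries to sleep. */
		(void)model_ltsleep(&self);
	}
	return early_returns - before;
}
```

With doing_shutdown set and panicstr non-NULL, every iteration of
model_worker() takes the early-return path, which is the busy loop
described above.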

However, the code appears to be that way for a reason, so the correct
solution is probably not to just yank it out and try to sleep normally.


PR port-i386/33353 was opened for the specific instance of this problem
with apm_thread.  In that special case it might be reasonable to have
the affected thread just exit if it's woken during a panic, but that
doesn't seem like the right solution (even if it would work).
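
For concreteness, the per-thread workaround could look roughly like the
userland sketch below.  This is an assumption about its shape, not the
change proposed in port-i386/33353; model_ltsleep() and
model_apm_like_thread() are hypothetical names, and in the kernel the
early return would presumably be a call to kthread_exit().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's panicstr global. */
static const char *panicstr = NULL;

/* Hypothetical stub: models ltsleep() returning without sleeping. */
static int
model_ltsleep(void)
{
	return 0;
}

/*
 * Hypothetical event-loop thread.  After each wakeup it checks whether
 * the system is panicking and, if so, gives up instead of looping.
 * max_iter caps the loop so the simulation terminates.  Returns the
 * number of iterations executed.
 */
static unsigned long
model_apm_like_thread(unsigned long max_iter)
{
	unsigned long i;

	assert(max_iter > 0);
	for (i = 0; i < max_iter; i++) {
		(void)model_ltsleep();
		if (panicstr != NULL)
			return i + 1;	/* kernel: exit the thread here */
		/* ... handle pending events ... */
	}
	return i;
}
```

The drawback is visible in the sketch: every thread with such a loop
needs its own check, which is why it reads as a special case rather
than a fix.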

>How-To-Repeat:

This happens most of the time when a host at Panix panics; often
enough that we've had to locally modify swwdog(4) to pass RB_NOSYNC,
and we use it as a workaround.

>Fix:

That's what I'm filing this PR to find out.  A somewhat distasteful
workaround is noted above.