Subject: kern/34101: ltsleep during panic hangs system
List: netbsd-bugs
Date: 07/28/2006 03:15:00
>Number:         34101
>Category:       kern
>Synopsis:       ltsleep during panic hangs system
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 28 03:15:00 +0000 2006
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD 3.0 NetBSD 3.0 (PANIX-FIVE) #0: Fri Apr 14 21:05:29 EDT 2006 i386
Architecture: i386
Machine: i386

>Description:
The top of ltsleep() contains this:

        /*
         * XXXSMP
         * This is probably bogus.  Figure out what the right
         * thing to do here really is.
         * Note that not sleeping if ltsleep is called with curlwp == NULL
         * in the shutdown case is disgusting but partly necessary given
         * how shutdown (barely) works.
         */
        if (cold || (doing_shutdown && (panicstr || (l == NULL)))) {
                /*
                 * After a panic, or during autoconfiguration,
                 * just give interrupts a chance, then just return;
                 * don't run any other procs or panic below,
                 * in case this is the idle process and already asleep.
                 */

The problem is that, if the system is panicking and trying to reboot
(which may include an attempt to sync disks), and a kernel thread that
loops calling ltsleep to wait for work (e.g., aiodoned, or i386's MD
apm_thread) gets woken up, every later ltsleep call in that thread
returns immediately instead of sleeping.  The thread then runs forever
and the system never succeeds in rebooting.

However, the check appears to be there for a reason, so the correct
solution is probably not to just yank it out and try to sleep normally.

PR port-i386/33353 was opened for the specific instance of this problem
with apm_thread.  In that special case it might be reasonable to have
the affected thread just exit if it's woken during a panic -- but that
somehow doesn't seem like the right solution (even if it'd work).


>How-To-Repeat:
This happens most of the time when a host at Panix panics, often
enough that we've had to locally modify swwdog(4) to pass RB_NOSYNC and
use it as a workaround.


>Fix:
That's what I'm filing this PR to find out.  A somewhat distasteful
workaround is noted above.