NetBSD-Bugs archive


kern/44500: 4.0 sa threaded apps hard hang netbsd-5 and HEAD kernels on some ports [cpu_setfunc() related]

>Number:         44500
>Category:       kern
>Synopsis:       4.0 sa threaded apps hard hang netbsd-5 and HEAD kernels on some ports [cpu_setfunc() related]
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jan 31 20:40:00 +0000 2011
>Originator:     Chuck Cranor
>Release:        netbsd-5 branch, June 10th 2009 and later, also HEAD if SA compat support is enabled (kern.no_sa_support=0)
Carnegie Mellon University
        NetBSD 5.0_STABLE (GENERIC-$Revision: 1.325 $) #8: Mon Jan 31 13:24:52 EST 2011

        NetBSD 5.99.44 (GENERIC-$Revision: 1.338 $) #1: Sun Jan 30 13:20:09 EST 

Architecture: alpha
Machine: alpha

The problem was introduced to the netbsd-5 branch via netbsd-5 ticket
number 798:

So a NetBSD 5.0 kernel is OK, but a NetBSD 5.0.2 kernel is not.

    The changes to vm_machdep.c appear to have removed the
call to cpu_setfunc() from cpu_lwp_fork() and replaced it
with the actual content of the old cpu_setfunc() function.
The net result here is that the behavior of cpu_lwp_fork() 
does not change, but it no longer calls cpu_setfunc().

    The old cpu_setfunc() is now replaced with a new stripped-down
version that calls setfunc_trampoline() instead of
lwp_trampoline() [the s3 register is no longer set up or used].
The only thing that calls cpu_setfunc() is now compat_sa.c
(cpu_lwp_fork() no longer calls it).

    The main difference between lwp_trampoline() and the new
setfunc_trampoline() is that setfunc_trampoline() no longer
calls lwp_startup().   Removing the call to lwp_startup() causes
the alpha to hang hard if you run a 4.0 threaded app like "dig"...

    So, lwp_startup() does something that keeps the system from
hanging.   To figure out what that was, I started adding bits
of lwp_startup() into setfunc_trampoline() until the system
stopped hanging.   It turns out the two critical bits are:

xlwp_startup(struct lwp *prev, struct lwp *new)
        if (prev != NULL) {
                curcpu()->ci_mtx_count++;  /*YES*/
                prev->l_ctxswtch = 0;      /*YES*/

    Put that much of lwp_startup() back into setfunc_trampoline(), and 
the system no longer hangs when you run "dig"... a complete diff
that applies to a netbsd-5 branch checked out on date 10-Jun-2009
(e.g. with "cvs -q update -r netbsd-5 -dP -D 10-Jun-2009") is included
at the end.

    You need both the l_ctxswtch and ci_mtx_count statements.
If you comment out the "l_ctxswtch" statement, the system hangs
as soon as you run "dig".    If you comment out the ci_mtx_count
statement, the system runs "dig" (it prints an error message to
console) but then hangs when "dig" exits.   Couldn't get DDB in
either case.

    The hard hang occurs in mi_switch(): the kernel gets stuck
in an endless loop here (I added the debugging line):

                 * We may need to spin-wait for if 'newl' is still
                 * context switching on another CPU.

               if (newl->l_ctxswtch != 0) {
                        u_int count;
                        count = SPINLOCK_BACKOFF_MIN;
                        while (newl->l_ctxswtch) {
printf("POINTA\n");  /*XXXCDC*/

It just prints "POINTA" endlessly.   Note that my system has only one
CPU, so the case the comment describes (another CPU still context
switching) does not apply.  Because interrupts are disabled, it is not
possible to break into DDB while stuck in that while() loop; the
system is hung.

Looking at HEAD, the current state of the tree is not uniform:

arch    cpu_setfunc calls       does it call lwp_startup?  when changed?
------- ----------------------  ----------------------------------------
acorn26 lwp_trampoline          yes
alpha   setfunc_trampoline      no (vm_machdep.c 1.100, 2009/06/01)
arm32   lwp_trampoline          yes
hppa    setfunc_trampoline      no (vm_machdep.c 1.36, 2009/06/03)
m68k    setfunc_trampoline      no (vm_machdep.c 1.28, 2009/05/30)
mips    setfunc_trampoline      no (vm_machdep.c 1.123, 2009/05/30)
powerpc setfunc_trampoline      no (vm_machdep.c 1.77, 2009/06/07)
sh3     lwp_setfunc_trampoline  no (never called lwp_startup?)
sparc   lwp_setfunc_trampoline  no (vm_machdep.c 1.100, 2009/05/29)
sparc64 lwp_setfunc_trampoline  no (vm_machdep.c 1.89, 2009/05/30)
x86     lwp_trampoline          yes

The "no" ports are likely to have problems with compat_sa binaries,
I think.


        Find a NetBSD 4.0 binary that uses SA threads.  A statically
        linked version of /usr/bin/dig will do... here is an alpha one:

        Boot the system single user, enable the SA compat code (if on
        HEAD), and run the binary.

        >>> boot -file testin -fl s
        # mount -r /usr
        # sysctl -w kern.no_sa_support=0
        # /root/dig.static
        << system hangs, power cycle required to recover >>


This is just a workaround, not a fix:

Index: arch/alpha/alpha/locore.s
RCS file: /cvsroot/src/sys/arch/alpha/alpha/locore.s,v
retrieving revision
diff -u -r1.113.10.1 locore.s
--- arch/alpha/alpha/locore.s   9 Jun 2009 17:38:38 -0000
+++ arch/alpha/alpha/locore.s   30 Jan 2011 03:47:33 -0000
@@ -752,6 +752,9 @@
  * Simplified version of above: don't call lwp_startup()
 LEAF_NOPROFILE(setfunc_trampoline, 0)
+       mov     v0, a0   /* NEW */
+       mov     s3, a1   /* NEW */
+       CALL(xlwp_startup)   /* NEW */
        mov     s0, pv
        mov     s1, ra
        mov     s2, a0
Index: arch/alpha/alpha/vm_machdep.c
RCS file: /cvsroot/src/sys/arch/alpha/alpha/vm_machdep.c,v
retrieving revision
diff -u -r1.96.30.1 vm_machdep.c
--- arch/alpha/alpha/vm_machdep.c       9 Jun 2009 17:38:39 -0000
+++ arch/alpha/alpha/vm_machdep.c       30 Jan 2011 03:47:33 -0000
@@ -228,6 +228,8 @@
            (u_int64_t)exception_return;        /* s1: ra */
        up->u_pcb.pcb_context[2] =
            (u_int64_t)arg;                     /* s2: arg */
+       up->u_pcb.pcb_context[3] =
+           (u_int64_t)l;                       /* s3: lwp */
        up->u_pcb.pcb_context[7] =
            (u_int64_t)setfunc_trampoline;      /* ra: assembly magic */
Index: kern/kern_lwp.c
RCS file: /cvsroot/src/sys/kern/kern_lwp.c,v
retrieving revision
diff -u -r1.126.2.2 kern_lwp.c
--- kern/kern_lwp.c     8 Mar 2009 03:15:36 -0000
+++ kern/kern_lwp.c     30 Jan 2011 03:48:08 -0000
@@ -706,6 +706,22 @@
+ * Called by MD code when a new LWP begins execution.  Must be called
+ * with the previous LWP locked (so at splsched), or if there is no
+ * previous LWP, at splsched.
+ */
+void xlwp_startup(struct lwp *prev, struct lwp *new);
+xlwp_startup(struct lwp *prev, struct lwp *new)
+       if (prev != NULL) {
+               curcpu()->ci_mtx_count++;  /*YES*/
+               prev->l_ctxswtch = 0;      /*YES*/
+       }
  * Exit an LWP.
