Subject: The old threaded app paging and dying problem
To: None <port-sparc@netbsd.org, port-sparc64@netbsd.org>
From: Geoff Adams <gadams@avernus.com>
List: port-sparc64
Date: 08/07/2006 02:05:58
I'm still running into this problem, even in -current (3.99.24). That  
is, any program that uses pthreads will die, sooner or later. As I  
understand it, this happens when some or all of the threaded program  
is paged in.

This makes it increasingly hard to use the otherwise ideally suited  
NetBSD/sparc{,64} as a server platform for some significant  
applications, such as email, where milters are highly desirable and  
inherently threaded, or web serving, where I want to run Ruby code on  
the back end. Fortunately, bind9 can be compiled without thread support.

What can we do about this issue? I assume it's still there because  
it's hard to reproduce in a test harness to find out just what's  
wrong. I couldn't find any recent traffic on the mailing lists about  
this issue. Has anybody looked into this problem recently? Are there  
any clues about where to look? It seems to affect only my sparc and  
sparc64 hosts, and not my alphas, so my first guess is that it's  
either in md code or it's something like an alignment problem in mi  
code that doesn't cause problems on many ports.

However, Chuck Silvers's post to port-macppc
<http://mail-index.netbsd.org/port-macppc/2005/02/03/0001.html> a year
and a half ago would indicate that this problem is not limited to the
sparc ports. He refers to mycroft's 'libpthread hacks,' which have long
since been committed to the tree, and so appear in both the netbsd-3
branch and the trunk. (Some of his changes are wrapped in '#ifdef  
PTHREAD_MLOCK_KLUDGE' and '#ifdef PTHREAD__DEBUG', so I was going to  
rebuild libpthread with those defined, but the default build of  
libpthread already defines PTHREAD_MLOCK_KLUDGE. And still, my  
threaded processes die.)

So, not knowing where to start, I ran 'ktrace /usr/pkg/sbin/named -u  
named -t /var/chroot/named -g'. It crashed some minutes later. The  
last lines of the 'kdump -R' looked like this:

   2257      5 named    0.000151492 CALL  setcontext(0x2afff480)
   2257      5 named    0.000043497 RET   setcontext JUSTRETURN
   2257      2 named    0.000906950 SAU   blocked, event= [<ctx=0x24fffe40, id=2, cpu=0>]
   2257      2 named    0.000124993 CALL  setcontext(0x217ff800)
   2257      2 named    0.000043997 RET   setcontext JUSTRETURN
   2257      2 named    0.000042498 CALL  sa_yield
   2257      2 named    0.003410810 SAU   unblocked, event= [<ctx=0x24fffe40, id=2, cpu=0>], intr=[<ctx=0x2afff108, id=5, cpu=0>]
   2257      2 named    0.000048497 RET   sa_yield JUSTRETURN
   2257      5 named    0.011603353 SAU   blocked, event= [<ctx=0x257ffe40, id=5, cpu=0>]
   2257      5 named    0.000243986 SAU   unblocked, event= [<ctx=0x257ffe40, id=5, cpu=0>], intr=[<ctx=0x2affec50, id=2, cpu=0>]
   2257      5 named    0.000135493 CALL  setcontext(0x2affec50)
   2257      5 named    0.000041998 RET   setcontext JUSTRETURN
   2257      5 named    0.000093994 PSIG  SIGSEGV SIG_DFL
   2257      3 named    0.000387979 RET   select -1 errno 4 Interrupted system call
   2257      1 named    0.000122493 RET   __sigtimedwait -1 errno 87 Operation Canceled

A second time, named ran for over an hour, and then died with a  
SIGBUS, rather than SIGSEGV:

18342      5 named    0.000153492 CALL  setcontext(0x2afff5f0)
18342      5 named    0.000047997 RET   setcontext JUSTRETURN
18342      2 named    0.000814954 SAU   blocked, event= [<ctx=0x23fffe40, id=2, cpu=0>]
18342      2 named    0.000121994 CALL  setcontext(0x217ff800)
18342      2 named    0.000046997 RET   setcontext JUSTRETURN
18342      2 named    0.000040998 CALL  sa_yield
18342      2 named    0.018910942 SAU   unblocked, event= [<ctx=0x23fffe40, id=2, cpu=0>], intr=[<ctx=0x2afff1d0, id=5, cpu=0>]
18342      2 named    0.000054997 RET   sa_yield JUSTRETURN
18342      5 named    0.029272363 SAU   blocked, event= [<ctx=0x247ffe40, id=5, cpu=0>]
18342      5 named    0.000246986 SAU   unblocked, event= [<ctx=0x247ffe40, id=5, cpu=0>], intr=[<ctx=0x2affed18, id=2, cpu=0>]
18342      5 named    0.000136992 CALL  setcontext(0x2affed18)
18342      5 named    0.000046498 RET   setcontext JUSTRETURN
18342      5 named    0.000059996 PSIG  SIGBUS SIG_DFL
18342      3 named    0.021391804 RET   select -1 errno 4 Interrupted system call
18342      1 named    0.000131993 RET   __sigtimedwait -1 errno 87 Operation Canceled
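
For anyone who wants to compare several such traces, here is a rough
Python sketch that pulls the fatal signal and the last setcontext()
argument out of 'kdump -R' text. It is only an illustration: the
column layout (pid, LWP id, command, relative time, record type) is
inferred from the output above, and the names parse_kdump,
last_setcontext_before_signal and SAMPLE are made up for this example.

```python
import re

# Start of a record as it appears in the traces above. The meaning of the
# columns (pid, LWP id, command, relative time, type) is an assumption
# read off that output, not taken from kdump documentation.
RECORD = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\d+\.\d+)\s+(\S+)\s+(.*)$")
SETCONTEXT = re.compile(r"setcontext\((0x[0-9a-f]+)\)")


def parse_kdump(text):
    """Parse kdump text into dicts, folding mail-wrapped lines back in."""
    records = []
    for line in text.splitlines():
        m = RECORD.match(line)
        if m:
            pid, lwp, comm, delta, kind, rest = m.groups()
            records.append({
                "pid": int(pid), "lwp": int(lwp), "comm": comm,
                "delta": float(delta), "kind": kind, "detail": rest.strip(),
            })
        elif records and line.strip():
            # Not a record start: treat it as the wrapped tail of the
            # previous record.
            records[-1]["detail"] += " " + line.strip()
    return records


def last_setcontext_before_signal(records):
    """Return (signal, lwp, ctx) for the first PSIG record, or None.

    ctx is the argument of the most recent setcontext() CALL made by the
    LWP that took the signal, or None if no such call precedes it.
    """
    for i, rec in enumerate(records):
        if rec["kind"] != "PSIG":
            continue
        signal = rec["detail"].split()[0]
        for prev in reversed(records[:i]):
            if prev["lwp"] == rec["lwp"] and prev["kind"] == "CALL":
                m = SETCONTEXT.match(prev["detail"])
                if m:
                    return signal, rec["lwp"], m.group(1)
        return signal, rec["lwp"], None
    return None


# Tail of the first trace above, wrapped the way it was in the mail.
SAMPLE = """\
   2257      5 named    0.011603353 SAU   blocked, event=
[<ctx=0x257ffe40, id=5, cpu=0>]
   2257      5 named    0.000243986 SAU   unblocked, event=
[<ctx=0x257ffe40, id=5, cpu=0>], intr=[<ctx=0x2affec50, id=2, cpu=0>]
   2257      5 named    0.000135493 CALL  setcontext(0x2affec50)
   2257      5 named    0.000041998 RET   setcontext JUSTRETURN
   2257      5 named    0.000093994 PSIG  SIGSEGV SIG_DFL
   2257      3 named    0.000387979 RET   select -1 errno 4
Interrupted system call
"""

if __name__ == "__main__":
    print(last_setcontext_before_signal(parse_kdump(SAMPLE)))
```

In both traces above, the context handed to that last setcontext() is
the same address reported as the 'intr' context in the 'unblocked'
record just before it (0x2affec50 and 0x2affed18 respectively).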

How can I help solve this problem?

- Geoff