Subject: port-m68k/35099: pthread programs core on m68k
To: None <port-m68k-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: None <stix@stix.id.au>
List: netbsd-bugs
Date: 11/23/2006 07:10:00
>Number:         35099
>Category:       port-m68k
>Synopsis:       pthread programs core on m68k
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-m68k-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Nov 23 07:10:00 +0000 2006
>Originator:     Paul Ripke
>Release:        NetBSD 4.99.4 (-current 20061122ish)
>Organization:
>Environment:
	
	
System: NetBSD kitt.stix.org.au 4.99.4 NetBSD 4.99.4 (KITT) #0: Tue Nov 21 22:16:31 EST 2006 stix@zion.stix.org.au:/export/netbsd/current/obj.mac68k/export/netbsd/current/src/sys/arch/mac68k/compile/KITT mac68k

Architecture: m68k
Machine: mac68k
>Description:
Many pthread programs die with SIGILL after running for a while. The
failure appears to require more than one LWP (i.e. it does not occur when
threads are only switched in userspace). Since named(8) is now threaded,
it regularly dies with SIGILL.

>How-To-Repeat:

Using "fblckgen" from http://stix.id.au/wiki/iotools as a simple-ish test
(it only has two threads, for starters):

ksh$ PTHREAD_DEBUGLOG=1 time ./fblckgen -ab 4k -c 0 | cat > /dev/null
time: Command terminated abnormally.
       11.90 real         2.48 user         3.70 sys

The "cat" above is required to get NLWP>1. Unfortunately, gdb itself dumps
core while trying to analyse the core file:

ksh$ gdb fblckgen fblckgen.core 
GNU gdb 5.3nb1
...
Core was generated by `fblckgen'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /usr/lib/libpthread.so.0...done.
Loaded symbols for /usr/lib/libpthread.so.0
Reading symbols from /usr/lib/libc.so.12...done.
Loaded symbols for /usr/lib/libc.so.12
Reading symbols from /usr/libexec/ld.elf_so...done.
Loaded symbols for /usr/libexec/ld.elf_so
#0  0x049ffbe4 in ?? ()
(gdb) thr app all bt

Thread 3 (Thread 22 ()):
#0  0x04025284 in pthread__locked_switch () from /usr/lib/libpthread.so.0
#1  0x06bffb78 in ?? ()
Memory fault (core dumped)

The debuglog always ends the same way (only the addresses differ):

ksh$ debuglog -k | tail -20
(up 0x4e00000) sigev val 88880020
(up 0x4e00000) switching to 0xffe00000 (uc: U 0xffffb200 pc: 4025284)
(recycle 0xffe00000) recycling 0x4e00000
(up 0x4e00000) type 5 LWP 2 ev 0 intr 1
(fi 0x4e00000) victim 2 0x6a00000(1) lockholder 1
(rl 0x4e00000) entered
(rl 0x4e00000) intqueue 0x6a00000
(rl 0x4e00000) victim 0x6a00000 (uc T 0x6bffb6c) normal spinlocks: 1
(rl 0x4e00000) starting chain 0x6a00000 (uc T 0x6bffb6c pc 4029d08 sp 6bfff6c)
(rl 0x4e00000) returned from chain
(rl 0x4e00000) intqueue 0x6a00000
(rl 0x4e00000) victim 0x6a00000 (uc U 0x6bffb78) normal heldlock: 0x6690 switchto: 0xffe00000 (uc 0xffffb200 pc 4025284)
(rl 0x4e00000) exiting
(up 0x4e00000) sigev val 88880020
(up 0x4e00000) switching to 0xffe00000 (uc: U 0xffffb200 pc: 4025284)
(recycle 0xffe00000) recycling 0x4e00000
(up 0x4e00000) type 2 LWP 3 ev 1 intr 0
(up 0x4e00000) blocker 2 0xffe00000(1)
(up 0x4e00000) switching to 0x6a00000 (uc: U 0x6bffb78 pc: 4025284)
(recycle 0x6a00000) recycling 0x4e00000

Previously, with the tree tagged as netbsd-4 (before gcc4, etc.), gdb
could extract the following from the core:

Thread 3 (Thread 22 ()):
#0  0x04023174 in pthread__locked_switch () from /usr/lib/libpthread.so.0
#1  0x06bffb70 in ?? ()
#2  0x040283b2 in pthread_cond_wait () from /usr/lib/libpthread.so.0
#3  0x00003548 in makeBlocks (dummy=0x0) at fblckgen.c:234
#4  0x040296ec in pthread_create () from /usr/lib/libpthread.so.0

Thread 2 (LWP 1):
#0  0x040584c2 in write () from /usr/lib/libc.so.12
#1  0x04022fca in write () from /usr/lib/libpthread.so.0
#2  0x000031be in main (argc=65536, argv=0x0) at fblckgen.c:179

Thread 1 (LWP 2):
#0  0x049ffbe4 in ?? ()
#1  0x040283b2 in pthread_cond_wait () from /usr/lib/libpthread.so.0
#2  0x00003548 in makeBlocks (dummy=0x0) at fblckgen.c:234
#3  0x040296ec in pthread_create () from /usr/lib/libpthread.so.0
#0  0x049ffbe4 in ?? ()

This is odd, since the process has only 2 pthreads, yet three threads are
listed. The address 0x049ffbe4 appears to be bogus, and cores from
different runs all show a similar address.
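
For readers without fblckgen at hand, the thread structure visible in the
backtraces above (the main thread in write(), a worker blocked in
pthread_cond_wait()) can be approximated by a sketch like the one below.
This is not the fblckgen source: struct handoff, make_blocks() and
run_blocks() are hypothetical names, and the sketch has not been confirmed
to trigger the SIGILL.

```c
/* Hypothetical two-thread sketch; NOT the fblckgen source. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct handoff {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	char           *block;    /* single buffer shared by both threads */
	size_t          blksz;
	size_t          nblocks;  /* number of blocks the worker produces */
	int             full;     /* 1 while block holds unwritten data */
	int             stop;     /* set by the writer after a write error */
};

/* Worker: fill the buffer, then sleep in pthread_cond_wait() until the
 * writer has drained it. */
static void *make_blocks(void *arg)
{
	struct handoff *h = arg;

	for (size_t i = 0; i < h->nblocks; i++) {
		pthread_mutex_lock(&h->lock);
		while (h->full && !h->stop)
			pthread_cond_wait(&h->cond, &h->lock);
		if (h->stop) {
			pthread_mutex_unlock(&h->lock);
			break;
		}
		memset(h->block, (int)('a' + i % 26), h->blksz);
		h->full = 1;
		pthread_cond_signal(&h->cond);
		pthread_mutex_unlock(&h->lock);
	}
	return NULL;
}

/* Writer side: write each produced block to fd.
 * Returns the number of bytes written, or -1 on error. */
static long run_blocks(int fd, size_t nblocks, size_t blksz)
{
	struct handoff h;
	pthread_t worker;
	long total = 0;

	memset(&h, 0, sizeof(h));
	h.blksz = blksz;
	h.nblocks = nblocks;
	h.block = malloc(blksz);
	if (h.block == NULL)
		return -1;
	pthread_mutex_init(&h.lock, NULL);
	pthread_cond_init(&h.cond, NULL);
	if (pthread_create(&worker, NULL, make_blocks, &h) != 0) {
		free(h.block);
		return -1;
	}

	for (size_t i = 0; i < nblocks; i++) {
		pthread_mutex_lock(&h.lock);
		while (!h.full)
			pthread_cond_wait(&h.cond, &h.lock);
		pthread_mutex_unlock(&h.lock);

		/* The worker leaves the buffer alone while full == 1, so the
		 * possibly blocking write() runs without the mutex held. */
		ssize_t n = write(fd, h.block, blksz);

		pthread_mutex_lock(&h.lock);
		if (n < 0)
			h.stop = 1;
		h.full = 0;
		pthread_cond_signal(&h.cond);
		pthread_mutex_unlock(&h.lock);

		if (n < 0) {
			total = -1;
			break;
		}
		total += n;
	}

	pthread_join(worker, NULL);
	pthread_cond_destroy(&h.cond);
	pthread_mutex_destroy(&h.lock);
	free(h.block);
	return total;
}
```

Calling run_blocks(STDOUT_FILENO, n, 4096) from main() and piping the
output through cat would mirror the invocation shown earlier. Whether that
is enough to reach NLWP>1 and hit the fault is an assumption.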

I believe this problem is already known, but I couldn't find a PR
specifically for this issue.

>Fix:

Unknown.

>Unformatted: