lib/59784: dlopening and dlclosing libpthread is broken

To: lib-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: lib/59784: dlopening and dlclosing libpthread is broken
From: campbell+netbsd%mumble.net@localhost
Date: Sat, 22 Nov 2025 16:20:00 +0000 (UTC)
>Number:         59784
>Category:       lib
>Synopsis:       dlopening and dlclosing libpthread is broken
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Nov 22 16:20:00 +0000 2025
>Originator:     Taylor R Campbell
>Release:        current, 11, 10, 9, ...
>Organization:
Locked and Unloaded LLC
>Environment:
>Description:

	A program that dlopens (a library linked against) libpthread
	and then dlcloses it can find itself in a pretty pickle with
	mysterious symptoms like this:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000079bbe310cccc in ?? ()
#0  0x000079bbe310cccc in ?? ()
#1  0x000079bbe2e9847c in __deregister_frame_info_bases () from /usr/lib/libgcc_s.so.1
#2  0x000079bbe2e86365 in __do_global_dtors_aux () from /usr/lib/libgcc_s.so.1
#3  0x000079bbe311ac00 in ?? ()
#4  0x000079bbe2e99a79 in _fini () from /usr/lib/libgcc_s.so.1
#5  0x000079bbe3585120 in atexit_handler_stack () from /usr/lib/libc.so.12
#6  0x00007f7ff709fbe1 in _rtld_call_initfini_function (mask=0x7f7fff539130, func=0x79bbe2e99a70 <_fini>) at /home/riastradh/netbsd/11/src/libexec/ld.elf_so/rtld.c:152
#7  _rtld_call_fini_function (obj=0x79bbe2e9ddf0, mask=0x7f7fff539130, cur_objgen=4) at /home/riastradh/netbsd/11/src/libexec/ld.elf_so/rtld.c:167
#8  0x00007f7ff70a06a6 in _rtld_call_fini_functions (force=1, mask=0x7f7fff539130) at /home/riastradh/netbsd/11/src/libexec/ld.elf_so/rtld.c:213
#9  _rtld_exit () at /home/riastradh/netbsd/11/src/libexec/ld.elf_so/rtld.c:431
#10 0x000079bbe32c895f in __cxa_finalize (dso=dso@entry=0x0) at /home/riastradh/netbsd/11/src/lib/libc/stdlib/atexit.c:222
#11 0x000079bbe32c853b in exit (status=status@entry=0) at /home/riastradh/netbsd/11/src/lib/libc/stdlib/exit.c:60
#12 0x000079bbe3592b90 in pass (ctx=0x79bbe359e860 <Current>) at /home/riastradh/netbsd/11/src/external/bsd/atf/dist/atf-c/tc.c:337
#13 0x000079bbe35931d5 in atf_tc_run (tc=0x792168 <atfu_dlopen_tc>, resfile=<optimized out>) at /home/riastradh/netbsd/11/src/external/bsd/atf/dist/atf-c/tc.c:1041
#14 0x000079bbe359000e in atf_tp_run (tp=tp@entry=0x7f7fff5392c0, tcname=<optimized out>, resfile=<optimized out>) at /home/riastradh/netbsd/11/src/external/bsd/atf/dist/atf-c/tp.c:205
#15 0x000079bbe358fb95 in run_tc (exitcode=<synthetic pointer>, p=0x7f7fff5392e0, tp=0x7f7fff5392c0) at /home/riastradh/netbsd/11/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:510
#16 controlled_main (exitcode=<synthetic pointer>, add_tcs_hook=0x78fad8 <atfu_tp_add_tcs>, argv=<optimized out>, argc=<optimized out>) at /home/riastradh/netbsd/11/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:580
#17 atf_tp_main (argc=<optimized out>, argv=<optimized out>, add_tcs_hook=add_tcs_hook@entry=0x78fad8 <atfu_tp_add_tcs>) at /home/riastradh/netbsd/11/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:610
#18 0x000000000078fcb6 in main (argc=<optimized out>, argv=<optimized out>) at /home/riastradh/netbsd/11/src/tests/lib/libpthread/dlopen/t_dlopen.c:163
#19 0x000000000078f4eb in ___start (cleanup=<optimized out>, ps_strings=0x7f7fff539fe0) at /home/riastradh/netbsd/11/src/lib/csu/common/crt0-common.c:375
#20 0x00007f7ff70a68d0 in ?? () from /usr/libexec/ld.elf_so
#21 0x0000000000000005 in ?? ()
#22 0x00007f7fff539968 in ?? ()
#23 0x00007f7fff539971 in ?? ()
#24 0x00007f7fff53998b in ?? ()
#25 0x00007f7fff5399ae in ?? ()
#26 0x00007f7fff5399c9 in ?? ()
#27 0x0000000000000000 in ?? ()

	Setting a breakpoint on __deregister_frame_info_bases and
	single-stepping through it reveals that the crash happens when
	it tries to call __libc_mutex_lock via the PLT and jumps into
	code in libpthread.so that, after dlclose, is no longer mapped.
	Why is it trying to jump there?

	What happened is:

	1. The program dlopened (a library linked against) libpthread.

	2. The program called pthread_mutex_lock -- or rather,
	   __libc_mutex_lock, renamed via #define in <pthread.h>.

	3. The symbol __libc_mutex_lock has two definitions:

	   (a) A weak definition in libc.so -- the no-op thread stub.
	   (b) A strong definition in libpthread.so -- the real one.

	   Lazy binding of the symbol chooses the strong one, so the
	   entry for __libc_mutex_lock in the .got.plt is bound to
	   libpthread.so's definition, as shown by `info proc mappings'
	   and single-stepping in gdb:

(gdb) info proc mappings
...
      0x7ee838cfb000     0x7ee838d03000     0x8000     0x7000  r-x CNPD /lib/libpthread.so.1.5
...
(gdb) display/i $pc
1: x/i $pc
=> 0x7ee838a8a402 <__deregister_frame_info_bases+4>:    push   %r12
(gdb) si
...
(gdb) si
0x00007ee838a8a477 in __deregister_frame_info_bases ()
   from /usr/lib/libgcc_s.so.1
1: x/i $pc
=> 0x7ee838a8a477 <__deregister_frame_info_bases+121>:
    call   0x7ee838a78150 <__libc_mutex_lock@plt>
(gdb) si
0x00007ee838a78150 in __libc_mutex_lock@plt () from /usr/lib/libgcc_s.so.1
1: x/i $pc
=> 0x7ee838a78150 <__libc_mutex_lock@plt>:
    jmp    *0x17f42(%rip)        # 0x7ee838a90098 <__libc_mutex_lock@got.plt>
(gdb) x/xg $rip + 6 + 0x17f42
0x7ee838a90098 <__libc_mutex_lock@got.plt>:     0x00007ee838cfeccc
(gdb) si
pthread_mutex_lock (ptm=0x7ee838a90400 <object_mutex>)
    at /home/riastradh/netbsd/11/src/lib/libpthread/pthread_mutex.c:204
1: x/i $pc
=> 0x7ee838cfeccc <pthread_mutex_lock>:
    mov    0x92b5(%rip),%rax        # 0x7ee838d07f88

	   Note that 0x7ee838cfeccc lies in the interval
	   [0x7ee838cfb000,0x7ee838d03000) where libpthread.so is
	   mapped.

	4. dlclose unmapped everything in libpthread.so -- including the
	   pages of instructions that the .got.plt entry for
	   __libc_mutex_lock now points to -- and dlclose has no
	   mechanism to _unbind_ that entry.

	5. The next thing that tried to call __libc_mutex_lock jumped
	   into oblivion where libpthread.so used to be.  In the test
	   case above, that happened to be in some mysterious code path
	   at program exit, but it could just as well have been, say,
	   one of the stdio(3) functions taking a FILE lock.

(gdb) si
0x00007ee838a8a477 in __deregister_frame_info_bases ()
   from /usr/lib/libgcc_s.so.1
1: x/i $pc
=> 0x7ee838a8a477 <__deregister_frame_info_bases+121>:
    call   0x7ee838a78150 <__libc_mutex_lock@plt>
(gdb) si
0x00007ee838a78150 in __libc_mutex_lock@plt () from /usr/lib/libgcc_s.so.1
1: x/i $pc
=> 0x7ee838a78150 <__libc_mutex_lock@plt>:
    jmp    *0x17f42(%rip)        # 0x7ee838a90098 <__libc_mutex_lock@got.plt>
(gdb) si
0x00007ee838cfeccc in ?? ()
1: x/i $pc
=> 0x7ee838cfeccc:      <error: Cannot access memory at address 0x7ee838cfeccc>
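
	The address arithmetic in the gdb session under step 3 can be
	double-checked with a few lines of C.  This is only a sketch
	for the reader: the function names rip_relative_target and
	in_mapping are made up for illustration and are not part of
	gdb, rtld, or libpthread.  The constants are copied from the
	transcript above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: target of an x86-64 RIP-relative operand, i.e.
 * the address of the *next* instruction plus the signed displacement.
 * This mirrors `x/xg $rip + 6 + 0x17f42' in the transcript, where 6 is
 * the length of the `jmp *disp32(%rip)' instruction.
 */
uint64_t
rip_relative_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
	return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}

/*
 * Hypothetical helper: true if addr lies in the half-open interval
 * [start, end), the form in which `info proc mappings' reports a
 * mapping.
 */
bool
in_mapping(uint64_t addr, uint64_t start, uint64_t end)
{
	assert(start <= end);
	return start <= addr && addr < end;
}
```

	With the values above, rip_relative_target(0x7ee838a78150, 6,
	0x17f42) gives the .got.plt slot address 0x7ee838a90098, and
	in_mapping(0x7ee838cfeccc, 0x7ee838cfb000, 0x7ee838d03000)
	holds for the value stored in that slot.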

	Why doesn't RTLD_LOCAL limit the scope of libpthread.so's
	__libc_mutex_lock definition so only those .got.plt entries for
	objects that dlclose is unloading will point to the
	libpthread.so one, and any .got.plt entries for objects in the
	global namespace will get the libc.so weak one?

	=> Because the library that the test dlopens, which is linked
	   against libpthread.so, is _also_ linked against libgcc_s.so,
	   which is already marked with -Wl,-z,nodelete -- and
	   libgcc_s.so's .got.plt entry for __libc_mutex_lock is
	   resolved in the RTLD_LOCAL scope and bound to
	   libpthread.so's __libc_mutex_lock.  If we remove libgcc_s.so
	   (by not using LIBISCXX=yes in the test library -- not sure
	   why we're using that anyway), the symptom goes away.
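
	The interaction described above can be illustrated with a toy
	model in C.  This is a sketch under loud assumptions, not how
	ld.elf_so is implemented: struct object, struct got_slot,
	bind_slot, toy_dlclose, and the library name h_testlib.so are
	all invented here, and symbol lookup is reduced to "a strong
	definition in scope beats libc's weak stub".  It models a slot
	in a nodelete object that is bound once and never revisited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or functions exist in ld.elf_so. */
struct object {
	const char *name;
	bool mapped;		/* still mapped in the address space? */
	bool nodelete;		/* linked with -Wl,-z,nodelete? */
	bool has_strong_def;	/* strong __libc_mutex_lock definition? */
};

struct got_slot {
	const struct object *owner;	/* object holding the .got.plt slot */
	const struct object *target;	/* object it is bound to, NULL if unbound */
};

/* Lazy binding: done once; a strong definition in scope beats libc's stub. */
void
bind_slot(struct got_slot *slot, const struct object *scope, size_t n,
    const struct object *libc)
{
	if (slot->target != NULL)
		return;		/* already bound, never revisited */
	slot->target = libc;
	for (size_t i = 0; i < n; i++) {
		if (scope[i].has_strong_def) {
			slot->target = &scope[i];
			break;
		}
	}
}

/* Unmap every object in the dlopened group that is not nodelete. */
void
toy_dlclose(struct object *group, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!group[i].nodelete)
			group[i].mapped = false;
	}
}

/* A slot dangles if its owner survives but its target was unmapped. */
bool
slot_dangles(const struct got_slot *slot)
{
	assert(slot->owner != NULL);
	return slot->owner->mapped && slot->target != NULL &&
	    !slot->target->mapped;
}

/* Does libgcc_s.so's slot dangle after dlclose of the test library? */
bool
gcc_slot_dangles_after_dlclose(bool pthread_nodelete)
{
	struct object libc = { "libc.so", true, false, false };
	struct object group[3] = {
		{ "h_testlib.so", true, false, false },	/* hypothetical name */
		{ "libgcc_s.so", true, true, false },
		{ "libpthread.so", true, pthread_nodelete, true },
	};
	struct got_slot gcc_slot = { &group[1], NULL };

	bind_slot(&gcc_slot, group, 3, &libc);
	toy_dlclose(group, 3);
	return slot_dangles(&gcc_slot);
}
```

	In this model gcc_slot_dangles_after_dlclose(false) is true,
	matching the crash, and gcc_slot_dangles_after_dlclose(true)
	is false, which is the case where libpthread.so itself is
	marked nodelete.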

>How-To-Repeat:

	cd /usr/tests/lib/libpthread/dlopen
	atf-run | atf-report

	Caveat: This no longer works as a test case for this particular
	bug in HEAD, because __deregister_frame_info_bases has changed
	to avoid taking a lock with __libc_mutex_lock.  Need to
	construct a test case that still works in HEAD in spite of
	those changes.

>Fix:

	Add to lib/libpthread/Makefile:

	LDADD+=		-Wl,-z,nodelete

	This prevents rtld from actually unloading libpthread.

	The same is probably needed for any library that provides a
	strong definition of a symbol which, when the library isn't
	loaded, is still used via a weak definition from some other
	source -- like __libc_mutex_lock.

	It's a dark corner of ELF wizardry that we probably don't use
	much outside of libpthread.so but I can't rule out the
	possibility that someone has dabbled in such nefarious magic
	elsewhere.