NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/53202: Kernel hangs running t_ptrace_wait:resume1 test



>Number:         53202
>Category:       kern
>Synopsis:       Kernel hangs running t_ptrace_wait:resume1 test
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Apr 22 17:15:00 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2017.12.02.22.51.22
>Organization:

>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

Since the commit of kern_lwp.c 1.191, executing the resume1 test case
of the t_ptrace_wait test can cause the operating system to hang.  The
problem was first reported in

  http://mail-index.netbsd.org/current-users/2017/12/04/msg032841.html

but at that time it was not yet clear whether just the test framework
was hanging, or the kernel; it has now become clear that it is in fact
the kernel.

The problem is also mentioned in passing in comments to PR 51995, but
since the subject of that PR is "ptrace(2) PT_RESUME is not reliable"
rather than the more serious issue of the kernel hanging, I'm filing
this separate PR about the latter issue.

The problem occurs on multiple architectures under both qemu (i386,
amd64, sparc) and gxemul (hpcmips).  I have also reproduced it on
physical amd64 hardware after disabling all but one CPU using cpuctl.

I currently have a hung i386 system attached to a remote kernel
debugger under qemu using the procedure documented at

  https://wiki.netbsd.org/kernel_debugging_with_qemu/

The backtrace varies as the kernel appears to be in a loop
involving multiple functions, but here's a typical one:

(gdb) bt
#0  sleepq_remove (sq=0xc20d8b98, l=0xc20e3aa0)
    at /usr/src/sys/kern/kern_sleepq.c:137
#1  0xc0bd5e0d in sleepq_unsleep (l=0xc20e3aa0, cleanup=true)
    at /usr/src/sys/kern/kern_sleepq.c:347
#2  0xc0b97b90 in cv_unsleep (l=0xc20e3aa0, cleanup=true)
    at /usr/src/sys/kern/kern_condvar.c:227
#3  0xc0bb37bb in lwp_unsleep (l=0xc20e3aa0, cleanup=true)
    at /usr/src/sys/kern/kern_lwp.c:1526
#4  0xc0bd5ad6 in sleepq_block (timo=0, catch_p=true)
    at /usr/src/sys/kern/kern_sleepq.c:259
#5  0xc0b97c9d in cv_wait_sig (cv=0xc20d8b98, mtx=0xc2674800)
    at /usr/src/sys/kern/kern_condvar.c:272
#6  0xc0bb18a6 in lwp_wait (l=0xc20e3aa0, lid=0, departed=0x0, exiting=true)
    at /usr/src/sys/kern/kern_lwp.c:648
#7  0xc0ba810b in exit_lwps (l=0xc20e3aa0) at /usr/src/sys/kern/kern_exit.c:636
#8  0xc0ba7478 in exit1 (l=0xc20e3aa0, exitcode=0, signo=1)
    at /usr/src/sys/kern/kern_exit.c:223
#9  0xc0bd4d70 in sigexit (l=0xc20e3aa0, signo=1)
    at /usr/src/sys/kern/kern_sig.c:2106
#10 0xc0bd4554 in postsig (signo=1) at /usr/src/sys/kern/kern_sig.c:1904
#11 0xc0bb3880 in lwp_userret (l=0xc20e3aa0)
    at /usr/src/sys/kern/kern_lwp.c:1562
#12 0xc0169046 in mi_userret (l=0xc20e3aa0) at /usr/src/sys/sys/userret.h:94
#13 0xc01690c1 in userret (l=0xc20e3aa0) at ./machine/userret.h:80
#14 0xc01692af in syscall (frame=0xc9398fa8)
    at /usr/src/sys/arch/x86/x86/syscall.c:168
#15 0xc01006a9 in Xsyscall ()
(gdb)

If I place a breakpoint on line 637 of kern_exit.c, it gets hit
repeatedly, but a breakpoint on line 641 is never hit:

(gdb) l
632              * behind us or there may even be new LWPs created.  Therefore, a
633              * full retry is required on error.
634              */
635             while (p->p_nlwps > 1) {
636                     if (lwp_wait(l, 0, NULL, true)) {
637                             goto retry;
638                     }
639             }
640     
641             KERNEL_LOCK(nlocks, l);
(gdb) 

The lwp_wait() call is returning -3 = ERESTART, which originates in
sleepq_sigtoerror():

#0  sleepq_sigtoerror (l=0xc20e3aa0, sig=9)
    at /usr/src/sys/kern/kern_sleepq.c:400
#1  0xc0bd5bd5 in sleepq_block (timo=0, catch_p=true)
    at /usr/src/sys/kern/kern_sleepq.c:293
#2  0xc0b97c9d in cv_wait_sig (cv=0xc20d8b98, mtx=0xc2674800)
    at /usr/src/sys/kern/kern_condvar.c:272
#3  0xc0bb18a6 in lwp_wait (l=0xc20e3aa0, lid=0, departed=0x0, exiting=true)
    at /usr/src/sys/kern/kern_lwp.c:648
#4  0xc0ba810b in exit_lwps (l=0xc20e3aa0) at /usr/src/sys/kern/kern_exit.c:636
#5  0xc0ba7478 in exit1 (l=0xc20e3aa0, exitcode=0, signo=1)
    at /usr/src/sys/kern/kern_exit.c:223

If there are other gdb commands I can run to help debug this, let me
know.

See also PR 52892, "Tests hang on MIPS", for another problem that
appeared with the same commit.

>How-To-Repeat:

Because the test case that triggers the bug has been disabled, and
another change has been committed that also keeps the test case from
triggering the bug, you need to apply the following two patches to
reproduce the issue in -current as of source date 2018.04.19.21.21.44:

--- src/tests/lib/libc/sys/t_ptrace_wait.c.orig	2018-04-15 03:19:23.000000000 +0300
+++ src/tests/lib/libc/sys/t_ptrace_wait.c	2018-04-17 10:26:17.000000000 +0300
@@ -6444,7 +6444,6 @@
 	atf_tc_expect_fail("PR kern/51995");
 
 	// Hangs with qemu
-	ATF_REQUIRE(0 && "In order to get reliable failure, abort");
 
 	SYSCALL_REQUIRE(msg_open(&fds) == 0);
 
Index: src/tests/lib/libc/sys/msg.h
diff -c src/tests/lib/libc/sys/msg.h:1.2 src/tests/lib/libc/sys/msg.h:1.1
*** src/tests/lib/libc/sys/msg.h:1.2	Tue Mar 13 16:45:36 2018
--- src/tests/lib/libc/sys/msg.h	Mon Apr  3 00:44:00 2017
***************
*** 70,76 ****
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);
  
! //	printf("Send %s\n", info);
  	rv = write(fds->pfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 70,76 ----
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);
  
! 	printf("Send %s\n", info);
  	rv = write(fds->pfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
***************
*** 88,94 ****
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);
  
! //	printf("Send %s\n", info);
  	rv = write(fds->cfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 88,94 ----
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);
  
! 	printf("Send %s\n", info);
  	rv = write(fds->cfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
***************
*** 106,112 ****
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);
  
! //	printf("Wait %s\n", info);
  	rv = read(fds->pfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 106,112 ----
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);
  
! 	printf("Wait %s\n", info);
  	rv = read(fds->pfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
***************
*** 124,130 ****
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);
  
! //	printf("Wait %s\n", info);
  	rv = read(fds->cfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 124,130 ----
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);
  
! 	printf("Wait %s\n", info);
  	rv = read(fds->cfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;

Then build an i386 release, boot it in qemu, and run the commands

# cd /usr/tests/lib/libc/sys/
# atf-run ./t_ptrace_wait4 >log 2>&1 &

Within a few minutes, the system becomes unresponsive.  You may still
get a new shell prompt if you hit enter, but attempting to run "ls"
will hang.

>Fix:

Since the problem appeared with kern_lwp.c 1.191, the obvious fix
would be to revert that commit (and its pullup).



Home | Main Index | Thread Index | Old Index