Re: kern/40594: gdb does not work on 5.0 RC2

To: ad%NetBSD.org@localhost, gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost, pooka%iki.fi@localhost
Subject: Re: kern/40594: gdb does not work on 5.0 RC2
From: David Holland <dholland-bugs%netbsd.org@localhost>
Date: Sun, 22 Feb 2009 21:45:02 +0000 (UTC)

The following reply was made to PR kern/40594; it has been noted by GNATS.

From: David Holland <dholland-bugs%netbsd.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: kern/40594: gdb does not work on 5.0 RC2
Date: Sun, 22 Feb 2009 21:41:22 +0000

 On Mon, Feb 09, 2009 at 08:15:01PM +0000, pooka%iki.fi@localhost wrote:
  > Somewhere between late 5.0_BETA and 5.0_RC (1 and 2) gdb stopped working.
  > Notably, my gdb is from Nov 2007.
  > >How-To-Repeat:
  > pain-rustique:1:~> gdb /bin/ls
  > GNU gdb 6.5
  > [snip]
  > 
  > (gdb) run
  > Starting program: /bin/ls
  > *hang*
  > 
  > >Fix:
  > It seems that executing ls ends up the "pause" wchan.  It is coming
  > from __sigsuspend14.  gdb, on the other hand, is doing wait4.
  > So I guess technically the program executed from is hanging, not gdb.

 The issue appears to be provoked by the shell spawned by gdb to start
 the inferior process; depending on what you have in your shell startup
 files the hang may or may not occur. In my case the problem seems to
 be tickled by

    setenv _UNAME `uname -s |& tr A-Z a-z`

 What seems to be happening is that the shell forks and the forked
 child shell then runs a subprocess. When the child shell exits, wait
 notifies gdb instead of the parent shell. As a result the child shell
 hangs around as a zombie, the parent shell (if *csh) blocks in
 sigsuspend waiting for a SIGCHLD it's not going to get, and gdb blocks
 in wait assuming something else is going to happen.
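
 For comparison, here is a minimal sketch of the process shape described
 above when no debugger is involved: a parent forks a child, the child
 runs and reaps its own subprocess, and the child's exit is reported to
 the parent's wait. The function name is hypothetical and Python stands
 in for the shell; this shows the expected behavior only, not the bug.

```python
import os

def run_child_with_grandchild():
    """Fork a child that forks and reaps a grandchild; return what the parent's wait reports."""
    child = os.fork()
    if child == 0:
        # Child (plays the "child shell"): run a subprocess, reap it, exit.
        grandchild = os.fork()
        if grandchild == 0:
            os._exit(0)  # grandchild (plays "tr") exits immediately
        os.waitpid(grandchild, 0)
        os._exit(0)
    # Parent (plays the "parent shell"): the child's exit should land here.
    reaped, status = os.waitpid(child, 0)
    return child, reaped, os.WEXITSTATUS(status)
```

 In the hang, the equivalent of the parent's waitpid never sees the
 child, because the notification appears to go to gdb instead.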

 In this run, process 14062 is gdb, 24361 is the parent shell (spawned
 by gdb), and 26479 is the child shell.
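
 To follow the trace below by role rather than by number, a small
 helper along these lines can label each line. This is only a reading
 aid: the column layout (pid, lwp, command, record type, rest) is
 assumed from the lines quoted here, and the function and table names
 are made up for illustration.

```python
# Roles as given in the text above; 14164 is the tr subprocess seen in the trace.
ROLES = {14062: "gdb", 24361: "parent shell", 26479: "child shell", 14164: "tr"}

def parse_trace_line(line):
    """Split one trace line into (pid, role, command, record type, rest).

    Returns None for continuation lines that do not start with a pid.
    """
    parts = line.split(None, 4)
    if len(parts) < 5 or not parts[0].isdigit():
        return None
    pid = int(parts[0])
    return (pid, ROLES.get(pid, "unknown"), parts[2], parts[3], parts[4])
```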

  24361      1 tcsh     CALL  read(8,0xbfbfad50,0x1000)
  14164      1 tr       CALL  exit(0)
  26479      1 tcsh     RET   __sigsuspend14 -1 errno 4 Interrupted system call
  26479      1 tcsh     PSIG  SIGCHLD caught handler=0x808463c mask=(2,20): code=CLD_EXITED child pid=14164, uid=32170,  status=0, utime=0, stime=0)
  26479      1 tcsh     CALL  setcontext(0xbfbf65b4)
  26479      1 tcsh     RET   write JUSTRETURN
  26479      1 tcsh     CALL  __wait450(0xffffffff,0xbfbf68a4,1,0xbfbf6854)
  26479      1 tcsh     RET   __wait450 14164/0x3754
  26479      1 tcsh     CALL  __wait450(0xffffffff,0xbfbf68a4,1,0xbfbf6854)
  26479      1 tcsh     RET   __wait450 -1 errno 10 No child processes

 So far, so good. The parent shell is sitting in read to collect the
 results from the backquotes; the child picks up the exit status of tr.
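
 The pair of wait calls above (one returning 14164, the next failing
 with "No child processes") looks like the usual non-blocking reap
 loop. A sketch of that pattern follows, assuming the third argument 1
 in the trace is WNOHANG; the function name is hypothetical and this is
 not tcsh's actual code.

```python
import os

def reap_all_children():
    """Collect every already-exited child without blocking; return their pids."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # ECHILD: no children left at all, as in the second call in the trace.
            break
        if pid == 0:
            # Children exist, but none has exited yet.
            break
        reaped.append(pid)
    return reaped
```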

  26479      1 tcsh     CALL  __sigprocmask14(3,0xbfbf6900,0)
  26479      1 tcsh     RET   __sigprocmask14 0
  26479      1 tcsh     CALL  __sigprocmask14(0,0,0x80a4738)
  26479      1 tcsh     RET   __sigprocmask14 0
  26479      1 tcsh     CALL  exit(0)

 Now the child shell exits.

  14062      1 gdb      RET   __wait450 24361/0x5f29
  14062      1 gdb      CALL  ptrace(PT_GETREGS,0x5f29,0xbfbfe2ec,0)
  14062      1 gdb      RET   ptrace 0
  14062      1 gdb      CALL  ptrace(PT_CONTINUE,0x5f29,1,0x14)
  14062      1 gdb      RET   ptrace 0

 Now gdb picks up a wait result for the *parent* shell, which has not
 exited or done anything else that should cause this. This is
 apparently the exit notification for the child shell, messed up
 somehow.

 gdb apparently shrugs and tells the parent shell to continue.
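
 When matching the ptrace arguments against the pid column, it helps
 to convert the hex values: 0x5f29 is 24361, the parent shell, and the
 last argument of the PT_CONTINUE call, 0x14, is 20. Whether 20 is
 SIGCHLD should be checked against the signal numbering on the
 affected system. The helper below is a hypothetical reading aid, not
 part of any tool.

```python
import re

def decode_hex_fields(trace_line):
    """Return {hex_text: decimal_value} for every 0x... token in a trace line."""
    return {tok: int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]+", trace_line)}
```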

  24361      1 tcsh     RET   read -1 errno 4 Interrupted system call
  24361      1 tcsh     CALL  read(8,0xbfbfad50,0x1000)
  24361      1 tcsh     GIO   fd 8 read 0 bytes
        ""
  24361      1 tcsh     RET   read 0
  24361      1 tcsh     CALL  close(8)
  24361      1 tcsh     RET   close 0

 The parent shell now drops out of read and closes its pipe...

  24361      1 tcsh     CALL  __sigprocmask14(1,0xbfbf6ca0,0xbfbf6cb0)
  24361      1 tcsh     RET   __sigprocmask14 0
  24361      1 tcsh     CALL  __sigsuspend14(0xbfbf6c90)

 ...and waits for a SIGCHLD from the child shell that it is never going
 to receive, because that exit result was misdirected above, or
 something.
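
 The block-then-wait pattern the parent shell is using can be sketched
 as follows. Python has no sigsuspend wrapper, so sigtimedwait stands
 in for it here (that substitution is mine, and sigtimedwait is not
 available on every platform); the timeout is there only so the sketch
 cannot hang. The function name is hypothetical.

```python
import os
import signal

def wait_for_child_signal(timeout_s):
    """Block SIGCHLD, fork a child that exits, then wait for the signal.

    Returns (child pid, siginfo), where siginfo is None if the wait timed out.
    """
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        # Stand-in for sigsuspend: wait until SIGCHLD arrives or the timeout expires.
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout_s)
        os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
        return pid, info
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

 If the notification is delivered somewhere else, a wait like this
 times out and returns None. A plain sigsuspend has no timeout, so it
 blocks indefinitely, which is the state the parent shell ends up in.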

  14062      1 gdb      CALL  __wait450(0xffffffff,0xbfbfe558,0,0)
  14062      1 gdb      RET   __wait450 RESTART

 and now gdb goes to sleep waiting for something to happen, which of
 course nothing will. This is where it hangs; the next thing in the
 trace is manual intervention via SIGKILL.

 I'm not sure whether the child shell is being traced (one would expect
 it to be), so it's not clear which of three things is happening: the
 wrong process is being awakened from wait, wait is reporting on the
 wrong process, or just the wrong pid is being returned. Either way,
 it's pretty clear that wait is broken somehow.

 Unfortunately, find_stopped_child() is a maze of special cases and
 it's not clear what's going on inside it.

 -- 
 David A. Holland
 dholland%netbsd.org@localhost

Follow-Ups:
- Re: kern/40594: gdb does not work on 5.0 RC2
  - From: Antti Kantee
