NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

bin/47524: make(1) loops in Job_CatchOutput()



>Number:         47524
>Category:       bin
>Synopsis:       make(1) loops in Job_CatchOutput()
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Feb 02 14:15:01 +0000 2013
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current source date 2013.01.29.01.05.57
>Organization:
>Environment:
System: NetBSD 6.0 amd64 (build host)
Architecture: amd64
Machine: amd64
>Description:

I have been running automated builds of NetBSD/i386 several times per
day for a number of years now.  The current build host is a quad-core
NetBSD/amd64 6.0 system, building with -j 4.

Recently, one of these builds hung with two make(1) processes stuck in
endless loops.  They have now used more than 24 hours of CPU each and
are still running as I write this.

Running ktruss against the looping processes shows them making
repeated calls to read() that always returned EAGAIN.  Attaching
with gdb shows them looping in Job_CatchOutput() in job.c.

Since this is the first time I have seen this particular failure in
many thousands of builds, I suspect it was caused by a recent change.
On Jan 26, usr.bin/make/job.c 1.65 was committed, containing the
following change to Job_CatchOutput() (among other changes):

@@ -2046,7 +2050,8 @@
     if (nready < 0 || readyfd(&childExitJob)) {
        char token = 0;
        nready -= 1;
-       (void)read(childExitJob.inPipe, &token, 1);
+       while (read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN)
+               continue;
        if (token == DO_JOB_RESUME[0])
            /* Complete relay requested from our SIGCONT handler */
            JobRestartJobs();

I believe the "while" loop added by this change is where make is now
looping.

Some highlights from a transcript of the debugging session follow.

   $ ps -glaxw|grep make
   101 10825 22842     0  85  0 10804  1176 pipe_rd S+   ttyp0     0:00.00 grep 
make 
   101 26878 22842     0  43  0 30680  7528 -       T    ttyp0     0:06.34 gdb 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake 
   101  2312  7193 34745  17 10  7704  2268 -       ON   pts/3  1648:33.01 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_ 
DISTRIBUTION_DONE release 
   101 20821  9556     0  80 10 94744 78508 pipe_wr IN   pts/3     0:01.71 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake 
   101 26400 28328 34745  17 10  7704  1976 -       ON+  pts/3  1650:08.03 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release 

   $ top
   load averages:  1.99,  1.97,  1.92;               up 27+19:38:26      
13:31:14
   77 processes: 66 sleeping, 3 stopped, 4 zombie, 4 on CPU
   CPU0 states:  0.0% user,  0.0% nice,  2.2% system, 28.9% interrupt, 68.8% 
idle
   CPU1 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% 
idle
   CPU2 states:  0.0% user, 23.2% nice, 76.8% system,  0.0% interrupt,  0.0% 
idle
   CPU3 states:  0.0% user, 22.4% nice, 77.6% system,  0.0% interrupt,  0.0% 
idle
   Memory: 8702M Act, 4272M Inact, 48K Wired, 27M Exec, 8290M File, 81M Free
   Swap: 16G Total, 1917M Used, 14G Free

     PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
   26400 gson      15   10  7704K 1976K CPU/2     27.7H 99.02% 99.02% nbmake
    2312 gson      15   10  7704K 2268K CPU/3     27.7H 99.02% 99.02% nbmake
   [...]

   $ ktruss -p 26400
    26400      1 nbmake   read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
    26400      1 nbmake   read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
    26400      1 nbmake   read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
   [...]

   $ proctree
   [...]
    | |           \-+- 19473 gson time ./build.sh -j 4 -D 
/tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R 
/bracket/i386/test/2013.01.29.01.05.57/release -T 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O 
/tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U 
release iso-image 
    | |             \-+- 28328 gson sh ./build.sh -j 4 -D 
/tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R 
/bracket/i386/test/2013.01.29.01.05.57/release -T 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O 
/tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U 
release iso-image 
    | |               \-+- 26400 gson 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release 
    | |                 \-+= 07193 gson sh 
    | |                   \-+- 02312 gson 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_ 
DISTRIBUTION_DONE release 
    | |                     \-+= 09556 gson sh 
    | |                       \-+- 20821 gson 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake 
    | |                         |--= 12759 gson (sh)
    | |                         |--= 21395 gson (sh)
    | |                         |--= 21444 gson (sh)
    | |                         \--= 29829 gson (sh)
   [...]

   $ gdb /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
   GNU gdb (GDB) 7.3.1
   Copyright (C) 2011 Free Software Foundation, Inc.
   License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
   This is free software: you are free to change and redistribute it.
   There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
   and "show warranty" for details.
   This GDB was configured as "x86_64--netbsd".
   For bug reporting instructions, please see:
   <http://www.gnu.org/software/gdb/bugs/>...
   Reading symbols from 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake...(no debugging 
symbols found)...done.
   (gdb) attach 2312
   Attaching to program: 
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake, process 2312
   Reading symbols from /usr/lib/libc.so.12...(no debugging symbols 
found)...done.
   Loaded symbols for /usr/lib/libc.so.12
   Reading symbols from /usr/libexec/ld.elf_so...(no debugging symbols 
found)...done.
   Loaded symbols for /usr/libexec/ld.elf_so
   0x00007f7ff783912a in read () from /usr/lib/libc.so.12
   (gdb) where
   #0  0x00007f7ff783912a in read () from /usr/lib/libc.so.12
   #1  0x000000000040afad in Job_CatchOutput ()
   #2  0x000000000040ff2c in Make_Run ()
   #3  0x000000000040e729 in main ()

>How-To-Repeat:

Do lots of builds until one of them hangs.

Alternatively, since the processes are still running, I may be able to
perform additional diagnostics on request, if asked quickly enough.

>Fix:



Home | Main Index | Thread Index | Old Index