NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
bin/47524: make(1) loops in Job_CatchOutput()
>Number: 47524
>Category: bin
>Synopsis: make(1) loops in Job_CatchOutput()
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Feb 02 14:15:01 +0000 2013
>Originator: Andreas Gustafsson
>Release: NetBSD-current source date 2013.01.29.01.05.57
>Organization:
>Environment:
System: NetBSD 6.0 amd64 (build host)
Architecture: amd64
Machine: amd64
>Description:
I have been running automated builds of NetBSD/i386 several times per
day for a number of years now. The current build host is a quad-core
NetBSD/amd64 6.0 system, building with -j 4.
Recently, one of these builds hung with two make(1) processes stuck in
endless loops. They have now used more than 24 hours of CPU each and
are still running as I write this.
Running ktruss against the looping processes shows them making
repeated calls to read() that always returned EAGAIN. Attaching
with gdb shows them looping in Job_CatchOutput() in job.c.
Since this is the first time I have seen this particular failure in
many thousands of builds, I suspect it was caused by a recent change.
On Jan 26, usr.bin/make/job.c 1.65 was committed, containing the
following change to Job_CatchOutput() (among other changes):
@@ -2046,7 +2050,8 @@
if (nready < 0 || readyfd(&childExitJob)) {
char token = 0;
nready -= 1;
- (void)read(childExitJob.inPipe, &token, 1);
+ while (read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN)
+ continue;
if (token == DO_JOB_RESUME[0])
/* Complete relay requested from our SIGCONT handler */
JobRestartJobs();
I believe the "while" loop added by this change is where make is now
looping.
Some highlights from a transcript of the debugging session follow.
$ ps -glaxw|grep make
101 10825 22842 0 85 0 10804 1176 pipe_rd S+ ttyp0 0:00.00 grep
make
101 26878 22842 0 43 0 30680 7528 - T ttyp0 0:06.34 gdb
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
101 2312 7193 34745 17 10 7704 2268 - ON pts/3 1648:33.01
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_
DISTRIBUTION_DONE release
101 20821 9556 0 80 10 94744 78508 pipe_wr IN pts/3 0:01.71
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
101 26400 28328 34745 17 10 7704 1976 - ON+ pts/3 1650:08.03
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release
$ top
load averages: 1.99, 1.97, 1.92; up 27+19:38:26
13:31:14
77 processes: 66 sleeping, 3 stopped, 4 zombie, 4 on CPU
CPU0 states: 0.0% user, 0.0% nice, 2.2% system, 28.9% interrupt, 68.8%
idle
CPU1 states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100%
idle
CPU2 states: 0.0% user, 23.2% nice, 76.8% system, 0.0% interrupt, 0.0%
idle
CPU3 states: 0.0% user, 22.4% nice, 77.6% system, 0.0% interrupt, 0.0%
idle
Memory: 8702M Act, 4272M Inact, 48K Wired, 27M Exec, 8290M File, 81M Free
Swap: 16G Total, 1917M Used, 14G Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
26400 gson 15 10 7704K 1976K CPU/2 27.7H 99.02% 99.02% nbmake
2312 gson 15 10 7704K 2268K CPU/3 27.7H 99.02% 99.02% nbmake
[...]
$ ktruss -p 26400
26400 1 nbmake read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
26400 1 nbmake read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
26400 1 nbmake read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
[...]
$ proctree
[...]
| | \-+- 19473 gson time ./build.sh -j 4 -D
/tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R
/bracket/i386/test/2013.01.29.01.05.57/release -T
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O
/tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U
release iso-image
| | \-+- 28328 gson sh ./build.sh -j 4 -D
/tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R
/bracket/i386/test/2013.01.29.01.05.57/release -T
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O
/tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U
release iso-image
| | \-+- 26400 gson
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release
| | \-+= 07193 gson sh
| | \-+- 02312 gson
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_
DISTRIBUTION_DONE release
| | \-+= 09556 gson sh
| | \-+- 20821 gson
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
| | |--= 12759 gson (sh)
| | |--= 21395 gson (sh)
| | |--= 21444 gson (sh)
| | \--= 29829 gson (sh)
[...]
$ gdb /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64--netbsd".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake...(no debugging
symbols found)...done.
(gdb) attach 2312
Attaching to program:
/tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake, process 2312
Reading symbols from /usr/lib/libc.so.12...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libc.so.12
Reading symbols from /usr/libexec/ld.elf_so...(no debugging symbols
found)...done.
Loaded symbols for /usr/libexec/ld.elf_so
0x00007f7ff783912a in read () from /usr/lib/libc.so.12
(gdb) where
#0 0x00007f7ff783912a in read () from /usr/lib/libc.so.12
#1 0x000000000040afad in Job_CatchOutput ()
#2 0x000000000040ff2c in Make_Run ()
#3 0x000000000040e729 in main ()
>How-To-Repeat:
Do lots of builds until one of them hangs.
Alternatively, since the processes are still running, I may be able to
perform additional diagnostics on request, if asked quickly enough.
>Fix:
Home |
Main Index |
Thread Index |
Old Index