Current-Users archive


Strange system behavior



(Resending without most of the original attachments, since that seems to have exceeded a message-size limit!)


I have a new machine in my farm, and it is exhibiting some very strange behavior.

Essentially, this machine does nothing more than run 'build.sh release' several times daily for port-amd64. (See [1] for test results.)

About half of the time, the build fails because some host utility dies with a "segmentation fault", and it almost always fails on the exact same command and at the exact same place in the build. Yet re-running the failed command interactively succeeds without any problem.

The command that fails most often is the following (manual line breaks inserted):

   /test-bed/tools/bin/nbmandoc -Thtml -Oman=../html%S/%N.html \
   -Ostyle=../style.css /test-bed/src/gnu/usr.bin/gcc4/gcc/gcc.1 \
   > gcc.html1.tmp &&  mv gcc.html1.tmp gcc.html1
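
Since the failure is intermittent, one way to measure how often it occurs outside the build is to run the command in a loop and count the non-zero exits. What follows is only a minimal sketch and not part of the build: the helper name count_failures and the run count are hypothetical, and it discards the command's output instead of writing gcc.html1.

```sh
#!/bin/sh
# Hypothetical helper: run a command repeatedly and print how many runs
# exited with a non-zero status (a crash on a signal counts as one).
# usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Example, using the paths from the command above (100 runs is arbitrary):
#   count_failures 100 /test-bed/tools/bin/nbmandoc -Thtml \
#       -Oman=../html%S/%N.html -Ostyle=../style.css \
#       /test-bed/src/gnu/usr.bin/gcc4/gcc/gcc.1
```

A count of zero from an interactive shell, against roughly half of the builds failing, would suggest the trigger lies in the build environment and not in the command itself.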

Loading the core dump into gdb produces a rather scary warning about libc not being found at the expected address:

{105} gdb /test-bed/tools/bin/nbmandoc nbmandoc.core
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64--netbsd"...(no debugging symbols found)

Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libc.so.12...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libc.so.12
Reading symbols from /usr/libexec/ld.elf_so...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/ld.elf_so
Core was generated by `nbmandoc'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7ffdb5e040 in ?? ()
warning: .dynamic section for "/test-bed/dst/usr/lib/libc.so.12" is not at the expected address
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(gdb)

The traceback is rather lengthy, so it is not included here. It is available, along with the dmesg and kernel config, at [2], [3], and [4].


My initial reaction was that there was some sort of memory failure, so I ran memtest86+ for 10 passes - no errors detected. I also added another 4GB to the machine (for a total of 8GB), but the failure persists. The CPU (an AMD Phenom-II X6 1090T at 3.2GHz) is brand new, as is the ASUS M4A88T-M motherboard.

The problem occurs on -current as of today, but also happens using a GENERIC kernel from about a month ago.

I'd appreciate any clues on how to proceed...


Test results:

[1] http://www.whooppee.com/~paul/amd64-results/

Other info:
[2] http://www.whooppee.com/~paul/mandoc-backtrace.txt
[3] http://www.whooppee.com/~paul/mandoc-dmesg.txt
[4] http://www.whooppee.com/~paul/mandoc-config.txt




-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------

