NetBSD-Bugs archive


Re: toolchain/57241: mips64el--netbsd-objcopy core dumps randomly



We dug into this the other day.

It's not actually mips64el--netbsd-objcopy that's crashing -- it's
mips64el--netbsd-install that's crashing, according to all the logs I
can find.  (The core file name is truncated to 16 characters by struct
proc::p_comm, hence `mips64el--netbsd', and there is an adjacent
warning from objcopy which appears to be unrelated.  Perhaps we should
have a separate PR to track the problem of unclear core provenance.)
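
To make the ambiguity concrete, here is a minimal sketch of the
truncation.  The 16-character limit is the one described above; the
helper name core_name_prefix is made up for illustration and is not
kernel code.

```python
# Assumption: only the first 16 characters of the command name survive,
# as described for struct proc::p_comm above.
COMM_LIMIT = 16


def core_name_prefix(program: str, limit: int = COMM_LIMIT) -> str:
    """Return the part of the program name that survives truncation."""
    return program[:limit]


tools = ["mips64el--netbsd-objcopy", "mips64el--netbsd-install"]
prefixes = {tool: core_name_prefix(tool) for tool in tools}

if __name__ == "__main__":
    for tool, prefix in prefixes.items():
        print(f"{tool} -> {prefix}")
```

Both tool names collapse to the same prefix, so the core file name
alone cannot say which program crashed.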

I reviewed a core dump from one of the crashes, and it crashed here:

#0  0x000000000040c631 in be32dec (buf=0x72940e194400)
    at /usr/include/sys/endian.h:221
221     __GEN_ENDIAN_DEC(32, be)
(gdb) bt
#0  0x000000000040c631 in be32dec (buf=0x72940e194400)
    at /usr/include/sys/endian.h:221
#1  0x000000000040c758 in SHA256_Transform (context=0x7f7fff52bd50,
    data=0x72940e194400)
    at /home/source/ab/HEAD/src/tools/compat/../../common/lib/libc/hash/sha2/sha2.c:388
#2  0x000000000040cbd8 in SHA256_Update (context=0x7f7fff52bd50,
    data=0x72940e194400 <error: Cannot access memory at address 0x72940e194400>, len=2535104)
    at /home/source/ab/HEAD/src/tools/compat/../../common/lib/libc/hash/sha2/sha2.c:487
#3  0x00000000004049b5 in copy (from_fd=7, from_name=0x7f7fff53d70a "ipftest",
    to_fd=6,
    to_name=0x7f7fff53bf30 "/home/builds/ab/HEAD/evbmips-mips64el/202303111730Z-dest/usr/sbin/ipftest.inst.sUVfSU", size=3179200)
    at /home/source/ab/HEAD/src/tools/binstall/../../usr.bin/xinstall/xinstall.c:927
#4  0x000000000040448e in install (from_name=0x7f7fff53d70a "ipftest",
    to_name=0x7f7fff53bf30 "/home/builds/ab/HEAD/evbmips-mips64el/202303111730Z-dest/usr/sbin/ipftest.inst.sUVfSU", flags=0)
    at /home/source/ab/HEAD/src/tools/binstall/../../usr.bin/xinstall/xinstall.c:745
#5  0x00000000004038ea in main (argc=2, argv=0x7f7fff53cf38)
    at /home/source/ab/HEAD/src/tools/binstall/../../usr.bin/xinstall/xinstall.c:434

This occurs when xinstall computes the SHA-256 hash of the file it's
installing, which it has just mmapped for reading, here:

https://nxr.netbsd.org/xref/src/usr.bin/xinstall/xinstall.c?r=1.126#927

The crash is SIGBUS, which on x86 almost certainly means that the
mmapped file was truncated while it was being read.
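
A minimal sketch of that failure mode, assuming a POSIX host where
touching a mapped page with no backing file data raises SIGBUS.  This
is not xinstall's code: the child script, the file size, and the name
signal_after_truncate are all illustrative.  The read happens in a
child process so that the signal does not kill the caller.

```python
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child script: map a file, truncate it behind the mapping, then read.
CHILD = textwrap.dedent("""
    import mmap, os, sys
    path = sys.argv[1]
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        os.truncate(path, 0)   # stands in for the in-place rewrite
        mapped[0]              # touch a page that has lost its backing
""")


def signal_after_truncate(size: int = 1 << 16) -> int:
    """Return the signal number that killed the child, or 0 if it exited."""
    with tempfile.NamedTemporaryFile(suffix=".bin") as tmp:
        tmp.write(b"\0" * size)
        tmp.flush()
        proc = subprocess.run([sys.executable, "-c", CHILD, tmp.name])
    return -proc.returncode if proc.returncode < 0 else 0


if __name__ == "__main__":
    killed_by = signal_after_truncate()
    print("child killed by SIGBUS:", killed_by == signal.SIGBUS)
```

Here the truncation happens before the read, so the fault is
deterministic.  In the build, the truncation would come from another
process at an arbitrary point during the hash, which fits a crash that
only happens some of the time.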

This is probably a consequence of our kooky bsd.prog.mk rules for
handling debug data, where:

1. the rule for the program `foo' is to link foo with debug data
2. the rule for `foo.debug' is to copy the debug data out, and then
   strip debug data out of foo _in place_

This is a bug on its own and we should fix it, like we fixed it in
bsd.lib.mk -- have one rule to generate foo.full with everything, a
separate rule to derive debug data from it in foo.debug, and a third
rule to derive the stripped program from it in foo.

This could explain the crashes by the following chain of events:

(a) dependall depends on foo so it builds foo
(b) dependall depends on foo.debug which depends on foo so it happens
    later but, by rewriting foo in place, updates foo's mtime
(c) install depends on ${DESTDIR}/usr/bin/foo and
    ${DESTDIR}/usr/libdata/debug/usr/bin/foo.debug so it builds them
    in parallel:
     i. ${DESTDIR}/usr/bin/foo depends on foo which looks up-to-date,
        so make runs install
    ii. ${DESTDIR}/usr/libdata/debug/usr/bin/foo.debug depends on
        foo.debug, _which looks out-of-date because of foo's mtime_,
        so it runs the foo.debug recipe again which rewrites foo in
        place again

Thus, (c)(i) runs install in parallel with (c)(ii) which may truncate
foo (in the process of rebuilding foo.debug).
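
The mtime part of this chain can be sketched with a logical clock.
This is a toy model, not make(1) itself: is_stale and run are
hypothetical helpers, and every write simply gets the next integer
timestamp.  Real timestamps have finite resolution, so the model only
shows the ordering, not how often the race would be hit.

```python
def is_stale(mtimes, target, dep):
    """make-style check: stale if the target is missing or older than dep."""
    return target not in mtimes or mtimes[dep] > mtimes[target]


def run(recipes):
    """Run recipes in order; each file written gets the next timestamp."""
    mtimes, clock = {}, 0
    for outputs in recipes:
        for name in outputs:
            clock += 1
            mtimes[name] = clock
    return mtimes


# In-place layout: the foo.debug recipe writes foo.debug, then rewrites foo.
in_place = run([["foo"], ["foo.debug", "foo"]])

# Three-rule layout: foo.full is written once and only read afterwards.
three_rule = run([["foo.full"], ["foo.debug"], ["foo"]])

stale_in_place = is_stale(in_place, "foo.debug", "foo")
stale_three_rule = (is_stale(three_rule, "foo.debug", "foo.full")
                    or is_stale(three_rule, "foo", "foo.full"))

if __name__ == "__main__":
    print("in-place layout leaves foo.debug stale:", stale_in_place)
    print("three-rule layout leaves a stale target:", stale_three_rule)
```

In the in-place layout foo always ends up newer than foo.debug, so a
later pass sees foo.debug as out of date and reruns the recipe that
rewrites foo.  In the three-rule layout nothing is modified after its
dependents are built.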

But it's weird that this only happens on mips64 -- not even mipsn64,
it seems.  And, from the records available to me, it's happened during
the install phase of:

- external/bsd/ipf/bin/ipftest (3x)
- usr.sbin/crash (1x)
- usr.bin/systat (1x)

Three times in ipftest is pretty suspicious.

One feature that these directories have in common -- which is also
peculiar to mips64 builds, other than sgimips -- is the use of
compat/exec.mk:

https://nxr.netbsd.org/xref/src/compat/exec.mk?r=1.7

That is, on mips64 builds, which normally use the n32 ABI, these
programs are built with the 64-bit ABI instead.

I don't have a theory for how compat/exec.mk could substantially raise
the probability of races in the bsd.prog.mk foo/foo.debug rules, but
the evidence suggests something about it does.

So while we should obviously fix bsd.prog.mk, like we fixed bsd.lib.mk
already, I suspect there's something else afoot with compat/exec.mk
that we need to understand too.  And until we've spent some more time
diagnosing compat/exec.mk, and ideally figured it out, I think I'd
like to leave the bsd.prog.mk bug in so we don't paper over the
symptoms.

