NetBSD-Bugs archive


Re: toolchain/57241: mips64el--netbsd-objcopy core dumps randomly



The following reply was made to PR toolchain/57241; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: gnats-bugs%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: toolchain/57241: mips64el--netbsd-objcopy core dumps randomly
Date: Tue, 23 Jul 2024 03:22:57 +0000

 We dug into this the other day.
 
 It's not actually mips64el--netbsd-objcopy that's crashing -- it's
 mips64el--netbsd-install that's crashing, according to all the logs I
 can find.  (The core file name is truncated to 16 characters by struct
 proc::p_comm, hence `mips64el--netbsd', and there is an adjacent
 warning from objcopy which appears to be unrelated.  Perhaps we should
 have a separate PR to track the problem of unclear core provenance.)
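 
 To illustrate why the core name alone cannot tell the two tools
 apart, here is a minimal sketch in Python.  MAXCOMLEN and
 core_name_prefix are names assumed for the illustration, with the
 limit taken from the 16 characters mentioned above; this is not the
 kernel's code.

```python
# Assumed limit, matching the 16 characters stated in the message.
MAXCOMLEN = 16


def core_name_prefix(progname):
    """Return the program name as it would survive truncation."""
    return progname[:MAXCOMLEN]


# Both cross tools collapse to the same truncated name.
tools = ["mips64el--netbsd-objcopy", "mips64el--netbsd-install"]
prefixes = {core_name_prefix(t) for t in tools}
```

 Any tool whose name starts with the same 16-character toolchain
 prefix collapses to the same truncated name, so the core file name
 by itself says nothing about which one crashed.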
 
 I examined a core dump, with debug data, from one of the crashes, and
 it crashed here:
 
 #0  0x000000000040c631 in be32dec (buf=0x72940e194400)
     at /usr/include/sys/endian.h:221
 221     __GEN_ENDIAN_DEC(32, be)
 (gdb) bt
 #0  0x000000000040c631 in be32dec (buf=0x72940e194400)
     at /usr/include/sys/endian.h:221
 #1  0x000000000040c758 in SHA256_Transform (context=0x7f7fff52bd50,
     data=0x72940e194400)
     at /home/source/ab/HEAD/src/tools/compat/../../common/lib/libc/hash/sha2/sha2.c:388
 #2  0x000000000040cbd8 in SHA256_Update (context=0x7f7fff52bd50,
     data=0x72940e194400 <error: Cannot access memory at address 0x72940e194400>, len=2535104)
     at /home/source/ab/HEAD/src/tools/compat/../../common/lib/libc/hash/sha2/sha2.c:487
 #3  0x00000000004049b5 in copy (from_fd=7, from_name=0x7f7fff53d70a "ipftest",
     to_fd=6,
     to_name=0x7f7fff53bf30 "/home/builds/ab/HEAD/evbmips-mips64el/202303111730Z-dest/usr/sbin/ipftest.inst.sUVfSU", size=3179200)
     at /home/source/ab/HEAD/src/tools/binstall/../../usr.bin/xinstall/xinstall.c:927
 #4  0x000000000040448e in install (from_name=0x7f7fff53d70a "ipftest",
     to_name=0x7f7fff53bf30 "/home/builds/ab/HEAD/evbmips-mips64el/202303111730Z-dest/usr/sbin/ipftest.inst.sUVfSU", flags=0)
     at /home/source/ab/HEAD/src/tools/binstall/../../usr.bin/xinstall/xinstall.c:745
 #5  0x00000000004038ea in main (argc=2, argv=0x7f7fff53cf38)
     at /home/source/ab/HEAD/src/tools/binstall/../../usr.bin/xinstall/xinstall.c:434
 
 This occurs when xinstall computes the SHA-256 hash of the file it's
 installing, which it has just mmapped for reading, here:
 
 https://nxr.netbsd.org/xref/src/usr.bin/xinstall/xinstall.c?r=1.126#927
 
 The crash is SIGBUS, which on x86 most likely means that the mmapped
 file was truncated while it was being read.
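 
 The failure mode can be reproduced in isolation.  What follows is a
 minimal sketch in Python, not xinstall's code: a child process maps
 a scratch file, the file is truncated underneath it (standing in for
 another process rewriting it), and the child then touches the
 mapping.  The names CHILD and read_after_truncate are made up for
 the illustration.  On Linux and the BSDs I would expect the child to
 be killed by SIGBUS, but the sketch only reports the exit status
 rather than assuming which signal it is.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Child program: map the file, truncate it, then read the mapping.
CHILD = textwrap.dedent("""
    import mmap, os, sys
    fd = os.open(sys.argv[1], os.O_RDWR)
    m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    os.ftruncate(fd, 0)   # stand-in for another process rewriting the file
    m[0]                  # touch a page that no longer has backing store
""")


def read_after_truncate():
    """Run the child on a scratch file and return its exit status."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 65536)
        path = f.name
    try:
        return subprocess.run([sys.executable, "-c", CHILD, path]).returncode
    finally:
        os.unlink(path)


if __name__ == "__main__":
    # A negative value means the child was killed by that signal number.
    print(read_after_truncate())
```

 The child runs in a separate process so that the signal kills only
 the child and the parent can inspect how it died.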
 
 This is probably a consequence of our kooky bsd.prog.mk rules for
 handling debug data, where:
 
 1. the rule for the program `foo' is to link foo with debug data
 2. the rule for `foo.debug' is to copy the debug data out, and then
    strip debug data out of foo _in place_
 
 This is a bug on its own and we should fix it, like we fixed it in
 bsd.lib.mk -- have one rule to generate foo.full with everything, a
 separate rule to derive debug data from it in foo.debug, and a third
 rule to derive the stripped program from it in foo.
 
 This could explain the crashes by the following chain of events:
 
 (a) dependall depends on foo so it builds foo
 (b) dependall depends on foo.debug which depends on foo so it happens
     later but, by rewriting foo in place, updates foo's mtime
 (c) install depends on ${DESTDIR}/usr/bin/foo and
     ${DESTDIR}/usr/libdata/debug/usr/bin/foo.debug so it builds them
     in parallel:
      i. ${DESTDIR}/usr/bin/foo depends on foo which looks up-to-date,
         so make runs install
     ii. ${DESTDIR}/usr/libdata/debug/usr/bin/foo.debug depends on
         foo.debug, _which looks out-of-date because of foo's mtime_,
         so it runs the foo.debug recipe again which rewrites foo in
         place again
 
 Thus, (c)(i) runs install in parallel with (c)(ii) which may truncate
 foo (in the process of rebuilding foo.debug).
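 
 The timestamp logic behind steps (b) and (c)(ii) can be checked with
 a toy model.  This is a sketch in Python under stated assumptions: a
 logical clock stands in for mtimes, the Tree class and both
 functions are hypothetical names, and it models only the chain of
 events described above, not the actual bsd.prog.mk recipes or make's
 implementation.

```python
class Tree:
    """Toy file tree: a logical clock stands in for file mtimes."""

    def __init__(self):
        self.clock = 0
        self.mtime = {}

    def write(self, path):
        # Creating or rewriting a file gives it a newer timestamp.
        self.clock += 1
        self.mtime[path] = self.clock

    def out_of_date(self, target, deps):
        # A target is stale if it is missing or older than any dependency.
        if target not in self.mtime:
            return True
        return any(self.mtime[d] > self.mtime[target] for d in deps)


def in_place_rules():
    """foo.debug's recipe strips foo in place after copying debug data."""
    t = Tree()
    t.write("foo")        # link foo with debug data
    t.write("foo.debug")  # copy the debug data out of foo
    t.write("foo")        # strip foo in place, bumping its mtime
    return t.out_of_date("foo.debug", ["foo"])


def three_rule_scheme():
    """foo.full is built once; foo.debug and foo are both derived from it."""
    t = Tree()
    t.write("foo.full")
    t.write("foo.debug")
    t.write("foo")
    return (t.out_of_date("foo.debug", ["foo.full"]),
            t.out_of_date("foo", ["foo.full"]))
```

 In this model in_place_rules() reports foo.debug as stale right
 after it was built, which is what sends make back into the foo.debug
 recipe during install, while three_rule_scheme() leaves both derived
 files up to date because nothing rewrites their common source.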
 
 But it's weird that this only happens on mips64 -- not even mipsn64,
 it seems.  And, from the records available to me, it's happened during
 the install phase of:
 
 - external/bsd/ipf/bin/ipftest (3x)
 - usr/sbin/crash (1x)
 - usr/bin/systat (1x)
 
 Three times in ipftest is pretty suspicious.
 
 One feature that these directories have in common -- which is also
 peculiar to mips64 builds, other than sgimips -- is the use of
 compat/exec.mk:
 
 https://nxr.netbsd.org/xref/src/compat/exec.mk?r=1.7
 
 That is, on mips64 builds, which normally use the n32 ABI, these
 programs are built with the 64-bit ABI instead.
 
 I don't have a theory for how compat/exec.mk could substantially raise
 the probability of races in the bsd.prog.mk foo/foo.debug rules, but
 the evidence suggests something about it does.
 
 So while we should obviously fix bsd.prog.mk, like we fixed bsd.lib.mk
 already, I suspect there's something else afoot with compat/exec.mk
 that we need to understand too.  And until we've spent some more time
 to diagnose compat/exec.mk, and ideally figured that out, I think I'd
 like to leave the bsd.prog.mk bug in so we don't paper over the
 symptoms.
 

