port-hppa/56864: hppa: ptrace(2) dumps core when returning an error

To: port-hppa-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: port-hppa/56864: hppa: ptrace(2) dumps core when returning an error
From: tgl%sss.pgh.pa.us@localhost
Date: Sun, 5 Jun 2022 18:00:01 +0000 (UTC)

>Number:         56864
>Category:       port-hppa
>Synopsis:       hppa: ptrace(2) dumps core when returning an error
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-hppa-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jun 05 18:00:00 +0000 2022
>Originator:     Tom Lane
>Release:        HEAD/202206030100Z
>Organization:
PostgreSQL Global Development Group
>Environment:
NetBSD sss2.sss.pgh.pa.us 9.99.97 NetBSD 9.99.97 (SD2) #0: Fri Jun  3 12:30:06 EDT 2022  tgl%nuc1.sss.pgh.pa.us@localhost:/home/tgl/netbsd-H-202206030100Z/obj.hppa/sys/arch/hppa/compile/SD2 hppa
>Description:
I find that quite a lot of the /usr/tests test cases for ptrace
fail on my HPPA 9000/360.  For example,

$ cd /usr/tests/
$ atf-run lib/libc/sys/t_ptrace_wait

fails multiple subtests with symptoms like

tc-start: 1654446830.326273, user_va0_disable_pt_syscall
[ 79367.1962729] sorry, pid 3699 was killed: orphaned traced process
tc-se:Test program crashed; attempting to get stack trace
tc-se:[New process 5631]
tc-se:Core was generated by `t_ptrace_wait'.
tc-se:Program terminated with signal SIGSEGV, Segmentation fault.
tc-se:#0  0x0000bf30 in ?? ()
tc-se:#0  0x0000bf30 in ?? ()
tc-se:#1  0xae9cf488 in __cerror () from /usr/lib/libc.so.12
tc-se:#2  0xaf81aafc in ?? () from /usr/lib/libelf.so.2
tc-se:Backtrace stopped: previous frame identical to this frame (corrupt stack?)
tc-se:Stack trace complete
tc-end: 1654446834.563686, user_va0_disable_pt_syscall, failed, Test program received signal 11 (core dumped)

Notice that it's the tracing process, not the tracee, that crashes.
Trying to debug this interactively does not work too well because
gdb itself tends to fall over, with a very similar stack trace.
I eventually found that the problem occurs when the kernel returns
an error from ptrace (all the failing test cases expect an error).
ptrace.S branches to __cerror, which attempts to call __errno,
but the PIC stub it uses computes a garbage address and branches
to never-never land.  This is evidently because r19 doesn't point
to the right GOT at that instant.  It did mere nanoseconds ago, when
ptrace.S successfully used __cerror to set errno to zero before the
kernel call.  But that PIC stub is itself replacing r19 with some
other value (i.e., the GOT passed to ptrace/__cerror is different from
the one that is passed to __errno), and nothing undoes that change to
allow the call to work a second time.
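
The sequence described above can be sketched as a toy model in C. This is only an illustration of the reasoning, not real hppa or libc code: the global `r19`, the `GOT_*` constants and the `model_*` functions are all hypothetical names, and the stub's behavior is reduced to "works only if r19 is right on entry, leaves a different value on exit".

```c
/* Toy model of the r19 problem; nothing here is real hppa or libc code. */

enum { GOT_LIBC = 1, GOT_OTHER = 2 };   /* hypothetical GOT identities */

static int r19;          /* stands in for the %r19 linkage register */
static int model_errno;  /* stands in for the caller's errno */

/*
 * Models __cerror reaching __errno through a PIC stub.  The stub only
 * works if r19 holds the libc GOT on entry, and it leaves a different
 * GOT in r19 on exit.  Returns 0 on success, -1 where the real code
 * would branch to a garbage address.
 */
static int model_cerror(int value)
{
	if (r19 != GOT_LIBC)
		return -1;
	model_errno = value;
	r19 = GOT_OTHER;
	return 0;
}

/* ptrace.S before the patch: r19 is not preserved across the first call. */
static int model_ptrace_unpatched(int kernel_error)
{
	r19 = GOT_LIBC;
	if (model_cerror(0) != 0)              /* errno = 0 before the syscall */
		return -1;
	return model_cerror(kernel_error);     /* error path: r19 is now wrong */
}

/* ptrace.S with the patch: r19 is saved and reloaded around the first call. */
static int model_ptrace_patched(int kernel_error)
{
	int saved_r19;

	r19 = GOT_LIBC;
	saved_r19 = r19;                       /* corresponds to stw %r19, ... */
	if (model_cerror(0) != 0)
		return -1;
	r19 = saved_r19;                       /* corresponds to ldw ..., %r19 */
	return model_cerror(kernel_error);
}
```

In this model `model_ptrace_unpatched()` fails on the second call while `model_ptrace_patched()` stores the error value, which mirrors the crash only showing up when the kernel returns an error.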

This is probably a pretty ancient bug, but it would only manifest if libc
is built with -fPIC, and I'm not sure how old that choice is.
>How-To-Repeat:
$ cd /usr/tests/
$ atf-run lib/libc/sys/t_ptrace_wait
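
A smaller reproducer might look like the following sketch. It is untested on hppa and rests on an assumption: that asking ptrace(2) to attach to the calling process itself is refused by the kernel, so the libc wrapper has to take its error path. The helper name `ptrace_self_attach_errno` is made up for this example.

```c
#include <sys/types.h>
#include <sys/ptrace.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Ask ptrace(2) to attach to the calling process itself, which the
 * kernel is assumed to refuse.  Returns the errno left behind by the
 * failed call, or 0 if the call unexpectedly succeeded.
 */
static int ptrace_self_attach_errno(void)
{
	errno = 0;
	if (ptrace(PT_ATTACH, getpid(), NULL, 0) == -1)
		return errno;
	return 0;
}
```

Calling this from `main()` and printing the result should show a nonzero errno on a working system. On an affected system the expectation, per the analysis above, is that the process dies inside `__cerror` before the function returns.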

>Fix:
I find that this patch allows things to work for me:

Index: lib/libc/arch/hppa/sys/ptrace.S
===================================================================
RCS file: /cvsroot/src/lib/libc/arch/hppa/sys/ptrace.S,v
retrieving revision 1.7
diff -u -r1.7 ptrace.S
--- lib/libc/arch/hppa/sys/ptrace.S     9 May 2020 08:25:33 -0000       1.7
+++ lib/libc/arch/hppa/sys/ptrace.S     5 Jun 2022 17:38:00 -0000
@@ -42,6 +42,7 @@
        stw     %arg1, HPPA_FRAME_ARG(1)(%sp)
        stw     %arg2, HPPA_FRAME_ARG(2)(%sp)
        stw     %arg3, HPPA_FRAME_ARG(3)(%sp)
+       stw     %r19, HPPA_FRAME_ARG(4)(%sp)
        ldo     HPPA_FRAME_SIZE(%sp),%sp
        bl      __cerror, %rp
         copy   %r0, %t1
@@ -50,6 +51,7 @@
        ldw     HPPA_FRAME_ARG(1)(%sp), %arg1
        ldw     HPPA_FRAME_ARG(2)(%sp), %arg2
        ldw     HPPA_FRAME_ARG(3)(%sp), %arg3
+       ldw     HPPA_FRAME_ARG(4)(%sp), %r19
        ldw     HPPA_FRAME_CRP(%sp), %rp
 
        SYSCALL(ptrace)

I make no claim that this is complete or correct, because I'm quite
unsure what the conventions around saving/restoring r19 are supposed
to be.  In particular it's not really clear to me whether __cerror
itself should be responsible for this.  However, patching it
there would affect a lot more cases and probably slow things down.
It looks like there is an intentional decision for ptrace.S to
absorb the overhead of saving/restoring things so it can call
__cerror twice, and if so having it also save r19 seems to fit.

(Digression: why does ptrace.S need to call __cerror twice in the
first place?  Couldn't it reset errno to zero *after* a successful
call, so that there's just one such call in both the success and
failure paths?  That'd presumably eliminate the need for the
save/restore logic.)
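
The two orderings in that digression can be compared with a small C sketch. It is only a model under stated assumptions: `fake_sys_ptrace` is a hypothetical stand-in for the raw kernel call, and the reason for clearing errno at all is presumed to be that some requests can legitimately return -1 as data.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the raw kernel call.  Returns 0 and stores
 * the result on success, or returns an errno value on failure.
 */
static int fake_sys_ptrace(int fail_with, long value, long *result)
{
	if (fail_with != 0)
		return fail_with;
	*result = value;
	return 0;
}

/* Ordering described in the report: clear errno first, set it again on error. */
static long ptrace_clear_before(int fail_with, long value)
{
	long result = -1;
	int error;

	errno = 0;                         /* first errno access */
	error = fake_sys_ptrace(fail_with, value, &result);
	if (error != 0) {
		errno = error;             /* second errno access */
		return -1;
	}
	return result;
}

/* Suggested ordering: one errno store after the call, on either path. */
static long ptrace_clear_after(int fail_with, long value)
{
	long result = -1;
	int error = fake_sys_ptrace(fail_with, value, &result);

	errno = error;                     /* 0 on success, error code on failure */
	return error != 0 ? -1 : result;
}
```

Both variants leave errno at zero when a successful call returns -1 as data, and both report the error code on failure. The second touches errno only once per call, so a wrapper built that way would not need to preserve state across an earlier `__cerror` call.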

BTW, I still see several failures in t_ptrace_wait; but the remaining
ones seem to have a different cause, which I've not investigated yet.
