NetBSD truss(1), coredumper(1) and performance bottlenecks

To: NetBSD Discussion List User-Level Technical <tech-userlevel%netbsd.org@localhost>
Subject: NetBSD truss(1), coredumper(1) and performance bottlenecks
From: Kamil Rytarowski <n54%gmx.com@localhost>
Date: Fri, 24 May 2019 05:40:15 +0200

( NetBSD 8.99.38 or newer required )
( these tools are experimental and can kill a traced process )

I wrote an experimental version of NetBSD truss(1), modeled after
FreeBSD truss(1).

https://github.com/krytarowski/picotrace

It was started as a fork of the FreeBSD one, but quickly became a
full-rewrite as the decoding part was specific to the FreeBSD kernel and
the ptrace(2) usage was FreeBSD-only as well. There was not much for
code sharing.

I've picked picotrace, forked and extended to mimic FreeBSD truss(1).

NetBSD truss(1) indents to implement all the features of FreeBSD
truss(1) and extend it whenever useful with additional features. At this
point of time all the features should be implemented and committed to
the tree, but not all of them are reliable. I've extended the tool to
handle SIGINFO for the counting mode (-c) and printing AUXV for tracing
& printing events mode (-x).

Example output:

1. ./truss -o /tmp/build.sh-2019-05-24.txt -axe -f -H -d -D
/usr/src/build.sh

http://netbsd.org/~kamil/truss/build.sh-2019-05-24.txt.bz2

$ bzcat build.sh-2019-05-24.txt.bz2 |head

 9267      1 0.000715337 0.000715337 mmap(0x0, 32768,
0|PROT_READ|PROT_WRITE, 0|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)=  0xf7ef7000
 9267      1 0.000897229 0.000120463 open("/etc/ld.so.conf", O_RDONLY)
              =  0 Err#2 ENOENT
 9267      1 0.000980901 0.000081141 open("/usr/local/lib/libc.so.12",
O_RDONLY)       =  0 Err#2 ENOENT
 9267      1 0.001057840 0.000074579 open("/usr/lib/libc.so.12",
O_RDONLY)             =  0x3
 9267      1 0.001135284 0.000075166 __fstat50(3, { dev=0,
mode="-r--r--r-- ", ino=12401082, nlink=1, uid=0, gid=0, rdev=-1,
atime=1558660527.138286790, mtime=1558420316.000000000,
ctime=1558425703.025156636, blksize=32768, blocks=4544})=  0
 9267      1 0.001161701 0.000024142 mmap(0x0, 4096, 0|PROT_READ,
0|MAP_SHARED, 3, 0)  =  0xf7ef6000
 9267      1 0.001192443 0.000028834 munmap(0x7f7ff7ef6000, 4096)
              =  0
 9267      1 0.001215649 0.000021266 mmap(0x0, 6373376,
0|PROT_EXEC|PROT_READ, 0|MAP_PRIVATE|MAP_ALIGNED(0x15), 3, 0)=  0xf7400000
 9267      1 0.001244366 0.000026852 mmap(0x7f7ff77df000, 69632,
0|PROT_READ|PROT_WRITE, 0|MAP_PRIVATE|MAP_FIXED, 3, 1961984)=  0xf77df000
 9267      1 0.001271844 0.000025604 mmap(0x7f7ff77f0000, 2244608,
0|PROT_READ|PROT_WRITE, 0|MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0)=
0xf77f0000

2. ./truss -c `pgrep firefox`

http://netbsd.org/~kamil/truss/firefox-count-2019-05-24.txt

truss[9968] running for 61.849759335 seconds
truss[9968] attached to child=10769 ('firefox') for 61.848690315 seconds
syscall                     seconds      calls     errors missed-sce
missed-scx
read                    0.254737948       1947          0        790
   716
write                   0.811738104       1418          0        643
   934
open                    0.137601862        134         26         77
    54
close                   0.005132063         89          0         42
    43
unlink                  0.047976024         29          0         30
     5
getpid                  3.165382267       7359          0        514
  8200
getuid                  0.000451950          7          0          5
    41


Observations:

1. We can observe in both examples that there is a performance issue in
build.sh, as something attempts to call 5000 times read(2) for 1 byte only.

I plan to extend NetBSD truss(1) to print program name next to PID number.

2. Firefox and Thunderbird and certainly other similar software calls
excessively gettimeofday() and clock_gettime(). At least around 100k
times per 1 minute, and the program spends around 30sec (cumulative time
from all LWPs in a process) in the kernel space prompting for the
current time.

__gettimeofday50        2.158735930      17989          0        963
  4141
__clock_gettime50      21.415106767      73068          0       7484
 22300


There is also a bug in the NetBSD kernel that whenever we catch more
than 1 event in the same time, one of them is registered and others
lost. This affects multi-threading tracing. This affects all debuggers
with support debugging of programs with multiple threads (and likely
multiple processes).

I've discussed with Jason Thorpe how to optimize this and it shall be
put into a dedicated thread.

I have noted few other ptrace(2) and kernel behavior issues and I put
them on my TODO list and I will work on fixing them.

The bottlenek

NetBSD coredumper(1)

Some applications mask crash signals and implement homegrown fallback
mechanisms for breakage (vim, firefox, Xorg and whatever tries to be smart).

In the past, I used to use this core(5) dumper for firefox, removing
signal handlers for interesting signals:

http://netbsd.org/~kamil/firefox-dumper/custom_sigaction.c

Now, with coredumper(1), this workaround is no longer needed. With
recent kernel changes crash signals are correctly reported to debuggers
always. regardless of signal masking in processes. I learnt during
BSDCan that people in some other OSs keep removing signal handlers
manually in order to debug bugs..


NAME
     coredumper – crash detector and core dumper

SYNOPSIS
     coredumper [-c CORENAME] [-i] [-p | command [arg ...]]

Whenever program triggers crash, coredumper(5) catches this and creates
a core file.

I plan to extend this program with functionality to print backtraces
instead of dumping core files that can be too large in some circumstances.

My recent motivation to get this program done was investigating X
crashes, but they are extra tricky as X uses privileged processes that
are harder to debug.

Full example:

http://netbsd.org/~kamil/coredumper/example-2019-05-24.txt

In practice a similar behavior to truss(1) and coredumper(1) could be
achieved with an existing debugger (especially with Python/similar
bindings), however the purpose of these new programs is to perform one
single task in an easier way to setup.

ktruss(1) is similar to truss(1), but uses kernel-based tracing.
truss(1) uses ptrace(2) and is more flexible in terms of prompting the
traced process for whatever information we are interested in a child
process.

NetBSD truss(1) and NetBSD coredumper(1) are intended to be completely
MI (except LP64 check).

For now, I'm switching back to fixing kernel ptrace(2) behavior in order
to make such tools (and consequently debuggers like GDB/LLDB) more
reliable. truss(1) and picotrace(1) exposed some new issues and helped
me to understand why tracing programs with multiple LWPs is still not
reliable.

Attachment: signature.asc
Description: OpenPGP digital signature

Follow-Ups:
- Re: NetBSD truss(1), coredumper(1) and performance bottlenecks
  - From: Robert Elz
- Re: NetBSD truss(1), coredumper(1) and performance bottlenecks
  - From: Michael van Elst

Prev by Date: Re: static_assert
Next by Date: Re: NetBSD truss(1), coredumper(1) and performance bottlenecks
Previous by Thread: static_assert
Next by Thread: Re: NetBSD truss(1), coredumper(1) and performance bottlenecks
Indexes:

Home | Main Index | Thread Index | Old Index