tech-userlevel archive


Re: NetBSD truss(1), coredumper(1) and performance bottlenecks



    Date:        Sat, 25 May 2019 02:04:13 +0200
    From:        Kamil Rytarowski <n54%gmx.com@localhost>
    Message-ID:  <4fefdf41-44fa-12f9-705d-5187732d7c95%gmx.com@localhost>


  | As far as I'm aware we can use read(2) and write(2) in pipes with longer
  | transfers than 1 byte.

Of course.  But once data has been read from a pipe we cannot go back
(as we can when reading a file).   Michael's point was that the semantics
of "read" in sh(1) are that it must leave the file pointer pointing at the
byte immediately after the terminating \n, so any other command which follows
can continue reading and obtain the next byte of data.

When reading from a pipe the only way to do that (that we currently
have, anyway) is to read 1 byte at a time, so when that one byte is the
terminating \n we have not read the next byte and it is still there
in the pipe for the next process to read.
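
To make that concrete, here is a minimal sketch (my illustration, not
the actual sh(1) code) of the loop such a read builtin is forced into
on a pipe:

#include <unistd.h>

/*
 * Read one line from fd, consuming nothing past the terminating '\n'.
 * One byte per read(2), so whatever follows the '\n' stays in the pipe
 * for whichever process reads the descriptor next.
 */
static ssize_t
read_line_bytewise(int fd, char *buf, size_t buflen)
{
	size_t n = 0;
	char c;

	while (n + 1 < buflen) {
		ssize_t r = read(fd, &c, 1);	/* one syscall per byte */
		if (r < 0)
			return -1;
		if (r == 0)
			break;			/* EOF */
		buf[n++] = c;
		if (c == '\n')
			break;			/* stop exactly at the terminator */
	}
	buf[n] = '\0';
	return (ssize_t)n;
}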

Writes into pipes aren't affected by this - what matters there is
how data from different writes by different processes might get
intermixed.   Typically a process ought to be writing one (or more)
complete "records" in a single sys call, and then, if the write is not
too big, nothing will be interspersed with it.
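
A writer that wants its records to arrive intact can lean on the usual
pipe atomicity guarantee; a small illustrative sketch (my example, not
code from any of the programs discussed):

#include <limits.h>
#include <string.h>
#include <unistd.h>

/*
 * Emit one complete record (including its terminating '\n') per
 * write(2).  Writes of at most PIPE_BUF bytes to a pipe are atomic,
 * so records this small from concurrent writers will not interleave.
 */
static int
write_record(int fd, const char *record)
{
	size_t len = strlen(record);

	if (len > PIPE_BUF)
		return -1;	/* too big: atomicity no longer guaranteed */
	return write(fd, record, len) == (ssize_t)len ? 0 : -1;
}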

The "currently have anyway" is because we could add a mechanism to
have the kernel do the filtering for us - add a way to say "nothing
beyond the next \n please" (or whatever) and then we could do bigger
reads, knowing that data that we do not want to consume would not
be made available.   That's easy as long as we stick to one-byte
terminators; it gets much messier when we start allowing for the
possibility of wider ones (because part of the terminating sequence
might have been passed back in a previous read(2) request - so the
kernel would need to remember what it had previously returned.)
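
As a userland illustration of the bookkeeping involved (purely my
sketch, the names are invented), here is a scanner for a two-byte
terminator such as \r\n arriving in chunks - it has to remember that
the previous chunk ended in '\r', which is exactly the state the
kernel would have to carry between read(2) calls:

#include <stddef.h>

struct delim_scan {
	int saw_cr;		/* did the previous chunk end in '\r'? */
};

/*
 * Return the offset just past a "\r\n" terminator in buf, or -1 if it
 * has not been seen yet.  State persists across calls, because the
 * '\r' and the '\n' may arrive in different chunks.
 */
static long
find_crlf(struct delim_scan *st, const char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (st->saw_cr && buf[i] == '\n')
			return (long)(i + 1);
		st->saw_cr = (buf[i] == '\r');
	}
	return -1;
}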

Whether any of this is worth it or not is a judgement call, or would
take implementing it and then measuring the results to decide.

  | But the real question here is what is heavy in the build infrastructure.
  | 5k times transferring 1 byte was just a potential starting point.

Yes, that's a question that should be answered - and simply counting to
see which sys call is executed most often won't get us there.  A smaller
number of more expensive sys calls might be the real issue (or one of them).


  | My observation was general that this syscall is frequently called by
  | many programs. Optimization of it can potentially change responsiveness
  | of the whole system.

Yes, gettimeofday() is very common - but we need to investigate how
to speed it up, not just presume that a mapped page is the right answer.

Using a mapped page would mean processes would only see the time as it was
last updated in the kernel - which means repeated frequent calls would get
exactly the same time, perhaps for a long time relative to computer
operations (milliseconds, or more).    As I understand the current implementation,
each time a gettimeofday() is done, the high precision clock is read, and
the current results returned, which means it is more likely than not that
two successive calls will return different results - especially when using
one of the clock reading variant syscalls that returns a timespec rather
than a timeval.
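
That difference is easy to observe; a small test program (my sketch,
nothing from the tree) that counts how often two back-to-back calls
see exactly the same value:

#include <stdio.h>
#include <time.h>

/*
 * With the syscall reading the high-precision clock on every call,
 * "same" should stay near zero; with a shared page updated only at
 * timer ticks it would be close to n.
 */
int
main(void)
{
	struct timespec a, b;
	int same = 0, n = 1000000;

	for (int i = 0; i < n; i++) {
		clock_gettime(CLOCK_REALTIME, &a);
		clock_gettime(CLOCK_REALTIME, &b);
		if (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec)
			same++;	/* successive calls returned identical times */
	}
	printf("%d of %d successive pairs identical\n", same, n);
	return 0;
}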

It may be more productive to look at the use cases - what exactly is the
requirement the application is meeting by frequently calling gettimeofday()
(or clock_gettime()).   If they don't need high res results, that might
lead to a different optimisation than would be appropriate for uses which
do - meaning that we may need to provide different APIs.
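
For instance, a caller that genuinely only needs coarse timestamps
could cache one itself and refresh it only occasionally - a trivial
sketch (the name and the refresh interval are just assumptions for
illustration, not a proposed API):

#include <time.h>

static struct timespec cached;
static unsigned ncalls;

#define REFRESH_EVERY	1000	/* refresh once per 1000 lookups */

/*
 * Return a possibly-stale timestamp, paying for the syscall only on
 * every REFRESH_EVERY-th call.  How stale it can get depends on how
 * often this is called - fine for some uses, useless for others,
 * which is the point about needing different APIs.
 */
const struct timespec *
coarse_now(void)
{
	if (ncalls++ % REFRESH_EVERY == 0)
		clock_gettime(CLOCK_REALTIME, &cached);
	return &cached;
}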

One thing I do know is that doing almost anything related to time ends
up being far more complicated than is ever believed when starting out,
and is never something to be undertaken lightly.

  | At some point of time Joyent optimized bulk builds of pkgsrc from 2 days
  | to 3 h. There are certainly low-hanging fruits in build.sh as well.

I am sure there are, but I very much doubt that build.sh itself is really
something that ought to be a target of investigation.   All it is is a wrapper
around make.   All the real work is done in make, and all that it calls.
Speeding up build.sh itself is very unlikely to change anything, unless we
can find entire runs of make that we can optimise away.

  | I'm not sure that this would be a real concern here to skip gettimeofday
  | calls in strace-like programs.

One potential solution might be to find a way to make combined syscalls,
where one user/kernel boundary crossing performs multiple syscalls.

What that would look like, I have no idea, but the expected result would
be that when a record comes back from ktrace() (or whatever the program
is using) a time value would accompany it (or whatever other syscall
result is needed).

kre


