tech-userlevel archive


Re: Common daemon models (was: getcontext()/setcontext()/makecontext() reliability?)

On Thu, 3 May 2012 19:07:17 -0500
David Young wrote:

> I'm looking into getcontext()/setcontext() because I think that its
> overhead may be way less than POSIX threads for an I/O server with
> potentially hundreds or thousands of active sessions.  Not so?
> In NetBSD, switching POSIX threads require, I thought, a user-kernel
> round trip.  I figure that's the costly part.

Unfortunately, I can't say without some benchmarking...

However, as I have some daemon-writing experience, I'll share some tips
here (sorry if they're already obvious to you):

If the service is very I/O-bound and not CPU-bound, the fastest
approach would be to maintain custom state, custom buffering and
non-blocking descriptors, with a good event loop (e.g. libevent,
kqueue; IRCDs are generally implemented this way, so are nginx and
lighttpd if I remember correctly; this is also a good model for
streaming servers, unless disk I/O causes issues and cannot be
delegated).  A single process/thread may handle many clients very
efficiently this way (with fast and easy access to shared memory
resources).  As necessary, multiple instances of this thread/process
could be used to take advantage of SMP and SMP-friendly
syscalls/network stacks.  What becomes complex at this stage is shared
state between those instances, if that's required.  If accessing the
shared data incurs too much overhead, the above method pretty much
fails, defeating all the effort put into the custom state machine and
custom I/O handling.

Using multiple user contexts (i.e. *context(3)) will cause the same
latency as the above state machine if a function takes time to return
(the scheduling is non-preemptive), and will not by itself allow
scaling with SMP.  It will neither increase performance nor stability
over the previous scenario, so user contexts could be considered
syntactic sugar, which can however be useful as an abstraction if the
state machine is complex, or to more easily migrate code to another
model...  The above custom state machine will generally outperform
them, as it avoids a full registers+stack context switch.
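
To make the non-preemptive nature of user contexts concrete, here is a
small sketch of a task that runs only when the main context resumes it
and gives control back only when it explicitly yields.  The names
task(), run_task(), STEPS and the 64 KiB stack size are assumptions made
for the example; only getcontext(), makecontext() and swapcontext()
come from the *context(3) family itself.

```c
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)  /* assumed size, must be chosen by hand */
#define STEPS      3

static ucontext_t main_ctx, task_ctx;
static char task_stack[STACK_SIZE];
static int steps_done, task_finished;

/* Cooperative task: does one step of work, then yields explicitly.
 * If a step took a long time, nothing else would run meanwhile. */
static void task(void)
{
    int i;

    for (i = 0; i < STEPS; i++) {
        steps_done++;
        swapcontext(&task_ctx, &main_ctx);  /* yield to the scheduler */
    }
    task_finished = 1;  /* returning resumes uc_link, i.e. main_ctx */
}

/* Tiny "scheduler": resumes the task until it finishes.
 * Returns the number of resumes, or -1 on error. */
static int run_task(void)
{
    int resumes = 0;

    steps_done = task_finished = 0;
    if (getcontext(&task_ctx) == -1)
        return -1;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task, 0);

    while (!task_finished) {
        if (swapcontext(&main_ctx, &task_ctx) == -1)
            return -1;
        resumes++;
    }
    return resumes;
}
```

Note that every switch saves and restores a full register set and moves
to a separate stack, which is the cost the hand-written state machine
avoids, and that the stack has a fixed size picked by the programmer.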

Where multiple processes or LWPs/pthreads really become useful is for
CPU-hungry services, or to allow parallel I/O services on OSs
supporting this (in the latter case, those are often long-running
threads/processes which can serve one or many clients each, and which
may even accept(2) in parallel).  In share-nothing scenarios scaling is
very easy here.  It's considerably more complex with shared state,
though.  One also has to consider custom interthread communication for
performance, yet this conflicts with FD-based polling, requiring hacks
such as passing messages/events via FDs, or using a user signal despite
being in the same process, to wake up a polling syscall so that it can
process a lighter-weight interthread event (a slow path which, if taken
often, completely defeats the lighter-weight interthread messaging
system).
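
One possible shape for the "events via FDs" hack mentioned above is a
pipe whose read end sits in the poll set.  In this sketch the in-memory
queue is reduced to an atomic counter, and the write(2) slow path is
only taken when the counter goes from zero to one, so a burst of events
costs a single wake-up.  struct notifier and the notifier_*() functions
are hypothetical names, and C11 <stdatomic.h> is assumed to be
available.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdatomic.h>
#include <unistd.h>

struct notifier {
    int        rfd, wfd;  /* pipe ends; rfd belongs in the poll set */
    atomic_int pending;   /* stands in for a real in-memory queue */
};

static int notifier_init(struct notifier *n)
{
    int p[2], i, fl;

    if (pipe(p) == -1)
        return -1;
    for (i = 0; i < 2; i++) {
        fl = fcntl(p[i], F_GETFL, 0);
        if (fl == -1 || fcntl(p[i], F_SETFL, fl | O_NONBLOCK) == -1) {
            close(p[0]);
            close(p[1]);
            return -1;
        }
    }
    n->rfd = p[0];
    n->wfd = p[1];
    atomic_init(&n->pending, 0);
    return 0;
}

/* Producer side: cheap increment; syscall only on the 0 -> 1 edge. */
static void notifier_post(struct notifier *n)
{
    if (atomic_fetch_add(&n->pending, 1) == 0) {
        ssize_t w = write(n->wfd, "x", 1);
        (void)w;  /* EAGAIN would mean wake-up bytes are already queued */
    }
}

/* Consumer side: in a real server rfd is polled with the client FDs. */
static int notifier_wait(struct notifier *n, int timeout_ms)
{
    struct pollfd p = { .fd = n->rfd, .events = POLLIN };

    return poll(&p, 1, timeout_ms);
}

/* Empty the pipe, then take every event posted so far. */
static int notifier_drain(struct notifier *n)
{
    char tmp[64];

    while (read(n->rfd, tmp, sizeof tmp) > 0)
        ;
    return atomic_exchange(&n->pending, 0);
}
```

Draining the pipe before resetting the counter means a late post can
cause a spurious wake-up with zero events, but never a lost one.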

A pool of processes is needed if the service requires
fork(2)/execve(2) or privilege separation, or if some
libraries/user-scripts are leaking and those processes must
occasionally be recycled (i.e. the Apache 1.3 way).  Of course, these
processes may also serve more than one client, or use threads each
serving one or more clients, either via a custom state machine or
user-threads.  Hybrid scenarios of the above-mentioned techniques are
obviously possible, with the complex issues again involving shared
state.
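
The recycling side of such a pool can be sketched as a parent that
keeps a fixed number of workers alive and replaces each one as it
exits.  This is only an illustration under simplifying assumptions:
spawn_worker(), run_pool() and example_worker() are hypothetical names,
the worker body is an empty placeholder, and the max_spawns budget
exists only so that the example terminates.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*worker_fn)(void);

/* Placeholder: a real worker would accept(2) and serve a bounded
 * number of clients here, then return so that it gets recycled. */
static void example_worker(void)
{
}

static pid_t spawn_worker(worker_fn fn)
{
    pid_t pid = fork();

    if (pid == 0) {     /* child: do the work, then exit */
        fn();
        _exit(0);
    }
    return pid;         /* parent: child's PID, or -1 on failure */
}

/* Keep up to `size` workers alive until `max_spawns` have been started
 * in total.  Returns the number of workers reaped, or -1 on error. */
static int run_pool(int size, int max_spawns, worker_fn fn)
{
    int spawned = 0, alive = 0, reaped = 0, status;

    while (spawned < max_spawns || alive > 0) {
        while (alive < size && spawned < max_spawns) {
            if (spawn_worker(fn) == -1)
                return -1;
            spawned++;
            alive++;
        }
        if (waitpid(-1, &status, 0) > 0) {  /* exited or crashed */
            alive--;
            reaped++;
        } else if (errno != EINTR) {
            return -1;
        }
    }
    return reaped;
}
```

A crashed worker is reaped and replaced exactly like one that exited
normally, which is where the robustness of this model comes from.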

Of course, often one is constrained to use a certain model because of
the systems one needs to integrate with.  Examples:

- Third party embedded interpreter
- Shared resources (especially if via a third party system)
- Privilege separation
- Interactive configuration or script reloading/reparsing/recompilation
- Need to scale to multiple servers (a servers farm)

So obviously one has to first evaluate the application's needs
before committing to a specific model...

Finally, I will not go so far as to confirm the common "premature
optimization is the root of all evil" claim, but I can share that, from
experience, I have found the process-pool method very good for much
ad-hoc development and for complex scenarios, despite it not being
especially tuned for performance on a single system, because of all the
following advantages:

- Simple to implement
- Very reliable: a client process may crash or be killed without taking
  down the service, and processes may be recycled to take care of
  resource leaks
- Access to an external shared resource such as a database server is
  common and may already slow down a server built around other models
- Multiple processes may still accept(2) and work in parallel taking
  advantage of SMP and also serving CPU-bound tasks, and having them
  accept(2) is faster than handing them an FD via an ancillary message
- Blocking I/O and polling is not an issue
- Easy to fork(2)/execve(2) or to use privilege separation
- No reentrancy issues; unlike with threads, legacy portable
  static-buffer interfaces can be used safely
- OS-guaranteed preemption
- OS-guaranteed automatic stack growth, which is not a portable
  scenario with threads
- May scale less than some of the other methods on a single system, but
  is easy to scale to a farm of servers with a frontend load-balancer
  I/O-bound proxy service (the latter can use the first described server
  model to efficiently delegate to heavier application servers)
- For configuration-reloading, the parent process may reload
  configuration/scripts and respawn child processes either
  immediately or lazily, with choice of either killing currently served
  connections or leaving existing sessions to complete cleanly (at
  reception of SIGHUP or custom control protocol message by the parent)
- Friendly to third party interpreters and libraries
- Friendly to disk I/O
- Depending on scenario, worker processes can serve many connections
  before being recycled, and the pool can be maintained at a size
  determined by configuration and average load, avoiding most of the
  exit(3)/fork(2) latency

Just food for thought, on a topic I find interesting :)

Happy hacking,
