current-users: Re: increasing FD_SETSIZE to 1024 or 2048?

Subject: Re: increasing FD_SETSIZE to 1024 or 2048?
To: Jonathan Stone <jonathan@DSG.Stanford.EDU>
From: Todd Vierling <tv@pobox.com>
List: current-users
Date: 07/04/2000 17:38:13
On Tue, 4 Jul 2000, Jonathan Stone wrote:

: >I can understand that named is a possible exception, and again, you can (or
: >we can, in the base source tree) modify it such that it redefines FD_SETSIZE
: >to a larger number for its own use.
: 
: Yep. I suggested that already.  Jason seems to buy it, though for
: reasons I don't understand, he prefers to conceal that in public.

This was probably more an oversight than an attack.  (It's been mentioned by
somebody else, so why rehash it again, etc.)

: That works when you are *compiling applications*. It does not work
: when you are using *precompiled binaries* (application or library).
: There, the issue is what *NetBSD* does.

Sort of.

NetBSD doesn't ship any precompiled libraries that use select().  
(libresolv actually uses poll(), which has no limit imposed by FD_SETSIZE.)

That leaves application binaries using select() and third-party software.  
The majority of third-party software doesn't need a larger FD_SETSIZE, so
compiling with the default is just dandy.  And binaries running under Linux
emulation aren't affected by the NetBSD define, either.

So, the only problem case is a vendor shipping NetBSD-native binaries that
depend on lots of open files, that use select() instead of poll(), and whose
authors were too boneheaded to define their own FD_SETSIZE before including
the system headers, as <sys/types.h> allows.
IMNSHO, the better solution to this is to bug-report the vendor.
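For reference, the per-application override is a define placed ahead of
every system header.  A minimal sketch, assuming a BSD-style <sys/types.h>
that only supplies its default FD_SETSIZE when the application has not
already defined one (2048 is just an illustrative value).  glibc, for one,
ignores the override: its header redefines the macro, normally with a
compiler warning, so the helpers below report what the headers actually
provided.  fdset_capacity_bits() and fd_fits_in_fdset() are hypothetical
names.

```c
/*
 * Assumption: BSD-style headers that honour an application-defined
 * FD_SETSIZE.  The define must precede every system header, otherwise
 * fd_set has already been sized with the default.
 */
#define FD_SETSIZE 2048

#include <limits.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/select.h>

/* Hypothetical helper: descriptor bits this build's fd_set really holds. */
static size_t fdset_capacity_bits(void)
{
    return sizeof(fd_set) * CHAR_BIT;
}

/* Hypothetical helper: nonzero if fd may safely be passed to FD_SET(). */
static int fd_fits_in_fdset(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```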

: >There simply is no need to make everyone else spend more time copying mostly
: >empty fd_set's when the application has a Very Simple way to increase its
: >own pool.
: 
: *Shrug*. I would sooner bump NetBSD's default FD_SETSIZE, to something
: more in line with what, say, Linux does. The costs on Linux are widely
: accepted;

...because most Linux systems are higher-powered x86 boxes.

: I hope nobody from ex-CSRG is following this, but my guess is that the
: limit of 256 was a seat-of-the-pants estimate of a reasonable tradeoff
: at the time.  That was 10 years ago.  You can do a lot more with a
: desktop workstation now.

...because most *ix systems are higher-powered x86 boxes?

First consider that the average select() operation consists of a code block
similar to the following:

  fd_set rfds[2], efds[2];

  FD_ZERO(&rfds[0]);
  FD_ZERO(&efds[0]);

  FD_SET(fd1, &rfds[0]);
  FD_SET(fd1, &efds[0]);
  FD_SET(fd2, &rfds[0]);
  FD_SET(fd2, &efds[0]);

  rfds[1] = rfds[0]; // XXX see below
  efds[1] = efds[0];

  while ((n = select(max(fd1, fd2) + 1, &rfds[1], NULL, &efds[1], NULL)) > 0) {
    if (FD_ISSET(fd1, &rfds[1])) { // rfds[1] holds select()'s result
      ...
    }

    rfds[1] = rfds[0]; // XXX see below
    efds[1] = efds[0];
  }

Note the parts marked "XXX see below".  Because select() overwrites the
fd_sets it is passed, leaving only the ready descriptors set, they have to be
recreated before the next select() call.  Unlike the select() syscall, which
only scans descriptors up to max(fd1, fd2), these struct copies are not
bounded by the highest descriptor in use; each one copies _everything_, all
FD_SETSIZE bits (32 bytes per set at the current 256, 256 bytes at 2048).
Sure, you could FD_SET() each file descriptor again, but that doesn't scale
for programs that manipulate more than a couple of descriptors.
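A self-contained sketch of the same master/working-copy pattern, reduced to
a single select() call.  Assumptions: two pipes stand in for network
sockets, a one-second timeout keeps the demo from hanging, and
wait_readable() / select_demo() are hypothetical names.  The struct
assignment is the per-call copy being discussed.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd1 or fd2 is readable.  "master" plays
 * the role of rfds[0] and "scratch" of rfds[1]: select() overwrites the set
 * it is given, so scratch is refilled from master before the call.
 * Returns the readable descriptor, or -1 on error or timeout.
 */
static int wait_readable(int fd1, int fd2)
{
    fd_set master, scratch;
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 }; /* demo: don't hang */
    int maxfd = fd1 > fd2 ? fd1 : fd2;

    /* FD_SET() on a descriptor >= FD_SETSIZE is undefined behaviour. */
    assert(fd1 >= 0 && fd2 >= 0 && maxfd < FD_SETSIZE);

    FD_ZERO(&master);
    FD_SET(fd1, &master);
    FD_SET(fd2, &master);

    scratch = master;             /* copies sizeof(fd_set) bytes, whatever maxfd is */
    if (select(maxfd + 1, &scratch, NULL, NULL, &tv) <= 0)
        return -1;
    if (FD_ISSET(fd1, &scratch))  /* test the result set, not master */
        return fd1;
    if (FD_ISSET(fd2, &scratch))
        return fd2;
    return -1;
}

/* Demo: two pipes stand in for sockets; only the second one has data. */
static int select_demo(void)
{
    int a[2], b[2], ok = 0;

    if (pipe(a) != 0)
        return -1;
    if (pipe(b) != 0) {
        close(a[0]); close(a[1]);
        return -1;
    }
    if (write(b[1], "x", 1) == 1)
        ok = (wait_readable(a[0], b[0]) == b[0]);
    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return ok;
}
```

select_demo() returns 1 when select() reports only the pipe that has data.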

Now consider that the average socket read is a bit under 1500 bytes, the
standard Ethernet MTU, and that on a lightly loaded system select() returns
an average of 1-3 file descriptors even in the face of hundreds of open
files.  That means the copy of two fd_sets happens every 1000-1300 bytes or
so.  That makes this array copy a significant optimization target for a
network program, particularly on slower processors like that DECstation in
the corner of my room.
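The per-iteration cost of those copies is simple arithmetic.  A small
sketch, where copy_bytes_per_select() is a hypothetical helper and setsize
is assumed to be a multiple of the fd_mask word size, as 256, 1024 and 2048
are:

```c
#include <limits.h>
#include <stddef.h>

/*
 * Bytes moved by the struct copies ("rfds[1] = rfds[0]; efds[1] = efds[0];")
 * in one pass of the select() loop, for nsets sets of setsize bits each.
 * Real headers round fd_set up to whole fd_mask words; for word-multiple
 * sizes the result is exact.
 */
static size_t copy_bytes_per_select(size_t nsets, size_t setsize)
{
    return nsets * (setsize / CHAR_BIT);
}
```

With two sets that is 64 bytes per select() at the current 256, but 512
bytes at 2048, close to half of the 1000-1300 bytes of socket data each
select() is estimated to deliver.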

...As I stated above, poll() offers a much better solution: it has no
FD_SETSIZE-style limit, and its pollfd array is directly reusable, since the
kernel writes results only into the revents field (no recopying operation
like select() above).  Fortunately, more programs are moving to poll(),
including libresolv (as reported by nm of libresolv.so on my i386-current
ELF machine).
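For comparison, a sketch of the same wait written with poll(), again with
pipes standing in for sockets and a hypothetical poll_demo() helper.  The
pollfd array is built once and passed to poll() twice without being
rebuilt: the kernel fills in revents and leaves fd and events alone, so no
master copy is needed, and the array grows with the number of descriptors
watched rather than with a compile-time FD_SETSIZE.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns how many of two poll() rounds reported only
 * the second pipe as readable (2 expected), or -1 if a pipe can't be made.
 */
static int poll_demo(void)
{
    int a[2], b[2], hits = 0;

    if (pipe(a) != 0)
        return -1;
    if (pipe(b) != 0) {
        close(a[0]); close(a[1]);
        return -1;
    }

    struct pollfd pfd[2] = {
        { .fd = a[0], .events = POLLIN },
        { .fd = b[0], .events = POLLIN },
    };

    /* Only the second pipe gets data; it is never read, so it stays ready. */
    if (write(b[1], "x", 1) == 1) {
        for (int round = 0; round < 2; round++) {
            /* No re-initialization between calls: events is untouched. */
            if (poll(pfd, 2, 1000) > 0 &&
                (pfd[1].revents & POLLIN) && !(pfd[0].revents & POLLIN))
                hits++;
        }
    }
    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return hits;
}
```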

-- 
-- Todd Vierling (tv@pobox.com)