Subject: lib/6379: RPC "backlog" isn't big enough
To: None <gnats-bugs@gnats.netbsd.org>
From: C Kane <ckane@best.com>
List: netbsd-bugs
Date: 10/30/1998 00:20:02
>Number:         6379
>Category:       lib
>Synopsis:       RPC "backlog" isn't big enough
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people (Library Bug People)
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Fri Oct 30 00:35:01 1998
>Last-Modified:
>Originator:     C Kane
>Organization:
>Release:        NetBSD-current, last update occurred Thu Oct 29 05:04:20 1998
>Environment:
System: NetBSD ckane5 1.3H NetBSD 1.3H (ckane5) #2: Mon Oct 19 21:52:45 PDT 1998 root@ckane5:/usr/netbsd-current/src/sys/arch/i386/compile/ckane5 i386


>Description:
        While stress-testing the NIS system, I get errors like this:

        ypcat: no such map group.byname.  Reason: RPC failure

>How-To-Repeat:
        We have a large environment where many systems might simultaneously
        be attempting to do "ypcat group" (as when someone logs in or
        when cron kicks off jobs on multiple systems at the same time).

        To simulate this load, I've tried running a test like this:

        for i in 1 2 3 4 5 6 7 8 9 10
        do
          ypcat group | wc -l &
        done

        We get failures with as few as 20 simultaneous jobs.
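
        A parameterized form of the loop above might look like the sketch
        below. It is only an illustration: run_parallel is a hypothetical
        helper name, not an existing tool, and it assumes the command being
        tested exits non-zero when it fails.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetBSD): start N copies of a command
# in the background, wait for each one, and print how many failed.
# Usage: run_parallel N command [args...]
run_parallel() {
    n=$1
    shift
    pids=""
    i=0
    # Start all N jobs without waiting, so they compete for connections.
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    # Collect the exit status of each job and count the failures.
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}
```

        For example, `run_parallel 20 ypcat group` would print the number
        of the 20 jobs that failed, assuming ypcat returns a non-zero exit
        status on an RPC failure.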

        The group map is large:  `ypcat group | wc` gives:  376 376 67173

        The problem can be traced to the RPC library routines that ypcat
        uses.  When the library tries to connect() a socket to portmap,
        the call fails with errno ECONNREFUSED.

        I believe this fails because portmap uses the standard libc RPC
        routines to open its listening socket, and they call listen()
        with a backlog of only two.

        I edited /usr/src/lib/libc/rpc/svc_tcp.c, line 168, from:
                (listen(sock, 2) != 0)) {
        to
                (listen(sock, 25) != 0)) {
        This results in much better performance.

        Although not all the jobs run simultaneously (some finish before
        the rest have started), I've tried starting up to 140 jobs at
        once and gotten no failures.

        Why is this value set to 2, and what problems might setting it
        higher cause?  For real production work in my environment,
        I think I'd want to change the '25' to something even higher,
        like '256'.  I'd prefer no "ECONNREFUSED" errors at all, within
        reason.
>Fix:
        A possible fix is given above.

>Audit-Trail:
>Unformatted: