Subject: Re: 3.0 YP lookup latency
To: None <jonathan@dsg.stanford.edu>
From: Christos Zoulas <christos@zoulas.com>
List: tech-net
Date: 06/21/2006 12:52:49
On Jun 21,  9:42am, jonathan@dsg.stanford.edu (jonathan@dsg.stanford.edu) wrote:
-- Subject: Re: 3.0 YP lookup latency

| 
| In message <20060621130038.CA56256534@rebar.astron.com>Christos Zoulas writes
| >On Jun 20,  2:48pm, smj@cirr.com (Stephen Jones) wrote:
| >-- Subject: Re: 3.0 YP lookup latency
| >
| >| Okay, this daily build (netbsd-3-0/200605220000Z) is displaying the
| >| symptoms. It doesn't appear to be in rpcbind or ypbind itself, but in
| >| an associated library. It seems that when the query is made, the ypserv
| >| process on the server will log each user id in the password file when
| >| the -l flag is on. Interestingly, the userids are sorted in
| >| alphabetical order .. any reason for this? ;-)
| >
| >My guess is that it is an artifact of the passwd file being hashed into
| >a db file. I guess I'll have to setup a yp domain myself and test.
| 
| hi Christos,
| 
| I don't buy your guess. As Stephen clarified (after the message to
| which you replied):
| 
| On Stephen's NetBSD-2.1 NIS client hosts, the client is issuing a
| single yp_match() call to the server.  But on Stephen's 3.0 clients,
| running the same userland tool (which, barring explicit size_t casts,
| hasn't changed between NetBSD-2 and NetBSD-3), the client iterates over
| Stephen's entire 27,000-entry NIS passwd.byname map, via yp_first()/yp_next().
| That's where the tens of seconds bites: not one individual RPC call
| but the 27,000-odd yp_next() calls.
| 
| I'm pretty sure the bug is in the client NIS library. If you look at
| 	lib/libc/gen/getpwent.c
| 
| you will notice that file changed radically between (CVS branches)
| netbsd-2 and netbsd-3.  getpwent.c only calls yp_match() or
| yp_first()/yp_next() in a couple of places.  Late last night I emailed
| you and Stephen and Soda-san a walk through the relevant code-paths. I
| think I've identified the problem, and suggested a workaround: don't
| use the supplied default nsswitch.conf
| 
| 	passwd:  compat
| 
| line, but instead use
| 
| 	passwd: nis [notfound=return] files
| 
| which avoids the compat_ parsing routine where I'm pretty sure this bug
| resides.  I also suggested a fix (assuming the workaround does fix
| Stephen's problem).
| 
| 
| In message Message-Id: <20060621130447.2B24C56534@rebar.astron.com>,
| Christos Zoulas continued:
| 
| >Sure I would be happy to work with you to resolve this. The first thing
| >to do is to ktrace both the server and the client process and then do
| >a kdump -R to see between which 2 system calls we have the most delay.
| 
| I understand why you ask (I initially asked for a libpcap trace).
| But, given Stephen's observation about his NIS-server logs I don't
| think either one would help.  yp_match() and yp_first()/yp_next() are
| libc functions, not system calls. So a ktrace would show the
| read()/write() calls for the 27,000-odd yp_next() calls which we know
| (from Stephen's server-side logs) the NetBSD-3.0 NIS client is issuing.

Yes, it should show the difference between the # of calls in 3.x and
the number of calls in 2.x. But I am convinced that the reason is what
you mentioned; it should not iterate through the whole map.
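
To make the difference in call counts concrete, here is a minimal
sketch in Python. It is a stand-alone simulation, not the libc code:
FakeMapServer, make_passwd_map, lookup_by_match and lookup_by_scan are
hypothetical names invented for illustration, and the map contents are
made up. It only models the two access patterns discussed above (one
keyed lookup in the style of yp_match(), versus walking the map in the
style of yp_first()/yp_next()) and counts the requests each one sends.

```python
class FakeMapServer:
    """Hypothetical stand-in for a NIS server holding one map.

    It only counts how many requests it serves; no RPC is involved.
    """

    def __init__(self, entries):
        self.entries = dict(entries)
        self.keys = list(self.entries)
        # Position of each key, so next() does not rescan the key list.
        self.position = {key: i for i, key in enumerate(self.keys)}
        self.requests = 0

    def match(self, key):
        """One keyed lookup (the yp_match()-style pattern)."""
        self.requests += 1
        return self.entries.get(key)

    def first(self):
        """Return the first (key, value) pair, or None for an empty map."""
        self.requests += 1
        if not self.keys:
            return None
        key = self.keys[0]
        return key, self.entries[key]

    def next(self, prev_key):
        """Return the pair following prev_key, or None at end of map."""
        self.requests += 1
        i = self.position[prev_key] + 1
        if i >= len(self.keys):
            return None
        key = self.keys[i]
        return key, self.entries[key]


def make_passwd_map(n):
    """Build a made-up passwd.byname-like map with n entries."""
    return {f"user{i}": f"user{i}:*:{1000 + i}:100::/home/user{i}:/bin/sh"
            for i in range(n)}


def lookup_by_match(server, name):
    """Ask the server for exactly one key: a single request."""
    return server.match(name)


def lookup_by_scan(server, name):
    """Walk the map entry by entry until name is found: one request per entry."""
    item = server.first()
    while item is not None:
        key, value = item
        if key == name:
            return value
        item = server.next(key)
    return None


if __name__ == "__main__":
    entries = make_passwd_map(27000)
    for lookup in (lookup_by_match, lookup_by_scan):
        server = FakeMapServer(entries)
        lookup(server, "user26999")
        print(lookup.__name__, "requests:", server.requests)
```

In this model the keyed lookup costs one request regardless of map
size, while the scan costs one request per entry visited, which is the
kind of gap a per-call count from ktrace or the ypserv -l log would
expose.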

christos