Subject: bin/32519: ypbind spams syslog
To: None <gnats-admin@netbsd.org, netbsd-bugs@netbsd.org>
From: None <dholland@eecs.harvard.edu>
List: netbsd-bugs
Date: 01/14/2006 07:15:01
>Number:         32519
>Category:       bin
>Synopsis:       ypbind spams syslog if it loses connectivity
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 14 07:15:00 +0000 2006
>Originator:     David A. Holland <dholland@eecs.harvard.edu>
>Release:        NetBSD 3.99.10 (-20051026) (ypbind.c 1.53)
>Organization:
   Harvard EECS
>Environment:
System: NetBSD tanaqui 3.99.10 NetBSD 3.99.10 (TANAQUI) #3: Wed Oct 26 18:52:27 EDT 2005 root@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:
	A couple of days ago we had network maintenance in our building
	overnight, and so everything lost connectivity for a while. I
	came back the next morning to find the system log stuffed with
	messages like these:


Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
					/var/yp/binding/[domain].ypservers
Jan 11 07:30:26 tanaqui ypbind[307]: broadcast: sendto: Host is down
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
					/var/yp/binding/[domain].ypservers

	(Yes, more than one of these groups per second...)

	The problem, or at least a problem, appears to be this code:

		if (ypdb->dom_check_t >= now) {
			/* don't flood it */
			ypdb->dom_check_t = 0;
			check++;
		}

	(lines 298-302 of ypbind.c 1.53)

	Setting dom_check_t to 0 forces a check of the server at the
	next call to checkwork(). This happens at least once a second,
	but also after servicing rpc calls. So when the network goes
	down and processes start contacting ypbind, it'll discover
	right away that it's lost all the servers. However, once
	things are dead every subsequent request it gets will trigger
	another try at contacting the servers, possibly many times per
	second. I think this is what I was seeing.

	(And of course every time it tries to contact servers and
	fails, it prints assorted messages to the system log.)

>How-To-Repeat:

	Find a system using ypbind (that isn't also a yp server),
	pull the network plug, and wait a while. If necessary, have
	some cron jobs or the like start up while the network is down.

>Fix:

	I think the right answer is to make the test

		(ypdb->dom_check_t >= now && ypdb->dom_alive == 1)

	as if everything's already been found to be unreachable
	dom_alive will apparently be 2. However, I could easily be
	missing something and I'd like someone else to think about it.
	So I'm not including a patch. :-)

	Even with this issue fixed ypbind will still probably spam the
	system log once every five seconds (the traditional behavior)
	and it would be nice to have some exponential backoff on that,
	and/or have it suppress the error messages after the first
	time. However, this is a separate problem and I'd be inclined
	right now to leave it for later.

	I do however suggest the following patch to fix a misleading
	log message:

Index: ypbind.c
===================================================================
RCS file: /cvsroot/src/usr.sbin/ypbind/ypbind.c,v
retrieving revision 1.53
diff -u -r1.53 ypbind.c
--- ypbind.c	30 Oct 2004 15:57:43 -0000	1.53
+++ ypbind.c	14 Jan 2006 06:52:28 -0000
@@ -774,7 +774,7 @@
 
 		if (sendto(rpcsock, buf, outlen, 0, (struct sockaddr *)&bindsin,
 			   sizeof bindsin) == -1)
-			yp_log(LOG_WARNING, "broadcast: sendto: %m");
+			yp_log(LOG_WARNING, "nag_servers: sendto: %m");
 	}
 
 	switch (ypbindmode) {


	(And I know, the real fix for these problems is to not run
	ypbind. Unfortunately, that's not always an option.)