Subject: bin/572: rare bootpd + arp failures
To: None <gnats-admin@sun-lamp.cs.berkeley.edu>
From: None <jarle@ed.unit.no>
List: netbsd-bugs
Date: 11/13/1994 03:20:05
>Number:         572
>Category:       bin
>Synopsis:       rare bootpd + arp failures
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    gnats-admin (Utility Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Nov 13 03:20:04 1994
>Originator:     Jarle F. Greipsland
>Organization:
Department of Redundancy Department
"	"
>Release:        1.0
>Environment:
	
System: NetBSD hood.ed.unit.no 1.0 NetBSD 1.0 (HOOD) #2: Mon Nov 7 00:42:11 MET 1994 jarle@hood.ed.unit.no:/usr/src/sys/arch/i386/compile/HOOD i386
Plain NetBSD 1.0
Intel EISA machine, 486/66, 3 WD8013 boards, bootp-serves about 30 OS/2 
boxes on one segment, approx. 20 on the other one.  Acts as gateway for 
these machines towards the rest of the world.  Some mail traffic.

>Description:
	
Every now and then a bootp request from one of the clients will fail, and 
bootpd enters the following message into /var/log/messages:
Nov 10 10:22:32 hood bootpd[25500]: arp failed, exit code=0x100
and sometimes (but only very rarely) the following message also appears:
Nov 10 13:14:14 hood /netbsd: arptnew failed on 0
I don't know if these two are correlated, but I thought I'd mention it.
Anyway, it seems that whenever bootpd tries to enter an already existing 
entry into the table, 'arp' does an exit(1), and bootpd interprets this as an
error, and quits.  My first fix was, if the 'arp -s' failed, to check whether
'arp hostname' would succeed (i.e. the address is already there).  That
worked mostly OK, but sometimes the clients would still hang.  After some more
digging, it seems that the entries in the arp table time out, and any
lookup at an inappropriate time gives an entry of the form: 
oci.ed.unit.no (129.241.180.120) at (incomplete)
bootpd doesn't seem to work well with one of these.  So my final fix is, if
the first 'arp -s' fails, to delete the entry and then reinsert it.  It
now seems to work.
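
For reference, the 0x100 in the bootpd message is the raw value returned by
system(), not the exit code of 'arp' itself.  A minimal sketch of decoding
it, assuming the usual <sys/wait.h> macros; exit_code_of is a hypothetical
helper name and is not part of bootpd:

```c
#include <sys/wait.h>

/*
 * Hypothetical helper (not in bootpd): translate a system()/wait()
 * status into the child's exit code.  Returns -1 if the command could
 * not be run or did not terminate through exit().
 */
static int
exit_code_of(int status)
{
	if (status == -1)		/* system() itself failed */
		return -1;
	if (WIFEXITED(status))		/* normal termination */
		return WEXITSTATUS(status);
	return -1;			/* killed by a signal, etc. */
}
```

With the traditional status encoding, exit_code_of(0x100) gives 1, which
matches the exit(1) from 'arp' described above.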
>How-To-Repeat:
Run a bootp server with many clients for a long time.  Sorry for the lousy
How-To-Repeat section, but I cannot reproduce this at will.
	
>Fix:
	
Apply diff.  I'm not sure if this is the proper way to solve this problem, 
but I was in a hurry, with people talking behind my back about switching to 
Solaris or Linux.  Anyway, a suggestion for a more permanent fix would be 
to pull most of the functionality in the arp program into a library, and 
let the programs that need to talk with the arp-table use the library 
instead of an external program.  Oh well.
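
As a rough illustration of the set / delete / reinsert sequence, here is a
sketch that keeps the command runner behind a function pointer, so the logic
can be tried without touching the real arp table.  The names run_fn,
arp_set_with_retry and fake_run are hypothetical and do not exist in bootpd;
passing system as the runner would give the same behaviour as the patch:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook: anything with the same shape as system(3). */
typedef int (*run_fn)(const char *cmd);

/*
 * Try "arp -s"; if it fails, delete the (possibly stale or incomplete)
 * entry and set it once more.  Returns 0 on success, the status of the
 * retry on failure, or -1 if a command line did not fit its buffer.
 */
static int
arp_set_with_retry(run_fn run, const char *ip, const char *hw)
{
	char set_cmd[128], del_cmd[128];
	int n;

	n = snprintf(set_cmd, sizeof(set_cmd), "arp -s %s %s temp", ip, hw);
	if (n < 0 || (size_t)n >= sizeof(set_cmd))
		return -1;
	n = snprintf(del_cmd, sizeof(del_cmd), "arp -d %s", ip);
	if (n < 0 || (size_t)n >= sizeof(del_cmd))
		return -1;

	if (run(set_cmd) == 0)		/* first attempt worked */
		return 0;
	(void)run(del_cmd);		/* delete entry, ignore result */
	return run(set_cmd);		/* and set it again */
}

/* Stand-in runner for testing: the first call fails, later calls succeed. */
static int fake_calls = 0;

static int
fake_run(const char *cmd)
{
	(void)cmd;
	return (fake_calls++ == 0) ? 0x100 : 0;
}
```

Calling arp_set_with_retry(system, ...) would shell out to arp as before;
calling it with fake_run exercises the retry path without needing root.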

						-jarle
----
"... except from the fact that it doesn't work, what do you think about the
     program?"

*** /usr/src/usr.sbin/bootpd/hwaddr.c.orig	Thu Nov 10 11:45:22 1994
--- /usr/src/usr.sbin/bootpd/hwaddr.c	Thu Nov 10 19:11:21 1994
***************
*** 136,141 ****
  		report(LOG_INFO, buf);
  	status = system(buf);
! 	if (status)
! 		report(LOG_ERR, "arp failed, exit code=0x%x", status);
  	return;
  #endif	/* SIOCSARP */
--- 136,149 ----
  		report(LOG_INFO, buf);
  	status = system(buf);
! 	if (status) {								/* arp set failed! */
! 		int status2;
! 		sprintf(buf, "arp -d %s", inet_ntoa(*ia));
! 		(void)system(buf);						/* delete entry */
! 		sprintf(buf, "arp -s %s %s temp",
! 				inet_ntoa(*ia), haddrtoa(ha, len));
! 		status2 = system(buf);					/* and set it again */
! 		if (status2)
! 			report(LOG_ERR, "arp failed, exit code=0x%x", status2);
! 	}
  	return;
  #endif	/* SIOCSARP */
>Audit-Trail:
>Unformatted:

bootpd + arp fail to insert entry into the arp table
sw-bug