port-sparc: Update: Re: "arptnew failed on <addr>" revisited

Subject: Update: Re: "arptnew failed on <addr>" revisited
To: None <port-sparc@NetBSD.ORG>
From: Greg Earle <earle@isolar.Tujunga.CA.US>
List: port-sparc
Date: 02/08/1995 19:24:42

I said:
> A while back I mentioned that under 1.0, I was running into a lot of problems
> with not being able to connect to certain hosts on my subnet, and when I
> tried, it would fail with errors like
> 
> Jan 30 16:08:55 netbsd4me /netbsd: arptnew failed on 8095180b
> Jan 30 16:08:55 netbsd4me /netbsd: arpresolve: can't allocate llinfo
> 
> I went into if_ether.c and decided to dig a little further.
> 
> It looks like some kind of problem adding LLC entries to the routing table
> and ARP tables for some hosts of ours which are dual-homed gateways ...

[ Long-winded explanation elided ]

> At this point I'm at a thin brick wall.  It's almost as if "routed"
> starts up, and it gets these 2 routes from a dual-homed host, and the lower
> level code says "Wait, I'm looking for link-level info first, I don't want
> yer routes yet!" and it fails because it doesn't have the LLC stuff first.
...
> On one hand, I can't believe I'm the only person using NetBSD 1.0 on a subnet
> with other gateways.  On the other hand, maybe I'm the only one on a subnet
> where the other dual-homed hosts are emitting these 2 host routes for each
> of their interfaces, so perhaps it is something in this kind of environment
> that is confusing the networking code.

I'm now convinced that this is an accurate description of the problem, and
that -current is no different in this regard, since if_ether.c has not changed
one iota (save CVS id) from 1.0 to now.

My debug 1.0 kernel says:

Feb  8 15:38:21 netbsd4me /netbsd: arplookup: rt_flags & RTF_GATEWAY
Feb  8 15:38:21 netbsd4me /netbsd: arplookup: rt_flags & RTF_LLINFO == 0
Feb  8 15:38:21 netbsd4me /netbsd: arplookup: rt->rt_gateway->sa_family != AF_LINK
Feb  8 15:38:21 netbsd4me /netbsd: arptnew failed on 8095180b
Feb  8 15:38:21 netbsd4me /netbsd: arpresolve: can't allocate llinfo

If I then remove one of the two host routes to 128.149.24.11 (0x8095180b),
I can immediately connect to it from then on.  Presumably this is because,
once the route is gone, the next reference to the host is handled like a
reference to any other host without an ARP table entry: it is ARP'd for and
resolved (which I presume generates the "link" type routing table entry).
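
For anyone matching the hex value in the log line against a host address,
here is a minimal sketch of the conversion.  It assumes the logged value is
the 32-bit IPv4 address with the most significant byte first; the function
name hex_to_dotted() is made up for illustration and is not part of
if_ether.c.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 32-bit IPv4 address, given as an integer such as 0x8095180b
 * (most significant byte first), as dotted-quad text in buf.
 * Hypothetical helper, for illustration only.
 */
const char *
hex_to_dotted(uint32_t addr, char *buf, size_t len)
{
	snprintf(buf, len, "%u.%u.%u.%u",
	    (unsigned)((addr >> 24) & 0xff),
	    (unsigned)((addr >> 16) & 0xff),
	    (unsigned)((addr >> 8) & 0xff),
	    (unsigned)(addr & 0xff));
	return buf;
}
```

Called with 0x8095180b and a 16-byte buffer, this yields 128.149.24.11,
which is the host named in the log messages above.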

I am convinced that the if_ether.c code demands that a host route to a host
which is on the same subnet as the interface the route is on be a "link" route.
If it receives a route via RIP from a host which declares a "normal" host
route (i.e., equivalent to "route add host foo foo 0") before having a "link"
route to said host in the routing table, it croaks as per the above.
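
To make that concrete, here is a small user-space model of the three
conditions my debug messages report.  It is only a sketch, not the kernel
code: the struct, the SK_ names, and the function are all invented for
illustration, and the constant values are stand-ins chosen to mirror the
usual BSD header values.

```c
/* Stand-in constants (assumed to mirror the BSD header values). */
#define SK_RTF_GATEWAY	0x2	/* destination is reached via a gateway */
#define SK_RTF_LLINFO	0x400	/* route carries link-level info */
#define SK_AF_INET	2
#define SK_AF_LINK	18

/* Reasons a route is rejected for ARP use; 0 means none. */
#define SK_FAIL_GATEWAY		0x1
#define SK_FAIL_NO_LLINFO	0x2
#define SK_FAIL_NOT_LINK	0x4

/* Hypothetical, stripped-down view of a routing table entry. */
struct sk_route {
	int	flags;		/* SK_RTF_* bits */
	int	gw_family;	/* address family of the gateway sockaddr */
};

/*
 * Return a bitmask of the reasons this route cannot serve as an ARP
 * entry, one bit per debug message in the log above.
 */
int
sk_arp_reject_reasons(const struct sk_route *rt)
{
	int why = 0;

	if (rt->flags & SK_RTF_GATEWAY)
		why |= SK_FAIL_GATEWAY;
	if ((rt->flags & SK_RTF_LLINFO) == 0)
		why |= SK_FAIL_NO_LLINFO;
	if (rt->gw_family != SK_AF_LINK)
		why |= SK_FAIL_NOT_LINK;
	return why;
}
```

In this model a "link" route (SK_RTF_LLINFO set, gateway family
SK_AF_LINK) gives 0, while a RIP-learned host route with SK_RTF_GATEWAY set
and an SK_AF_INET gateway trips all three checks, matching the three
arplookup lines I see before "arptnew failed".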

It seems like this should be somewhat easy to fix.  I presume all the
necessary changes are in if_ether.c, and that they amount to some twiddling
of masks et al., along with maybe a little code.  If anyone has any pointers
on where these changes should be made, drop me a line.  I'm particularly
interested in this piece of if_ether.c::arp_rtrequest():

	...
	if (rt->rt_flags & RTF_GATEWAY)
		return;
	switch (req) {

	case RTM_ADD:
	...
	case RTM_RESOLVE:
		if (gate->sa_family != AF_LINK ||
		    gate->sa_len < sizeof(null_sdl)) {
			log(LOG_DEBUG, "arp_rtrequest: bad gateway value");
			break;
		}
	...

I have this vague suspicion that if the req is RTM_RESOLVE without a previous
RTM_ADD, perhaps the RTM_ADD case should be run first if gate->sa_family is
not AF_LINK?

	- Greg