Subject: Re: stability problems with NetBSD/sparc 1.3.2
To: None <port-sparc@netbsd.org, netbsd-users@netbsd.org>
From: Greg A. Woods <woods@most.weird.com>
List: port-sparc
Date: 10/22/1998 03:31:14
[ On Thu, October 22, 1998 at 02:47:43 (-0400), Charles M. Hannum wrote: ]
> Subject: Re: stability problems with NetBSD/sparc 1.3.2
>
> Do you have some problem with grep(1)ing the executables?  This is
> *not* a hard problem to solve.
> 
> find / /usr -xdev -type f | xargs grep -lo 'YOU SHOULD NOT BE HERE'

Unfortunately that's just the sort of heavy beating on the disk that,
according to recent experience, is likely to cause a crash....  If the
system makes it through the /etc/daily run that it's doing right now I
may start up a 'nice'd global grep before I go to bed.  The last crash
happened when I tried to count the number of lines of code in the entire
kernel tree so I could do a simple comparison between NetBSD and the
reported 37 million lines of code in Microsoft NT's "microkernel" alone.
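
For reference, a low-priority version of that global search might look
something like the sketch below.  The function name scan_for_marker is
made up for illustration, and the use of "find ... -exec ... {} +" in
place of xargs is an assumption about the local find(1); it only lowers
CPU priority, so it will not necessarily reduce the load on the disk.

```shell
#!/bin/sh
# Hypothetical helper: list every regular file under the given
# directories that contains the marker string, without crossing
# filesystem boundaries, and at the lowest scheduling priority.
#
# usage: scan_for_marker 'string' dir [dir ...]
scan_for_marker() {
    marker=$1
    shift
    # grep is started by find, so it inherits the niced priority.
    # -F treats the marker as a fixed string, -l prints file names only.
    nice -n 19 find "$@" -xdev -type f \
        -exec grep -l -F -- "$marker" {} +
}
```

It would be invoked as "scan_for_marker 'YOU SHOULD NOT BE HERE' / /usr",
and prints one matching file name per line.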

> > Speaking of reboots -- I still seem to have to manually publish the arp
> > entry for my diskless workstation else it can't seem to boot itself more
> > than once.  It gets stuck with an "(incomplete)" entry.
> 
> This was allegedly fixed before 1.3 was released.  Are you sure you're
> running the release version of rarpd?

Neither the NetBSD-1.3.1 rarpd (aka 1.21.2.1) nor the -current rarpd
sup'ed as of Sun Jul 26 07:48:34 EDT 1998 (aka 1.26) has improved this
situation.  Something in either the 1.3.2 kernel or rarpd.c v1.26, or
possibly the combination, does seem a bit more robust though, as I no
longer have to reboot the server in order to get the manual publishing
trick to work.

The only differences between the 1.3.1 and 1.3.2 rarpds seem to be the
checks for failures of rarp_open() and a fix to skip non-ethernet
interfaces if '-a' is given.  These changes shouldn't make any
difference in my situation, especially since I never see the errors
which they are designed to avoid.  I only tried the 1.3.2 release binary
once, without much luck, so for now I'm still running the v1.26 copy.
Maybe I didn't give it enough of a chance and it does in fact fix the
problem....  Even so, that wouldn't explain why a fix that supposedly
went in all the way back in 1.3 doesn't work.

With the '-l' flag rarpd syslogs that it's doing the right thing, but
the change never shows up in the kernel table.  The reply gets to the
client too and tftpd gets the request for the boot file -- unfortunately
tftpd can't send to the client because the entry is incomplete and thus
eventually it simply gives up and syslogs "tftpd: write: Host is down".
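
One way to catch the stuck entry before tftpd gives up is to filter the
kernel ARP table for "(incomplete)" lines.  The following is only a
sketch: the function name is invented, and it assumes the usual BSD
"host (address) at ..." layout of "arp -an" output, which may differ on
other systems.

```shell
#!/bin/sh
# Hypothetical helper: read "arp -an"-style output on stdin and print
# the IP address of every entry still marked "(incomplete)".
incomplete_arp_entries() {
    # Assumed line layout:  name (a.b.c.d) at (incomplete) ...
    sed -n 's/^[^(]*(\([0-9.]*\)) at (incomplete).*$/\1/p'
}

# Typical use on the boot server:
#   arp -an | incomplete_arp_entries
# Any address it prints can then be published by hand, e.g.
#   arp -s <client> <ethernet-address> pub
# where <client> and <ethernet-address> are placeholders for the
# diskless workstation's name and hardware address.
```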

At some point soon I'll try the new -current revision (i.e. 1.30) which
does have some significant changes.  Unfortunately this requires testing
on what I consider my production environment so it probably means
waiting for the next crash and then hoping I have time to fiddle with it
a bit before I get things back on-line again.

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>