netbsd-help: Re: re0 failing after reboot. / Re: diskless woes

Subject: Re: re0 failing after reboot. / Re: diskless woes - init not recognized
To: None <tech-kern@netbsd.org, netbsd-help@netbsd.org>
From: theo borm <theo4490@borm.org>
List: netbsd-help
Date: 05/21/2005 00:04:18
Allen Briggs wrote:

>On Fri, May 20, 2005 at 11:10:14AM +0200, theo borm wrote:
>
>>Sometimes, on a software initiated reboot, my realtek 8169s's fail.
>>
>>relevant dmesg output:
>>
>>re0 at pci0 dev 10 function 0: RealTek 8169S Single-chip Gigabit Ethernet
>>ukphy0 at re0 phy 7: Generic IEEE 802.3u media interface
>>re0: diagnostic failed, failed to receive packet in loopback mode
>>re0: attach aborted due to hardware diag failure
>
>Have you tried using rgephy instead of ukphy ?
>
>I just started playing with re(4) on PowerPC.  One oddity that I noticed
>is that it doesn't autonegotiate properly until NetBSD configures the
>card.  On first powerup, it shows 10Mbit.  On a warm reboot, it shows
>100Mbit.  All on a 1000Mbit switch.  When NetBSD configures the card,
>it shows 1000Mbit (the lights don't show the duplex setting).
>
>-allen
>
>  
>
Thanks a lot for the hint. Unfortunately, unless there is a compelling reason
to reboot the cluster sooner, I will not be able to test it until June 14th,
when the next maintenance window is scheduled, but I definitely will; the way
it is now is totally unsatisfactory.
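
For checking which nodes hit the failure quoted above, something along these
lines might help. It is only a minimal sketch, assuming the dmesg output of
each node has already been collected as text; the function name
failed_re_interfaces is hypothetical, and the pattern is taken from the dmesg
lines quoted above.

```python
import re

# Pattern copied from the dmesg line quoted earlier in this message:
#   re0: attach aborted due to hardware diag failure
DIAG_FAILURE = re.compile(
    r"^(re\d+): attach aborted due to hardware diag failure"
)


def failed_re_interfaces(dmesg_text):
    """Return the names of re(4) interfaces whose attach was aborted.

    dmesg_text is the full dmesg output of one node as a single string.
    """
    failed = []
    for line in dmesg_text.splitlines():
        match = DIAG_FAILURE.match(line.strip())
        if match:
            failed.append(match.group(1))
    return failed
```

An empty list means no re(4) interface reported the diagnostic failure in
that node's dmesg output.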

I thought that migrating a diskless cluster from 1.6.2 -> 2.0.2 would be
simple and straightforward; a one-day job, resulting in gigabit connectivity
and better NFS support. Unfortunately it did not work out that way, possibly
because init is not statically linked in the 2.0.x series install, as
suggested by Manuel Bouyer on the netbsd-help mailing list.

Thanks for that hint, Manuel; I should really start reading the release
notes. Unfortunately, it came too late; at that time I had only two options
left: revert to 1.6.2 or add local disks to all cluster nodes. I took the
latter route, grabbing every spare disk I could find, and spent the morning
dd-ing, only to find that occasionally cluster nodes wouldn't "come up"
because of this autonegotiation problem. Add to this that rebuilding
installed packages from pkgsrc was also not proceeding without trouble.
Apparently the "unsorted" list of source sites /is/ sorted, which means that
a certain sourceforge mirror gets hit by a lot of traffic, and as the mirrors
are supposed to be "regional", they started dropping my connection, but not
until they had left my downloads dangling for (sometimes) 20 minutes, so I
ended up editing fetch lists as well. :-/
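
The idea of spreading the load over the mirrors, instead of every node
hitting the first site in a fixed order, can be sketched roughly as follows.
This is not how pkgsrc does it; it is only an illustration, and the names
shuffled_sites, fetch_from_any and fetch_one are hypothetical. fetch_one is
assumed to be a caller-supplied function that tries a single site with its
own timeout, returns True on success, and raises OSError when the connection
is dropped or times out.

```python
import random


def shuffled_sites(sites, rng=None):
    """Return a copy of the site list in random order (input is untouched)."""
    if rng is None:
        rng = random.Random()
    result = list(sites)
    rng.shuffle(result)
    return result


def fetch_from_any(sites, fetch_one, rng=None):
    """Try the sites in random order and return the first one that works.

    fetch_one(site) is a hypothetical callback: it returns True on success,
    False on failure, and may raise OSError (dropped connection, timeout).
    Returns None if no site succeeded.
    """
    for site in shuffled_sites(sites, rng):
        try:
            if fetch_one(site):
                return site
        except OSError:
            # Dropped or dangling connection: move on to the next site.
            continue
    return None
```

With a short timeout inside fetch_one, a mirror that leaves the download
dangling costs seconds rather than 20 minutes before the next one is tried.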

Needless to say, I was getting nervous. (no panics though ;-) )

For me, the take-home messages are to be a /lot/ more cautious next time,
that I should get extra hardware for testing purposes, and that heterogeneous
clusters will probably be a maintenance/upgrade nightmare.

Eventually I will (hopefully) get rid of the local root disks again; they
are a proverbial pain to maintain (think about adding users/changing
passwords: a very simple wrapper script handled this on the 1.6.2
diskless cluster; now it is much more complex), plus they are a liability
in a 24/7 environment with a 40 C ambient temperature (computers
produce a lot of heat).

Thank you all for all the answers: in four weeks' time I'll have three days
to conduct some tests. If something interesting comes out, I'll let you
know.

cheers, Theo