Subject: diskless woes - init not recognized
To: None <netbsd-help@netbsd.org>
From: theo borm <theo_nbsdhelp@borm.org>
List: netbsd-help
Date: 05/19/2005 23:06:11
Dear list members,

I've just spent 2 days trying to figure out what is going wrong with
my diskless cluster nodes, and have become utterly frustrated.

I have a cluster of diskless nodes: cheap i386 machines (actually AMD
XP2600s) with a reasonable amount of memory, onboard VGA and a VIA Rhine
(vr0) NIC, plus an extra Realtek 8169S (re0) PCI NIC. I would very much
like to be able to netboot a stable NetBSD release (production system; no
time to chase current), *and* be able to use the gigabit connectivity.
Currently the nodes work under 1.6.2, with a lot of ifs and buts (NFS and
vr0 do not seem to like each other), but 1.6.2 does not support the
Realtek re0, so I want to migrate the nodes to a 2.x series kernel.
The front end to this cluster is an NFS server which currently runs
2.0.2. It exports individual root directories to the diskless nodes.

Replacing the 1.6.2 installation on the diskless cluster nodes with
fresh 2.0.2 installations (from a vanilla 2.0.2 release distribution)
results in the kernel reporting "exec /sbin/init: error 8", and
eventually in a kernel panic: "panic: no init".

I have confirmed (with tcpdump) that /sbin/init is read, and that the
kernel then looks for oinit and init.bak as well (which, of course, it
does not find unless I install them).

I have confirmed that /sbin/init matches the 2.0.2 kernel, that the
client can access both files and that their access rights are OK. As a
matter of fact, they are binary identical to the ones that the server
is currently running (and I did a fresh install from a 2.0.2 release
CD burnt from an "official" ISO image).
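
A "binary identical" check of this kind can be scripted by comparing
digests. The following is only a minimal Python sketch, not the procedure
actually used above; the function names and the example paths in the
comment are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Hypothetical usage on the NFS server, comparing the server's own init
# with the copy in one node's exported root:
#   same_contents("/sbin/init", "/export/node1/root/sbin/init")
```

Matching digests only show that the file on the server is intact; they
say nothing about what the client actually receives over NFS.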

Error 8 is ENOEXEC ("exec format error"), so the kernel apparently does
not recognize the format of the executable.
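
The numeric code can be looked up rather than guessed. A small Python
sketch follows; it reports the errno table of the machine it runs on,
which is assumed (not guaranteed) to agree with the NetBSD kernel for
this low-numbered code.

```python
import errno
import os

code = 8  # the number from "exec /sbin/init: error 8"

# errno.errorcode maps numeric values to symbolic names on this host.
name = errno.errorcode.get(code, "unknown")
message = os.strerror(code)

print(f"errno {code}: {name} ({message})")
```

On BSD-derived and Linux hosts this reports ENOEXEC, the error returned
when a file is not in a format the kernel knows how to execute.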

If I replace the 2.0.2 /sbin/init with the 1.6.2 /sbin/init, that file
/is/ properly executed (COMPAT_16), after which the system
tries to drop into single-user mode, but can't because /bin/sh is
not in the correct executable format. Only if I downgrade all
bin & sbin binaries to 1.6.2 does the system (more or less) boot
again.
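
Since the complaint is about executable format, one sanity check is to
look at the ELF header of the binaries in question. The sketch below is
a hedged illustration in Python: `describe_elf` is a hypothetical helper,
the lookup tables cover only a few common values, and it can only rule
out gross problems (wrong architecture, truncated or non-ELF file), not
version-specific differences between releases.

```python
import struct

ELF_CLASS = {1: "ELF32", 2: "ELF64"}
ELF_DATA = {1: "little-endian", 2: "big-endian"}
ELF_TYPE = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core"}
ELF_MACHINE = {3: "Intel 80386", 62: "x86-64"}  # small subset only


def describe_elf(header):
    """Summarise the first 20 bytes of a file, or return None if not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    ei_class, ei_data = header[4], header[5]
    if ei_data not in ELF_DATA:
        return None
    # e_type and e_machine follow the 16-byte e_ident array.
    endian = "<" if ei_data == 1 else ">"
    e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return {
        "class": ELF_CLASS.get(ei_class, "unknown"),
        "data": ELF_DATA[ei_data],
        "type": ELF_TYPE.get(e_type, "unknown"),
        "machine": ELF_MACHINE.get(e_machine, "unknown (%d)" % e_machine),
    }


# Hypothetical usage:
#   with open("/sbin/init", "rb") as f:
#       print(describe_elf(f.read(20)))
```

For an i386 binary one would expect ELF32, little-endian and
"Intel 80386"; anything else, or None, points at a damaged or
mismatched file.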

What am I doing wrong? I have fiddled with different NFS
block sizes, with different network cards (3Com, Intel, Tulip,
VIA, Realtek, DEC), with different boot methods (netboot floppy
vs. PXE boot ROM), etc., and I am running out of clues...
Does anyone have a running system with diskless 2.0.2 i386 nodes
and a 2.0.2 i386 NFS server?

With kind regards,

Theo Borm