port-mac68k: Re: Booting read-only? (vi still hosed)

Subject: Re: Booting read-only? (vi still hosed)
To: None <jope@n2h2.com>
From: Bill Studenmund <wrstuden@loki.stanford.edu>
List: port-mac68k
Date: 08/03/1998 22:16:13
[Note: I know my mailer is acting like it's 1997. It has to do with
access codes and the fact this machine's not running NetBSD :-(  ]

> On Mon, 3 Aug 1998, Colin Wood wrote:
> >> I still get a "bad system call" error from vi and then a core dump, 
> >> and likewise with more. 
> > 
> > sounds like you missed something in your reinstall (or else something
> > is a bit corrupted).
> 
> Well, the fact that I've been installing one set of binaries over another,
> rather than wiping out the old ones first, makes me somewhat wary.
> Now that I'm read-write and know rm works, I could do an "rm -rf /" to
> clear everything out first, yes?  Or is simply formatting again with 
> Mkfs the suggested method of doing this?  

What would you do after the rm? You've now deleted ALL the programs which
could install anything else...

> Corruption seems a little too convenient an answer, but if all else fails
> I'll download the base package again to see if that helps.

If you still have the base.tgz file around, cpin it to NetBSD (assuming
you have the space) and run cksum, the checksummer, on it. If you get the
same checksum as in the distribution (we're talking 1.3.2, aren't we?),
then you downloaded fine.
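
A minimal sketch of that comparison in Python, for anyone who wants to see what cksum is actually computing. It assumes the standard POSIX cksum algorithm (CRC with polynomial 0x04C11DB7, with the byte count folded in at the end). The names posix_cksum and download_ok are made up for illustration, and the expected values have to come from the distribution's own checksum list; none are given here.

```python
POLY = 0x04C11DB7  # generator polynomial specified for POSIX cksum


def _feed(crc, byte):
    """Fold one byte into the running CRC, most significant bit first."""
    crc ^= byte << 24
    for _ in range(8):
        if crc & 0x80000000:
            crc = ((crc << 1) ^ POLY) & 0xFFFFFFFF
        else:
            crc = (crc << 1) & 0xFFFFFFFF
    return crc


def posix_cksum(data):
    """Return the CRC that cksum prints as its first number."""
    crc = 0
    for byte in data:
        crc = _feed(crc, byte)
    # cksum also feeds in the length, least significant byte first.
    length = len(data)
    while length:
        crc = _feed(crc, length & 0xFF)
        length >>= 8
    return crc ^ 0xFFFFFFFF


def download_ok(data, expected_crc, expected_size):
    """True if both numbers cksum would print match the published ones."""
    return len(data) == expected_size and posix_cksum(data) == expected_crc
```

With the file read in as bytes (for example open("base.tgz", "rb").read()), download_ok returns True only when both the CRC and the size agree with the published values.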

As David mentioned, you might have library problems. If you didn't
delete /usr/lib (or a directory which houses it...) before
re-installing, then all the newer broken shared libraries are still
there. They will have larger version numbers than the ones from a distribution
(either 1.3, 1.3.1, or 1.3.2), and so will be preferentially used.

To fix this, I'd suggest booting single-user, mounting /usr r/w (root
r/w if you have a combined root&usr), and cd'ing to /usr/lib. Do an ls
-l |more and look at the files. The libraries all are named libXXXXYY
where XXXX is a word that makes the library unique (like in libc,
libutil, libposix, etc.), and YY is stuff that says what this file
does. The YY part will start with either a period or an underscore.
The -l in the ls will maybe help you with dates.

Before I go too much farther, here's what ls libc* shows on my i386:

libc.a              libcom_err.a        libcompat_p.a       libcurses.a
libc.so.12.20       libcom_err.so.2.0   libcrypt.a          libcurses.so.2.2
libc.so.12.27       libcom_err_p.a      libcrypt.so.0.0     libcurses_p.a
libc_p.a            libcom_err_pic.a    libcrypt_p.a        libcurses_pic.a
libc_pic.a          libcompat.a         libcrypt_pic.a

libc, libcom_err, libcompat, libcrypt, and libcurses are showing up.
libc.a is the libc used when making a statically linked program - a program
which will not need other pieces (shared libraries) to run. libc.so.12.20
and libc.so.12.27 are two versions of the libc shared library. libc_p.a
is used when you want to do profiling of your program, and I've forgotten
what libc_pic.a is for.
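
The naming scheme can be sketched as a small Python classifier. This is an illustration only: the function name classify and the kind labels are made up, and the "pic" description in the comments (an archive of position-independent objects, the kind used to build the shared library) is my understanding rather than something the listing itself tells you.

```python
import re

# Name part (libXXXX) followed by the "YY" part, which starts with
# either an underscore or a period.
LIBFILE = re.compile(r"^(lib.+?)(_p\.a|_pic\.a|\.a|\.so\.(\d+)\.(\d+))$")


def classify(filename):
    """Return (name, kind, version); version is None unless kind is 'shared'."""
    m = LIBFILE.match(filename)
    if m is None:
        return None
    name, suffix = m.group(1), m.group(2)
    if suffix == "_p.a":
        return (name, "profiling", None)   # linked in when profiling
    if suffix == "_pic.a":
        return (name, "pic", None)         # position-independent objects (assumed)
    if suffix == ".a":
        return (name, "static", None)      # used for statically linked programs
    # Otherwise it is a shared library: libXXXX.so.MAJOR.MINOR
    return (name, "shared", (int(m.group(3)), int(m.group(4))))
```

Note that the name itself may contain an underscore (libcom_err), which is why the pattern matches the suffix from the end rather than splitting at the first underscore.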

The important thing is that I have two libc.so.12's. port-mac68k is
using what are called a.out shared libraries. They have major and minor
numbers. Each program gets linked with a particular version of the
library. If the library is linked as a shared library, then it's not
actually stuffed in the program, but made available when the program's
run. This saves space. Think about it, libc is used by almost every
program, and is over 400k. Even if all the programs only used 1/4 of it,
it's a lot easier to have one 400k file rather than hundreds of 100k
chunks of the file.
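
To put rough numbers on that, here is the arithmetic as a tiny Python sketch. The 400k size and the 1/4 fraction are the figures above; the program count of 300 in the usage note is purely an assumption standing in for "hundreds", and per-program overhead is ignored.

```python
LIBC_KB = 400          # approximate size of libc quoted above
FRACTION_USED = 0.25   # each program pulls in about a quarter of it


def static_cost_kb(programs):
    """Disk used when every program carries its own chunk of libc."""
    return int(programs * LIBC_KB * FRACTION_USED)


def shared_cost_kb(programs):
    """Disk used when all programs share one copy of the library."""
    return LIBC_KB
```

For 300 programs that is 30000k of duplicated chunks against a single 400k shared file.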

Also, and here's where things get VERY relevant to you: if there's
a NEWER version of libc around with the same major number (12 here) but
a larger minor number, then it gets used instead of the original one.

So if for instance the "ok" distribution you re-installed used
libc.so.12.15, but the newer snapshot (which causes problems) used
libc.so.12.28, even though all your programs were built for .12.15,
they will use the .12.28 version, which doesn't work with your kernel.
Note: the .15 and .28 are made up, your values will probably differ.
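
The selection rule can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above, not the actual run-time linker code; pick_library is a hypothetical name, and the .15 and .28 in the test values are the same made-up numbers.

```python
import re

SHLIB = re.compile(r"^(lib.+)\.so\.(\d+)\.(\d+)$")


def pick_library(name, major, available):
    """Pick the file with the same name and major number and the highest minor."""
    best = None
    for filename in available:
        m = SHLIB.match(filename)
        if m is None:
            continue  # not a shared library (e.g. libc.a)
        if m.group(1) != name or int(m.group(2)) != major:
            continue  # a different library, or an incompatible major number
        minor = int(m.group(3))  # compare numerically, so .9 sorts below .15
        if best is None or minor > best[0]:
            best = (minor, filename)
    return best[1] if best else None
```

A program built against libc.so.12.15 asks for "libc" with major 12, so with both files present the model hands back libc.so.12.28.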

So find these newer (higher-numbered) files, and delete them. Check out all
the libraries, not just libc. Things should then work.
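
One way to hunt for them, sketched in Python under the assumption that any shared library with more than one minor number for the same major deserves a look. The function name newer_duplicates is made up, and it only reports candidates; deciding which ones really came from the snapshot, and deleting them, is still up to you.

```python
import re
from collections import defaultdict

SHLIB = re.compile(r"^(lib.+)\.so\.(\d+)\.(\d+)$")


def newer_duplicates(filenames):
    """List shared libraries that are not the lowest minor for their name and major."""
    groups = defaultdict(list)
    for filename in filenames:
        m = SHLIB.match(filename)
        if m:
            key = (m.group(1), int(m.group(2)))          # (name, major)
            groups[key].append((int(m.group(3)), filename))
    suspects = []
    for versions in groups.values():
        versions.sort()                                   # lowest minor first
        suspects.extend(name for _minor, name in versions[1:])
    return sorted(suspects)
```

Feeding it os.listdir("/usr/lib") gives the list to compare against the distribution. Run on the i386 listing above, it would flag libc.so.12.27 only.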

> > you got me on this one.  i don't have any ext2fs partitions, and i'm not
> > really running -current at the moment (although i guess that should change
> > sometime in the near future).  when is this mount occurring, before
> > running /etc/rc or during this process?  does you /etc/fstab refer to your
> > / partition as 'ffs' or 'ext2fs'?
> 
> But I don't have any ext2fs partitions either, not since I nuked them in
> order to remove that as a possible cause of the problem.  I'll have to
> note the exact message when I get home, but it was right after where
> the root partition is identified.  Should be the standard /etc/fstab
> created by doing "fstab force" in the Installer mini-shell. 

NO! /etc/fstab has nothing to do with mounting the root partition, though
it has everything to do with mounting all the others (well, you might be
able to have the pedantic case where your single-user root device isn't
the same as in its /etc/fstab, and so you get a different root partition
after single user.... But let's not go there now).

/etc/fstab can't be used to find the root partition as it can't be
found until after the root partition's mounted. :-)

If this ext2fs check-w/-failure is happening at the "root on XXXX"
point AND it's actually figuring out that the partition's an
ffs partition, DON'T WORRY ABOUT IT! The kernel's just being chatty.

If it either doesn't figure out that the filesystem on that partition is
ffs, or the ext2fs code chokes when it actually is an ext2fs
filesystem, (set accent=NewYork) _then_ you got problems. Do you gots
problems. (unset accent)

Take care,

Bill