Subject: Disturbing experience
To: None <port-sparc@NetBSD.ORG>
From: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
List: port-sparc
Date: 04/20/1995 17:36:16

I just had a disturbing experience with NetBSD/sparc.  As some of you
may recall, I have a SPARC that I've been running under 1.0 for some
time.  I finally got around to working on the great move to -current.
I pulled over the binary snapshot from /pub/NetBSD/arch/sparc/snapshot
on ftp.netbsd.org, unpacked it onto another disk, wrote a small fstab,
ran "MAKEDEV all", booted SunOS briefly to run installboot, and then
tried to boot from it.  It came up apparently fine, loading the new
kernel, finding the correct root disk (thanks to careful choice of SCSI
ID settings and the 1.0 kernel being configured with sd0 nailed down as
target 3), and prompted me with the usual "Enter pathname of shell or
RETURN for sh" bit.  I typed RETURN and got a # prompt.

I then typed "fsck /dev/rsd0a" and got "Segmentation fault".  I mounted
/usr by hand (still with everything read-only) and started poking
around.  "file /sbin/fsck" reported "data".  I tried running it again
and got error messages from the shell, indicating it was trying to
execute the thing as a shell script.  I then mounted the other disk
(read-only) by hand and re-extracted the sbin stuff into another
directory.  I then ran the newly-extracted fsck, and it worked just
fine.  I then tried the old one, and it still failed.  I then compared
them with cmp and there was no output, indicating they were identical.
So I tried to flush any caches by running "find /usr -type f -print |
xargs cat > /dev/null", letting it run for a little while, and then
killing it.  That didn't help a bit.  I then "mv"ed /sbin/fsck to
another name, "cp"ed the newly-extracted fsck to /sbin/fsck, and tried
running them both.
The old one, under its new name, failed as before; the new one, newly
named /sbin/fsck, worked.

I then rebooted, tested them both, and found they work fine now.
Both of them.

This sounds to me like something vaguely akin to the lossage reported
where running tail -f on a file could break something else - vi, I
think - due to something funny with mmap().
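
If the mmap() suspicion is right, one cheap check should this recur
would be to see whether the same file looks different through read()
than through a fresh mapping.  What follows is only a rough sketch
under that assumption, written in Python for brevity rather than taken
from anything in the snapshot; first_mismatch is a made-up helper name,
and whether such a check would have caught this particular failure is a
guess.

```python
import mmap
import os


def first_mismatch(path):
    """Return the first offset where read() and mmap() disagree, or None.

    Hypothetical diagnostic helper: it reads the whole file with read(),
    maps the same file read-only, and compares the two views byte by byte.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return None  # an empty file cannot be mapped; nothing to compare
        via_read = f.read()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            via_mmap = m[:]  # copy the mapped bytes out before unmapping

    if via_read == via_mmap:
        return None
    for offset, (a, b) in enumerate(zip(via_read, via_mmap)):
        if a != b:
            return offset
    # One view is a strict prefix of the other.
    return min(len(via_read), len(via_mmap))
```

Calling first_mismatch("/sbin/fsck") would return None when the two
views agree and a byte offset when they do not; a non-None result on a
file that cmp calls identical to a good copy would point fairly
strongly at the page cache rather than at the disk.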

This is not really a bug report; the problem does not seem to be
reproducible enough for that.  I'm kicking myself for not having
panicked the machine and taken a kernel coredump, but if it happens
again I definitely will.

I just now (20 April 17:30) fetched a fresh copy of sbin.tar.gz from
ftp.netbsd.org and compared it against the one I extracted; they are
identical.  (This does not surprise me in view of the symptoms.)

Comments, anyone?  The snapshot kernel I used was the
"netbsd.root_on_sd0" one.

					der Mouse

			    mouse@collatz.mcrcim.mcgill.edu