Subject: Re: netbsd-2-0/200403310000/sparc GENERIC.MP
To: None <tron@zhadum.de>
From: Havard Eidnes <he@netbsd.org>
List: tech-smp
Date: 04/11/2004 17:41:33
Hi,

I know that you're not responsible for this issue, Matthias, so don't
take this personally.  I've CC'ed Darrin who I think has been involved
in the relevant changes...

> In article <20040411010256.2D27E6620@void.crufty.net>,
> 	"Simon J. Gerraty" <sjg@crufty.net> writes:
> > Starting file system checks:
> > /dev/rsd0f: BAD SUPER BLOCK: VALUES IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE
> >
> > /dev/rsd0f: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
> > ffs: /dev/rsd0a (/altroot): EXITED WITH SIGNAL 11
> > /dev/rsd0g: BAD SUPER BLOCK: VALUES IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE
> >
> > /dev/rsd0g: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
> > THE FOLLOWING FILE SYSTEMS HAD AN UNEXPECTED INCONSISTENCY:
> >         ffs: /dev/rsd0f (/var), ffs: /dev/rsd0a (/altroot), ffs: /dev/rsd0g (/l0)
> > Automatic file system check failed; help!
> > Apr 11 00:44:51 init: /bin/sh on /etc/rc terminated abnormally, going to single user mode
> > Enter pathname of shell or RETURN for /bin/sh:
> >
> > If I reboot off sd0 (1.6.2), it says the filesystems are fine.
>
> From "src/UPDATING":
>
> 20040109:
>         Compatibility for old ffs superblock layouts has been
>         added, and the restrictive fsck checks have been reenabled
>         when using those layouts.  If you have been using -current
>         since 20030402, you may find that fsck again signals fatal
>         superblock mismatches.  To repair, make sure you have
>         an updated fsck_ffs and then you can use fsck_ffs -b 16 -c 4
>         to complete the filesystem upgrade.  A message has
>         been added to the kernel which should detect this problem.
>         See the following discussion for more information:
>         http://mail-index.NetBSD.org/current-users/2004/01/11/0022.html

Yes, yes, but I understood that Simon had been running 1.6.2 (i.e. not
a -current kernel in the stated time interval) up until the point
where he booted a 2.0_BETA kernel, or did I guess wrong there?  If I
guessed right, this behaviour would be exceedingly bad, as it would
create serious operational problems for people who do not have remote
console access and who go about upgrading their remote systems the
common manual way (kernel first, then user-land).

Please, could someone explain why the kernel or fsck_ffs can't
automatically fix this problem, instead of insisting on manual user
interaction on the console to bring the system back to life, especially
if the "file system is clean" flag is found to be set, since that would
indicate that the file system was cleanly brought down?

Even if the guess above is wrong, and a -current kernel from the
stated interval has been used at some time or other on Simon's system,
I'd say that the administrator impact of this problem is pretty
severe, and I think that a more administrator-friendly way of dealing
with the issue would have been preferable.

If upgrading from 1.6.2 to 2.0_BETA this way causes this problem, it
would be my personal assessment that this behaviour is a show-stopper
for shipping 2.0.

Regards,

- Håvard