Subject: Re: fsck
To: Jon Ribbens <jon@oaktree.co.uk>
From: Jim Reid <jim@mpn.cp.philips.com>
List: netbsd-help
Date: 04/07/1997 17:41:23
>>>>> "Jon" == Jon Ribbens <jon@oaktree.co.uk> writes:

    Jon> Can anyone recommend a good instruction manual for fsck? The
    Jon> manpage is, as usual, not very helpful. fsck itself, while
    Jon> running, has a tendency to report incomprehensible errors,
    Jon> followed by asking unanswerable questions. ("REMOVE y/n". So,
    Jon> what happens if I say "yes"? What happens if I say "no"? Do I
    Jon> *need* to say "yes" in order for the disk to be fixed? What
    Jon> bad things might happen with either choice? Why is it even
    Jon> asking me - what problem has it found?)

A comprehensive manual can be found in /usr/share/doc/smm/03.fsck.
Or in any decent book on UNIX system administration.

Generally speaking, the best idea is to always answer yes to fsck's
questions. This will usually recover as much "lost" data as possible
and put the filesystem back together again. [There are a few, rare
occasions where this is not so. Even then, just answer "yes" and then
reach for your backup tapes.] Answering "no" to fsck's questions will
usually mean the filesystem remains corrupted. The damage fsck found -
and could have fixed - goes unrepaired. If you continue to use the
filesystem, the corruption and data loss will get worse. The -y option
answers yes to every question, which saves tedium and typing when there
is a lot of fixing to be done. The -n option answers no to everything
and changes nothing on disk, so it gives the system administrator an
idea of the extent of the damage before any repair is attempted, just in
case something important could be lost.
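
As a rough sketch of that "look first with -n, then repair with -y"
sequence, here is a small Python helper that only builds the two command
lines and prints them. The helper name fsck_command and the device
/dev/rsd0e are made up for illustration; nothing is run against a real
disk, and the filesystem should be unmounted before you run the real
commands yourself.

```python
def fsck_command(device, mode):
    """Build an fsck command line as a list of arguments.

    mode is 'survey' (-n: answer no to everything, change nothing)
    or 'repair' (-y: answer yes to everything).
    """
    flags = {"survey": "-n", "repair": "-y"}
    if mode not in flags:
        raise ValueError("mode must be 'survey' or 'repair'")
    return ["fsck", flags[mode], device]


if __name__ == "__main__":
    device = "/dev/rsd0e"  # hypothetical raw partition, substitute your own
    # Survey first so you can note which files are affected, then repair.
    for mode in ("survey", "repair"):
        print(" ".join(fsck_command(device, mode)))
```

Keeping the output of the -n run gives you the list of damaged files to
check against your backups once the -y run has finished.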

The general idea is that when the filesystem is in a bad way, fsck
wants a human being to know about it. For example, he/she can take
note of which files were found to be corrupted so they can be restored
from backup once fsck has done its stuff.

This is an area where experience and judgement help. For example, if
there are lots of errors, it usually means the disk is bust or on its
way out: probably a head crash. In this case, it would probably be
better to throw the disk away and rebuild the filesystem(s) from
backup. When fsck has done a bit of work, it's usually a good idea to
re-run it until it finds nothing more to fix. The first pass might have
done enough to make the filesystem usable again - say by rebuilding the
super block and cylinder group summary blocks - but not enough to have
picked up all the inconsistencies.

BTW, I have to differ with your comments about "incomprehensible
errors" from fsck. The reports and questions it generates are clear
enough to me, though admittedly it does help if you know how a UNIX
filesystem, and the ffs in particular, is organised on disk.

    Jon> We just got a whole load of fsck errors (requiring a manual
    Jon> fsck) from a NetBSD 1.2 machine with nobody at all logged in
    Jon> which was rebooted unexpectedly. Surely this must indicate a
    Jon> bug in either ffs or fsck? (I thought the point was fsck was
    Jon> supposed to be able to fix this sort of thing without help.)

It depends on what was wrong. If fsck thinks the corruption is serious
enough, it will insist on human intervention (or at least making sure
a human is informed that Something Serious has happened). In that
case, fsck is behaving reasonably. That the UNIX filesystem doesn't
handle abrupt shutdowns well is not a bug, it's a feature... :-)

    Jon> We ran fsck, it asked us whether we wanted to remove a whole
    Jon> load of files, we said yes on the basis of flipping a
    Jon> coin. We ran it again after it had completed, and it then
    Jon> asked us if we wanted to reconnect all the same files again,
    Jon> so the net effect was to move them to
    Jon> /lost+found. Excitement.

Sounds like you don't know what's going on. You should have let fsck
recover all the files it could and then gone through lost+found to
rename them to their original places. fsck can't do the latter part of
that because the directory which held these files - i.e. their names -
got mangled somehow. So when fsck finds files/inodes that have data
associated with them but no name, it gives the system administrator
the option of recovering the files to lost+found or else deleting
them. The latter option is for files and directories that the system
administrator - not fsck! - decides don't matter. How could fsck tell
if the mangled file(s) hadn't been backed up or if they were so
important that anything that could be recovered from them was better
than nothing? In one case, I was able to retrieve a file that fsck
would have deleted because it had 1 corrupt block. For the user,
copying this file and regenerating the corrupt block was quicker and
preferable to waiting for a recovery of the previous day's file from
backup.
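
Going through lost+found by hand is easier with an inventory of what
fsck put there. The following is only a sketch in Python, and the
function name describe_lost_found is my own invention. It lists each
entry with its type and size, which is often enough to guess what a
nameless file used to be, since the recovered entries are typically
named after their inode numbers rather than their old names.

```python
import os
import stat


def describe_lost_found(directory):
    """Return a list of (name, kind, size_in_bytes), sorted by name."""
    entries = []
    for name in sorted(os.listdir(directory)):
        # lstat so that a recovered symlink is reported, not followed.
        info = os.lstat(os.path.join(directory, name))
        if stat.S_ISDIR(info.st_mode):
            kind = "directory"
        elif stat.S_ISREG(info.st_mode):
            kind = "file"
        else:
            kind = "other"
        entries.append((name, kind, info.st_size))
    return entries
```

Call it with the lost+found directory at the top of the repaired
filesystem, then use file(1) or a pager on the regular files to work out
where each one belongs before renaming it back into place.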