tech-userlevel archive


Re: Small tip proposal for headless systems boot resiliency



    Date:        Mon, 30 Dec 2013 21:27:16 -0500
    From:        John Hawkinson <jhawk%MIT.EDU@localhost>
    Message-ID:  <20131231022716.GK22724%athena.dialup.mit.edu@localhost>

  | It seems to me that perhaps a better solution would be to stop booting
  | and configure a network interface and start an sshd. Now, sometimes
  | configuring a network interface is hard, but for those of us for whom
  | it isn't, this seems a path forward.

Seems like a useful solution for some cases to me.  And probably not even all
that hard (though whether or not it is productive would depend upon how the
sshd is configured, and whether or not at least one non-root user's .ssh/*
files were available.)

  | p.s.: Can we drop tech-embed?

Done.

For what it is worth, for some time now, I have been considering a different
idea to solve a different, but related, problem - I'm not so worried about
systems that I cannot get at, but I have systems that really need to be running
all the time, and need to boot and operate if they possibly can, even if
80% of the resulting system is not there - the rest can wait until I wake up,
or return from wherever, or whatever prevented me from immediately fixing 
things.

My issue is largely with fsck and filesystem repair - most of the time
the fsck -p (and/or log recovery) works fine, and the system reboots fully
by itself, but occasionally, there's some filesystem error that needs
assistance.  If that's on a crucial filesystem (root, /usr, /var, etc.) there's
little that can be done safely, but for most other filesystems the
system will run (with some reduced functionality) just fine - e.g. /home
isn't really needed on a system that is primarily a nameserver, and pkgsrc
and distfiles, and stuff (and filesystems for building updated NetBSDs)
aren't needed at all until they're actually needed (which for me certainly
means after I can fix them).

My basic strategy would be to enhance fstab in some yet to be determined way
to indicate which filesystems are optional, then simply ignore (without
mounting obviously) any optional filesystem that has a problem (or whose
parent filesystem could not be mounted) and only abort the boot if a
non-optional filesystem failed to check & mount.   Since in my case my
non-optional filesystems are (in practice) all remarkably stable (as in
nothing much important normally changes on them - the only significant
changes are the newsyslog updates, and for me, they don't often do anything,
so they are unlikely to have fsck problems) this would allow the system to
get up and running in situations where now it stops and waits for
attention.
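
As a rough sketch of that decision logic (Python purely for illustration):
it assumes a hypothetical "optional" flag per fstab entry, which does not
exist in fstab today, and it assumes the fsck -p result for each filesystem
is already known.  The names Entry, is_under and plan_boot are made up for
this example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    device: str
    mountpoint: str
    optional: bool  # hypothetical "optional" flag from an extended fstab


def is_under(child, parent):
    """True if mount point `child` lies strictly below mount point `parent`."""
    if child == parent:
        return False
    prefix = parent if parent.endswith("/") else parent + "/"
    return child.startswith(prefix)


def plan_boot(entries, check_ok):
    """Decide what to mount, what to skip, and whether to abort the boot.

    entries:  list of Entry
    check_ok: dict mapping mount point -> True if fsck -p succeeded
    Returns (mounted, skipped, abort).
    """
    mounted, skipped = [], []
    # A parent mount point is always shorter than its children, so sorting
    # by length decides every parent before anything mounted below it.
    for e in sorted(entries, key=lambda e: len(e.mountpoint)):
        parent_missing = any(is_under(e.mountpoint, s) for s in skipped)
        if check_ok.get(e.mountpoint, False) and not parent_missing:
            mounted.append(e.mountpoint)
        elif e.optional:
            skipped.append(e.mountpoint)  # ignore it and carry on booting
        else:
            return mounted, skipped, True  # essential filesystem failed
    return mounted, skipped, False
```

With a damaged optional /home, anything mounted below /home is skipped as
well, and the boot continues; a failed non-optional filesystem still aborts.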

I have also considered allowing all the optional filesystem repair and
mounting to happen after the system is running with the essential
filesystems.  That is not useful if the filesystem is clean, or if a simple
log replay is all that is needed, but even fsck -p can be slow, and so delay
booting.  If the filesystem in question is a big beast that isn't really
essential, then whatever needs it could wait to start (if it ever starts)
until a little later, while the primary OS functions are running sooner.

I have a gross hack (or more correctly, had on a system I just replaced,
which has not yet been duplicated on the replacement system - but will be if
I don't come up with something better) where the most problematic filesystem
was simply listed as noauto, and fsck pass 0, in fstab.  A script run
later (could be from @boot in crontab, or backgrounded from rc.local, or
anything like that) then fsck'd it and mounted it if OK, or simply newfs'd and
mounted it if it was damaged.  It gets mounted -o async, so if it is active
when the system crashes, it is expected to be damaged - the damage is fine,
hanging the boot because of it would not be.
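
The shape of that hack, as a minimal sketch rather than the actual script
(which is not reproduced here): try fsck -p, mount async if it passes, and
otherwise newfs and mount.  The function names are made up, and whether
newfs wants the raw or the block device, and which flags suit the
filesystem, are assumptions to adjust locally.

```python
import subprocess


def run_command(cmd):
    """Run a command and return True if it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def recover_scratch(device, mountpoint, runner=run_command):
    """Check and mount a disposable filesystem, recreating it if damaged.

    Returns "mounted", "recreated", or "failed".
    """
    mount = ["mount", "-o", "async", device, mountpoint]
    if runner(["fsck", "-p", device]):
        return "mounted" if runner(mount) else "failed"
    # fsck -p could not repair it.  The contents are expendable here, so
    # make a fresh filesystem rather than hold up anything else.
    # Assumption: plain "newfs <device>" is enough for this filesystem.
    if runner(["newfs", device]) and runner(mount):
        return "recreated"
    return "failed"
```

Run from rc.local in the background or from an @boot crontab entry, a
failure here costs only that one filesystem, never the boot.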

Because of this kind of thing, the "extended fstab" really needs a way to
specify a recovery script that is run if the normal fsck -p recovery indicates
failure.  I could even imagine that being used for essential filesystems.
On a system where uptime is crucial, if an essential filesystem is damaged,
I could see a script that would newfs and restore from a network fileserver
(a locally tailored script of course) recovering the most recent backup, for
situations where "run now" is more important than losing the minimal possible
data.
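
One possible shape for the recovery hook, again only a sketch: it assumes
a hypothetical "recover=/path/to/script" keyword in the fstab options field
(nothing like it exists today), and that the script is handed the device
name and exits 0 once the filesystem is usable.

```python
def parse_options(field):
    """Split an options field such as 'rw,recover=/etc/x' into a dict."""
    opts = {}
    for item in field.split(","):
        key, _, value = item.partition("=")
        opts[key] = value or True  # bare keywords become True
    return opts


def check_with_recovery(device, options_field, runner):
    """Return True if the filesystem is usable after fsck or recovery.

    runner: callable taking a command list, returning True on exit status 0.
    """
    if runner(["fsck", "-p", device]):
        return True
    # Hypothetical "recover=" option naming a locally tailored script.
    script = parse_options(options_field).get("recover")
    if not isinstance(script, str):
        return False  # no recovery script configured
    return runner([script, device])
```

Whether a False result then aborts the boot or just skips the filesystem
would depend on whether that filesystem is marked optional.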

kre


