tech-userlevel archive


Small tip proposal for headless systems boot resiliency



Hello,

Headless systems, or boxes that are far away, are a pain to get back online 
if they get stuck during the boot sequence (embedded/headless boxes usually 
have no IP-KVM or other remote control).


* The point is: with a headless system you really want it back online. Even 
if it needs maintenance, you need ssh or some other remote access to get your 
hands dirty. So don't stop the boot sequence. If the system is damaged beyond 
basic/rescue usability, it makes no difference anyway.

Example: the stop_boot() function in /etc/rc.subr has stopped the boot a few 
times here, and these scripts may call it:
grep stop_boot /etc/rc.d/*
/etc/rc.d/ipfilter:             stop_boot
/etc/rc.d/ipsec:                stop_boot
/etc/rc.d/pf:           stop_boot
/etc/rc.d/pf_boot:              stop_boot
(Note that these are network-related scripts, the ones you may tweak remotely 
and get stuck on for many reasons, so let's tackle this case in particular.)

A failed fsck may call it too, but this can be mitigated with 
fsck_flags="-p -y -P" in your rc.conf.
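
For reference, the corresponding rc.conf line would look like the fragment 
below. As far as I read fsck(8), -p is preen mode, -y answers yes to every 
question, and -P shows a progress meter; please check the man page on your 
release before relying on this.

```shell
# /etc/rc.conf fragment (sketch): let the boot-time fsck repair
# without waiting for an operator.
fsck_flags="-p -y -P"
```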


* Proposal:
Add a "headless" flag to rc.conf, and alter the stop_boot function this way:

/etc/rc.conf:
headless=yes

diff -u /mnt/sd0d/backup/rc.subr /etc/rc.subr 
--- /mnt/sd0d/backup/rc.subr    2013-12-21 23:29:01.000000000 +0100
+++ /etc/rc.subr        2013-12-31 00:10:18.000000000 +0100
@@ -100,14 +100,24 @@
 # If booting directly to multiuser, send SIGTERM to
 # the parent (/etc/rc) to abort the boot.
 # Otherwise just exit.
+# OR
+# If this is a headless system, just send a warning, pause to give a hint,
+# and try resuming the boot sequence.
 #
 stop_boot()
 {
-       if [ "$autoboot" = yes ]; then
-               echo "ERROR: ABORTING BOOT (sending SIGTERM to parent)!"
-               kill -TERM ${RC_PID}
+       if [ "$headless" = yes ] || [ "$headless" = YES ]; then
+               echo "WARNING: BOOT *SHOULD* HAVE BEEN STOPPED"
+               echo "Resuming boot sequence in 30s, the system may be unusable."
+               sleep 30
+               touch /CHECK_BOOT_LOG.warn
+       else
+               if [ "$autoboot" = yes ]; then
+                       echo "ERROR: ABORTING BOOT (sending SIGTERM to parent)!"
+                       kill -TERM ${RC_PID}
+               fi
+               exit 1
        fi
-       exit 1
 }
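
To try the patched logic without rebooting, the function can be exercised in 
a standalone script. The following is only a sketch: HEADLESS_DELAY and 
HEADLESS_MARKER are hypothetical variables introduced here so that the pause 
and the marker path can be overridden; they are not part of rc.subr or 
rc.conf.

```shell
#!/bin/sh
# Standalone sketch of the patched stop_boot() logic, meant to be sourced
# and tried outside of /etc/rc.
# HEADLESS_DELAY and HEADLESS_MARKER are hypothetical knobs for testing only.
: "${HEADLESS_DELAY:=30}"
: "${HEADLESS_MARKER:=/CHECK_BOOT_LOG.warn}"

stop_boot()
{
	case "$headless" in
	yes|YES)
		# Headless: warn, pause, leave a marker file, then return
		# so the caller carries on with the boot sequence.
		echo "WARNING: BOOT *SHOULD* HAVE BEEN STOPPED"
		echo "Resuming boot sequence in ${HEADLESS_DELAY}s, the system may be unusable."
		sleep "$HEADLESS_DELAY"
		touch "$HEADLESS_MARKER"
		;;
	*)
		# Original behaviour: abort /etc/rc when booting straight to
		# multiuser, and exit with an error in every case.
		if [ "$autoboot" = yes ]; then
			echo "ERROR: ABORTING BOOT (sending SIGTERM to parent)!"
			kill -TERM "${RC_PID}"
		fi
		exit 1
		;;
	esac
}
```

Sourcing this file and calling stop_boot with headless=yes, HEADLESS_DELAY=0 
and HEADLESS_MARKER pointing at a scratch file shows the warning and creates 
the marker. With headless unset, call it inside a subshell, since it exits.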
 

Happy end of year with your unstoppable NetBSD systems ;)
Kind regards,
Mat.

