Subject: Re: Watchdog timer support
To: None <thorpej@zembu.com>
From: Chris G. Demetriou <cgd@sibyte.com>
List: tech-userlevel
Date: 11/13/2000 16:22:25
thorpej@zembu.com (Jason R Thorpe) writes:
> See the wdogctl(8) manual page for more information.

Just wondering, were these discussed on tech-kern or tech-userlevel at
all?  I didn't notice any discussion, and no searches of our list
archive turned up anything.  (I tried a few things, and will admit
both that my queries weren't horribly extensive and that, as far as I
can tell, the search is slightly broken. 8-S)

I'm wondering, because this implementation seems to be quite different
from what I'd call a "good" implementation, at least in terms of the
"user" interface.  It would have benefited from discussion...


I've looked briefly at the manual page and the code, and, in at least
some ways, this interface seems Wrong.  What you "Really Want" in my
opinion is something more like:

	* configure kernel to kick hw watchdog every (small amount of
	  time).  (Fraction of a second, to multiple seconds if not
	  running with next feature enabled.)

	* optionally configure kernel to expect kick from userland
	  every (larger amount of time).  (multiple seconds, or
	  whatever you'd like.)

That way you've got layering.  If your kernel is Really Dead for at
least (small amount of time), the watchdog barks.  Similarly, if your
userland kicker doesn't kick within the given time, same happens.
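
As a rough illustration of that layering, here is a toy model in
Python.  It is only a sketch: the class and method names are made up
for this message and are not the actual driver interface, and it
assumes one particular way of wiring the two layers together (the
kernel simply stops kicking the hardware once userland is late).

```python
class LayeredWatchdog:
    """Toy model of the two-layer scheme; all names are hypothetical."""

    def __init__(self, hw_timeout, user_timeout=None):
        self.hw_timeout = hw_timeout      # hardware fires if not kicked for this long
        self.user_timeout = user_timeout  # None means the userland layer is off
        self.last_hw_kick = 0.0
        self.last_user_kick = 0.0

    def user_kick(self, now):
        """Userland reports that it is still alive."""
        self.last_user_kick = now

    def kernel_tick(self, now):
        """Periodic kernel work: kick the hardware unless userland is late."""
        userland_ok = (
            self.user_timeout is None
            or now - self.last_user_kick <= self.user_timeout
        )
        if userland_ok:
            self.last_hw_kick = now

    def barked(self, now):
        """True if the hardware would have fired by time `now`."""
        return now - self.last_hw_kick > self.hw_timeout
```

If kernel_tick() stops being called (kernel Really Dead), barked()
becomes true after hw_timeout.  If user_kick() stops being called,
kernel_tick() stops kicking the hardware once user_timeout has passed,
and the hardware fires shortly afterwards.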

You can implement just the latter half in software only, as a partial
measure, if the system doesn't have any watchdog hardware.

The simpler the kicking, the better -- opening the file descriptor,
writing a byte, etc., make it _really_ easy to use from shell
scripts, but in the extreme running a program could be OK.
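
To show how small the userland side could be, here is a sketch of a
kick done by opening a file and writing one byte.  It assumes a
device node that treats any write as a kick; no such node exists in
the current implementation as far as I can tell, and the path in the
comment is purely hypothetical.

```python
import os


def kick(path):
    """Kick the watchdog by opening `path` and writing a single byte.

    Assumes (hypothetically) that the device counts any write as a kick.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, b"\0")
    finally:
        os.close(fd)


# Example call, with a made-up device path:
#     kick("/dev/wdog-kick")
```

With an interface like this, the shell equivalent would be a single
redirected echo to the same (hypothetical) device node.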

Features which can then be added on top of that:

	* the ability to have the kicker program automatically do
	kicking from userland periodically, as wdogctl now seems
	to do.  That's a convenience, and could be emulated with
	a trivial shell script!

	* the ability to have a two-stage timeout, where a script is
	run or other userland action taken after a first userland
	timeout, then the system rebooted after a second.

	* the ability to tell the driver to never allow the kernel
	watchdog to be disabled.  (or some variation on that theme;
	our reboot process needs ... help to work in that kind of
	world.)
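
A sketch of how these add-ons might look, again as a toy model in
Python rather than real driver code.  Every name here (UserWatchdog,
warn_after, reboot_after, allow_disable, run_kicker) is invented for
illustration, and the "warn" state merely stands in for "run a script
or take some other userland action".

```python
import time


class UserWatchdog:
    """Toy two-stage userland watchdog; every name here is hypothetical."""

    def __init__(self, warn_after, reboot_after, allow_disable=True):
        if not warn_after < reboot_after:
            raise ValueError("warn_after must be shorter than reboot_after")
        self.warn_after = warn_after        # first timeout: userland action
        self.reboot_after = reboot_after    # second timeout: reboot
        self.allow_disable = allow_disable  # False: watchdog is locked on
        self.enabled = True
        self.last_kick = 0.0

    def kick(self, now):
        """Record a kick from userland at time `now`."""
        self.last_kick = now

    def disable(self):
        """Turn the watchdog off, unless it has been locked on."""
        if not self.allow_disable:
            raise PermissionError("watchdog may not be disabled")
        self.enabled = False

    def state(self, now):
        """Return "disabled", "ok", "warn" or "reboot" for time `now`."""
        if not self.enabled:
            return "disabled"
        idle = now - self.last_kick
        if idle >= self.reboot_after:
            return "reboot"
        if idle >= self.warn_after:
            return "warn"
        return "ok"


def run_kicker(kick, interval, rounds, sleep=time.sleep):
    """Call kick() `rounds` times, sleeping `interval` seconds in between."""
    for i in range(rounds):
        kick()
        if i + 1 < rounds:
            sleep(interval)
```

run_kicker() is the convenience loop from the first item; a real
kicker would presumably loop until told to stop rather than for a
fixed number of rounds.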

Unfortunately, having given the code and manual page a quick look,
this fairly simple model seems _impossible_ to create with the current
implementation.  That's _really_ unfortunate.

I think it'd be unfortunate to "standardize" this ... in my opinion
deficient interface as-is.


(BTW, I've come to the conclusion about what "you Really Want" from
watching several coworkers implement this type of thing several times
over the years, and hearing their rationales for why they do it...
8-)



chris