Current-Users archive

Re: HEADS UP: panic behaviour changed



On Sun, 1 Feb 2009, Robert Elz wrote:

> As I have said before, I don't really care which way the default
> for ddb.onpanic is set, but ...
> 
>   | 1) Not everyone runs X.  Most servers use a serial console.
> 
> Forget X when discussing this issue, X isn't an argument for anything,
> one way or the other.   By the time X gets anywhere near the system,
> sysctl.conf has run, and the local system owner can trivially decide
> which behaviour works for them and insert the relevant line into
> sysctl.conf.

If you check the original email, X was the justification for making this 
change.  It's a bogus justification, but I don't think we can ignore it.

>   | 3) There is a period of time between loading the kernel and when the rc 
>   | scripts run where you can't tweak the ddb_onpanic value.
> 
> Yes, this is why the kernel needs a default value, and why we can
> sensibly discuss what that default value should be.  By itself it
> doesn't say anything about which particular value should be the
> default however.

The default setting of this value only matters between the time the power 
is turned on and the time the rc script that can change the sysctl value 
is run.  As soon as that happens it is no longer default behavior but 
whatever the sysadmin managing the system desires.

>   | 4) If the machine panics early, say during device configuration due to 
>   | broken hardware, you don't really want it to attempt to reboot, since 
>   | that will result in an infinite reboot loop.
> 
> Yes, perhaps - it depends upon the cause of the panic, but this
> can certainly happen.   But I'm not sure this is any worse (or
> better) than the infinite loop the kernel is sitting in waiting
> for a reply to the db> prompt.   Both require user interaction,
> and nothing proceeds until a user has done something to alter the
> state of the system.

Well, no.  If the system drops to the db> prompt, then it requires user 
intervention.  Presumably, all the information about the cause of the 
panic is also sitting there on the screen and has not scrolled off so the 
admin can make an intelligent decision about what the corrective action 
should be.  

If, on the other hand, the system is left to attempt to dump core and then 
try an automatic reboot, you have a lot of potentially destructive 
operations that could happen.  

Each time the system tries to reboot there will be a set of resets and 
possibly power-cycles.  Excessive resets or power-cycling can potentially 
damage integrated circuits through thermal cycling or disks through 
spin-up/spin-down cycles.  
 
> As an alternative, if the system panics due to a corrupted filesystem
> that was incorrectly marked clean, then ddb is of no practical use
> and a reboot will detect the unclean filesys and fsck (and either
> fix, or at least tell the user what the problem is).

We are talking very early in the boot process.  I have never seen the case 
where a filesystem is so corrupt that fsck is able to clean it but the 
kernel still takes a panic after fsck runs.  It used to be that if fsck 
fixed certain problems in the root filesystem the rc scripts would 
automatically reboot the system.  I assume that's still the case and a 
reboot won't stop at the db> prompt.

OTOH, if you keep running fsck only part way on the filesystem, you may 
end up doing irreparable damage to it.

And if the system manages to mount the filesystem and run savecore each 
time before it gets to the panic, you end up filling up the root 
filesystem with a series of useless coredumps.

Finally, if the system is stuck in a panic loop, how do you diagnose the 
problem?  The system boots, prints a panic message, and then it resets 
itself and starts printing the firmware messages, which cause the panic 
message to scroll off the screen.  I suppose if you're lucky and you can 
convince the machine to get into single-user mode, you can manually set 
ddb.onpanic=1 and then switch to multi-user mode to continue diagnosis.
But if you can't get to the single-user shell you are SOL and probably 
won't be able to figure out what's causing the problem let alone how to 
fix it.
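
For reference, a minimal sketch of that override, assuming the sysctl 
is named ddb.onpanic as in the quoted text at the top of this message 
and that the kernel was built with ddb.  From a single-user shell the 
equivalent one-off command would be "sysctl -w ddb.onpanic=1"; to make 
it persistent, a line like this in /etc/sysctl.conf should do:

```
# /etc/sysctl.conf
# Enter ddb on panic instead of dumping core and rebooting.
ddb.onpanic=1
```

Either way this only helps once the system gets far enough to run a 
shell or the rc scripts; it does nothing for a panic during device 
configuration.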

Eduardo

