Current-Users archive

Re: HEADS UP: panic behaviour changed



Eduardo Horvath wrote:

> On Sun, 1 Feb 2009, Robert Elz wrote:
> 
> > As I have said before, I don't really care which way the default
> > for ddb.onpanic is set, but ...
> > 
> >   | 1) Not everyone runs X.  Most servers use a serial console.
> > 
> > Forget X when discussing this issue, X isn't an argument for anything,
> > one way or the other.   By the time X gets anywhere near the system,
> > sysctl.conf has run, and the local system owner can trivially decide
> > which behaviour works for them and insert the relevant line into
> > sysctl.conf.
> 
> If you check the original email, X was the justification for making this 
> change.  It's a bogus justification, but I don't think we can ignore it.
> 
> >   | 3) There is a period of time between loading the kernel and when the rc 
> >   | scripts run where you can't tweak the ddb_onpanic value.
> > 
> > Yes, this is why the kernel needs a default value, and why we can
> > sensibly discuss what that default value should be.  By itself it
> > doesn't say anything about which particular value should be the
> > default however.
> 
> The only time the default setting of this value is important is between 
> the time the power is turned on and the time the rc script is run that can 
> change the sysctl value.  As soon as that happens it is no longer default 
> behavior but whatever the sysadmin managing the system desires.
> 
> >   | 4) If the machine panics early, say during device configuration due to 
> >   | broken hardware, you don't really want it to attempt to reboot, since 
> >   | that will result in an infinite reboot loop.
> > 
> > Yes, perhaps - it depends upon the cause of the panic, but this
> > can certainly happen.   But I'm not sure this is any worse (or
> > better) than the infinite loop the kernel is sitting in waiting
> > for a reply to the db> prompt.   Both require user interaction,
> > and nothing proceeds until a user has done something to alter the
> > state of the system.
> 
> Well, no.  If the system drops to the db> prompt, then it requires user 
> intervention.  Presumably, all the information about the cause of the 
> panic is also sitting there on the screen and has not scrolled off so the 
> admin can make an intelligent decision about what the corrective action 
> should be.  
> 
> If, on the other hand, the system is left to attempt to dump core and then 
> try an automatic reboot you have a lot of potentially destructive 
> operations that could happen.  
> 
> Each time the system tries to reboot there will be a set of resets and 
> possibly power-cycles.  Excessive resets or power-cycling can potentially 
> damage integrated circuits through thermal cycling or disks through 
> spin-up/spin-down cycles.  
>  
> > As an alternative, if the system panics due to a corrupted filesystem
> > that was incorrectly marked clean, then ddb is of no practical use
> > and a reboot will detect the unclean filesys and fsck (and either
> > fix, or at least tell the user what the problem is).
> 
> We are talking very early in the boot process.  I have never seen the case 
> where a filesystem is so corrupt that fsck is able to clean it but the 
> kernel still takes a panic after fsck runs.  It used to be that if fsck 
> fixed certain problems in the root filesystem the rc scripts would 
> automatically reboot the system.  I assume that's still the case and a 
> reboot won't stop at the db> prompt.
> 
> OTOH, if you keep running fsck only part way on the filesystem, you may 
> end up doing irreparable damage to it.
> 
> And if the system manages to mount the filesystem and run savecore each 
> time before it gets to the panic, you end up filling up the root 
> filesystem with a series of useless coredumps.

Please, this is very much a corner case. If you boot a new kernel, you are
either on-site or have ILO access anyway; rebooting a system without that
has always been a gamble.

> 
> Finally, if the system is stuck in a panic loop, how do you diagnose the 
> problem?  The system boots, prints a panic message, and then it resets 
> itself and starts printing the firmware messages which cause the panic 
> message to scroll off the screen.  I suppose if you're lucky and you can 
> convince the machine to get into single-user mode, you can manually set 
> ddb.onpanic=1 and then switch to multi-user mode to continue diagnosis.
> But if you can't get to the single-user shell you are SOL and probably 
> won't be able to figure out what's causing the problem let alone how to 
> fix it.

This is why we have 'boot -c'.

--
When in doubt, use brute force.

Adam Hoka <ahoka%NetBSD.org@localhost>
Adam Hoka <ahoka%MirBSD.de@localhost>
Adam Hoka <adam.hoka%gmail.com@localhost>
