Current-Users archive


Re: How is savecore meant to work? Is it possible ?



What do you get for: ident `which savecore`

On Tue, Nov 02, 2010 at 10:56:49AM +0700, Robert Elz wrote:
> Some of you may remember the thread "NetBSD 5.1 RC3 in production" from
> late September (around Sep 20 seems to have been the crux of it all),
> where I commented that I considered 5.* less stable than 4.* (which had
> been rock solid for me).
> 
> Part of the "problem" (well, difficulty diagnosing the problem really)
> was that when I upgraded to NetBSD 5, I left all the old NetBSD 4 stuff
> around, and just installed NetBSD 5 (5.1_RC3 initially I think, maybe RC2)
> on an empty RAID 5 (RAID level 5 using RAIDframe, I mean) partition I had
> lying around - so swap was on RAID 5, and no dumps were getting done (this is
> a production system, the crashes happen at any random time of the day,
> regardless of whether I am there or not, so it cannot have DDB enabled - it
> needs to auto-reboot ASAP).
> 
> You may remember that eventually ...
> 
> kre%munnari.OZ.AU@localhost said:
>   | Anyway, I now have dumps configured on a wdNx type partition, and we'll see
>   | what happens next time I get a crash (but that may be days, weeks, or even
>   | months, away). 
> 
> Well, that was yesterday...   Twice in fact.   The first time was while I
> was present watching (I'll describe the symptoms as I observed them below),
> then again, in the early hours of the morning, while I was asleep.
> 
> Both times (perhaps) successfully saved a kernel core dump - the most
> recent one should still be there, assuming it was made correctly,
> as the dump partition is used for nothing else (it is not swap space).
> 
> This is from /var/log/messages for the first crash ...
> 
> Nov  1 12:28:36 jade syslogd: restart
> Nov  1 12:28:36 jade /netbsd: panic: lock error
> Nov  1 12:28:36 jade /netbsd:
> Nov  1 12:28:36 jade /netbsd: dumping to dev 0,524303 offset 246718983
> Nov  1 12:28:36 jade /netbsd: dump Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
> Nov  1 12:28:36 jade /netbsd: 2006, 2007, 2008, 2009, 2010
> Nov  1 12:28:36 jade /netbsd: The NetBSD Foundation, Inc.  All rights reserved.
> Nov  1 12:28:36 jade /netbsd: Copyright (c) 1982, 1986, 1989, 1991, 1993
> Nov  1 12:28:36 jade /netbsd: The Regents of the University of California.  All rights reserved.
> Nov  1 12:28:36 jade /netbsd:
> Nov  1 12:28:36 jade /netbsd: NetBSD 5.1_RC4 (JADE-1.12-20100917) #3: Sat Sep 18 02:56:13 ICT 2010
> 
> And this one is the second crash ...
> 
> Nov  2 04:18:02 jade /netbsd: panic: lock error
> Nov  2 04:18:02 jade /netbsd:
> Nov  2 04:18:02 jade /netbsd: dumping to dev 0,524303 offset 246718983
> Nov  2 04:18:02 jade /netbsd: dump Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
> Nov  2 04:18:02 jade /netbsd: 2006, 2007, 2008, 2009, 2010
> Nov  2 04:18:02 jade /netbsd: The NetBSD Foundation, Inc.  All rights reserved.
> Nov  2 04:18:02 jade /netbsd: Copyright (c) 1982, 1986, 1989, 1991, 1993
> Nov  2 04:18:02 jade /netbsd: The Regents of the University of California.  All rights reserved.
> Nov  2 04:18:02 jade /netbsd:
> Nov  2 04:18:02 jade /netbsd: NetBSD 5.1_RC3 (JADE-1.12-20100614) #2: Mon Jun 14 08:35:26 ICT 2010
> 
> (timestamps are UTC+0700, if that matters, and are from when syslogd
> restarted after the reboot/fsck; the actual crash would have been 10-20
> minutes earlier).
> 
> You may notice there, that the first time, when I was present, I took the
> opportunity to upgrade the kernel to NetBSD 5.1_RC4 (the kernel that had
> been running was 5.1_RC3) - I had been waiting on the next time the system
> was down to boot the new kernel.
> 
> You may also see that it went back to 5.1_RC3 when it rebooted in the early
> hours of this morning - just because I hadn't altered what was /netbsd yet
> (I like to run a new kernel for a while before I commit to booting it
> every time).   Note: in this case that was just forgetfulness, I would have
> switched had I remembered to do it...
> 
> In any case, now I have this kernel core dump from 5.1_RC4, the system
> is currently running 5.1_RC3, and I need to somehow get the core dump from
> the dumpdev partition (which is wd1p if decoding that device number was
> just a little challenging...) and into the filesystem so I can figure
> out just what lock had the error ...
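For anyone else puzzling over "dev 0,524303": below is a minimal sketch of how that minor number can be decoded to unit 1, partition p. It assumes the minor-number layout used by ports that grew from 8 to 16 partitions per disk, where partitions a-h stay in the low minor numbers and i-p sit in a second range starting at 524288. That layout, and the names `OLD_MAX_PARTITIONS`, `EXTRA_BASE` and `decode_disk_minor`, are my assumptions and are not taken from the message; the result does agree with the wd1p stated above, taking major 0 to be the wd driver.

```python
# Hypothetical helper, not part of savecore or any NetBSD tool.
# Assumed layout: partitions a-h use minor = unit * 8 + part, and
# partitions i-p use the same scheme shifted up by EXTRA_BASE.
OLD_MAX_PARTITIONS = 8
EXTRA_BASE = 524288  # assumed start of the range holding partitions i-p


def decode_disk_minor(minor):
    """Return (unit, partition_letter) for a disk minor number."""
    unit = (minor % EXTRA_BASE) // OLD_MAX_PARTITIONS
    part = (minor % OLD_MAX_PARTITIONS) \
        + (minor // EXTRA_BASE) * OLD_MAX_PARTITIONS
    return unit, "abcdefghijklmnop"[part]


unit, letter = decode_disk_minor(524303)
print(f"wd{unit}{letter}")  # prints "wd1p"
```

If the assumed layout is wrong for a given port, the arithmetic above will not apply; the listing in /dev (ls -l /dev/wd1p) remains the authoritative way to check major and minor numbers.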
> 
> The question right now is how I run savecore in these circumstances to
> recover that crash dump.
> 
> jade# savecore -n -v -N /netbsd.3
> dumplo = 965342658560 (1885434880 * 512)
> savecore: can't find device 1248/787960
> 
> Oops ... (/netbsd.3 is the 5.1_RC4 kernel)    Of course, there's no reason
> that should work from what I can tell looking in savecore.c, /netbsd.3
> has absolutely nothing in it that would indicate what device dumps are
> made to, that's set in fstab ...
> 
> jade# grep dp /etc/fstab
> /dev/wd1p none swap dp 0 0
> 
> and as best I can see, savecore never goes near fstab to extract that
> information from it, so in this scenario there's simply no way for
> savecore to figure out what the dump device is, is there?
> 
> If I omit the -N ...
> 
> jade# savecore -n -v
> dumplo = 126320119296 (246718983 * 512)
> savecore: magic number mismatch (0x3a294449 != 0x8fca0101)
> savecore: no core dump
> 
> Now it finds dumpdev OK (it will get that out of /dev/mem, where it
> was correctly set by a swapon() (or something) sys call during the boot
> process, using the info from fstab).   But here the dump magic number is
> incorrect - which very possibly is because the kernel running and the
> kernel that crashed are not the same thing, the symbol tables will
> certainly be different.   (Or it may be that the dump was not successful,
> it's kind of hard to tell at the minute.)
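The numbers in the two savecore runs can be cross-checked against the kernel's own "dumping to dev ... offset ..." line from syslog. The sketch below does only that arithmetic, using values copied from the output quoted above. It also prints the bytes of the value savecore reported on the left of the magic mismatch, on the assumptions (mine, not confirmed by the thread) that this is the value read from the dump partition and that the machine is little-endian.

```python
import struct

BLOCK = 512  # multiplier savecore shows in its "dumplo" line

logged_offset = 246718983    # from "dumping to dev 0,524303 offset 246718983"
plain_blocks = 246718983     # savecore -n -v
with_n_blocks = 1885434880   # savecore -n -v -N /netbsd.3
found_magic = 0x3A294449     # left-hand value in "magic number mismatch"


def dumplo_bytes(blocks):
    """Byte offset corresponding to a dumplo value given in blocks."""
    return blocks * BLOCK


def magic_as_bytes(value):
    """The 32-bit value as it would sit on disk, assuming little-endian."""
    return struct.pack("<I", value)


print(dumplo_bytes(plain_blocks))      # 126320119296, as savecore printed
print(dumplo_bytes(with_n_blocks))     # 965342658560, as savecore printed
print(plain_blocks == logged_offset)   # True: agrees with the panic message
print(with_n_blocks == logged_offset)  # False: the -N run saw another value
print(magic_as_bytes(found_magic))     # b'ID):'
```

So the run without -N is at least looking at the same offset the kernel says it dumped to, while the -N run is not. The four bytes found there happen to be printable ASCII, which may hint that savecore is reading something other than a dump header at that spot, but that is an inference from the arithmetic and nothing more.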
> 
> Now I certainly appreciate that savecore attempts to be smart, and simply
> find the dump, but does it also have to be arrogant, and refuse to be told?
> That is, it has no args to allow me to tell it that I know where the dump
> is (or where it is supposed to be anyway) and have it avoid its detective
> work and simply do what it is told.
> 
> I can't be the first person who's needed to extract a core file while
> running a different kernel than the one that crashed (that must happen all
> the time when developing new stuff and making unstable kernels), but is
> it possible that no-one else in that situation has ever used anything but
> the default dumpdev ?
> 
> Any suggestions, or do I need to go hack on savecore (which looks like it
> might need to turn into major surgery from what I saw) to make it do what it
> is told?
> 
> And lastly, I said above I'd describe what I observed when it crashed the
> first time.  You can stop reading here if you like - nothing very relevant
> follows ...
> 
> First, I had the system busier than it has been in a while, it was rebuilding
> (updating) NetBSD 5 version binary packages after the past day or so's
> upgrades (that's not at all unusual, and happens most days).   That is just
> rebuilding the binary packages (using pkg_comp) not installing anything.
> 
> And in parallel, I'd just started doing a set of NetBSD 4 based pkgsrc
> building (that was still in the process of working out the correct order in
> which to rebuild everything - hundreds of packages - I hadn't touched them
> since early September).  That was also using pkg_comp (different sandbox of
> course.)
> 
> While all that was going on, I found an old .doc ("word") file I thought
> I'd like to read, so I started openoffice.   That got as far as putting up
> its silly advertising window (or the one to keep the audience occupied while
> it does everything it needs to get started, or whatever you want to call it)
> when X froze completely - no window updates of any kind, I don't recall if
> the mouse cursor moved, but I think not.    At first, I thought that the
> system had crashed then, but it was still fine (I think) - I had no
> problem connecting via ssh from my laptop, and looking around.
> 
> The NetBSD 5 pkgsrc building was still proceeding, and successfully making
> binary packages, and the NetBSD 4 one was still working through extracting
> the dependency lists from everything that needed to be upgraded, so tsort
> could put them in the right order.
> 
> At that stage I just decided "Ok, X is borked, no big deal, I can restart
> that later" - the window system is not critical to that system, it
> actually spends much of its time with the monitor powered off.   I didn't
> want to kill it (X) right then, as the pkgsrc builds were happening in xterm
> windows, and the NetBSD 4 one at least would have lost all its work (it had
> probably been running the best part of an hour - and yes, it really does
> take that long to dig out all the info when the set of packages is that
> big - or at least it does using my script...)   I did try to kill the
> running openoffice process, on the off chance that would unglue things,
> but it didn't.
> 
> About 20 minutes or so after X hung, I noticed the monitor had reset
> itself, and the system was rebooting (that's where I interrupted the boot
> process and selected the 5.1_RC4 kernel from /netbsd.3)   Obviously, that
> was after the crash happened...   Since the monitor (console) was in X mode
> (and frozen solid) when the system crashed (and since I was using my laptop
> and not gazing at the static monitor display anyway) I saw nothing of what
> it might have attempted to tell me there - what's in the messages file,
> and what is in the saved core file (which now will be from the later crash
> of course) is all the evidence that exists.   Hence the desire to extract
> that core file.
> 
> In the early hours, when it crashed again, it would have been rebuilding
> NetBSD 4 packages (by the time this happened, it was well past the stage
> of working out the order, which I started again after the first reboot, and
> had been building stuff for many hours - 14 or so).   X would have been
> running, but doing nothing interesting (monitor powered off).   No more
> NetBSD 5 package building in parallel, as that had finished (I did restart it
> after the reboot as well; it turned out it was nearly done anyway).  Aside
> from that, just regular network related stuff (e-mail, dns, ftp server).
> 
> kre
> 

