Current-Users archive

How is savecore meant to work? Is it possible?



Some of you may remember the thread "NetBSD 5.1 RC3 in production" from
late September (around Sep 20 seems to have been the crux of it all),
where I commented that I considered 5.* less stable than 4.* (which had
been rock solid for me).

Part of the "problem" (well, difficulty diagnosing the problem, really)
was that when I upgraded to NetBSD 5, I left all the old NetBSD 4 stuff
around, and just installed NetBSD 5 (5.1_RC3 initially I think, maybe RC2)
on an empty RAID 5 (RAID level 5 using RAIDframe, I mean) partition I had
lying around - so swap was on RAID 5, and no dumps were getting done (this is
a production system, the crashes happen at any random time of the day,
regardless of whether I am there or not, so it cannot have DDB enabled - it
needs to auto-reboot ASAP).

You may remember that eventually ...

kre%munnari.OZ.AU@localhost said:
  | Anyway, I now have dumps configured on a wdNx type partition, and we'll see
  | what happens next time I get a crash (but that may be days, weeks, or even
  | months, away). 

Well, that was yesterday...   Twice in fact.   The first time was while I
was present watching (I'll describe the symptoms as I observed them below),
then again, in the early hours of the morning, while I was asleep.

Both times (perhaps) successfully saved a kernel core dump - the most
recent one should still be there, assuming it was made correctly,
as the dump partition is used for nothing else (it is not swap space).

This is from /var/log/messages for the first crash ...

Nov  1 12:28:36 jade syslogd: restart
Nov  1 12:28:36 jade /netbsd: panic: lock error
Nov  1 12:28:36 jade /netbsd:
Nov  1 12:28:36 jade /netbsd: dumping to dev 0,524303 offset 246718983
Nov  1 12:28:36 jade /netbsd: dump Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
Nov  1 12:28:36 jade /netbsd: 2006, 2007, 2008, 2009, 2010
Nov  1 12:28:36 jade /netbsd: The NetBSD Foundation, Inc.  All rights reserved.
Nov  1 12:28:36 jade /netbsd: Copyright (c) 1982, 1986, 1989, 1991, 1993
Nov  1 12:28:36 jade /netbsd: The Regents of the University of California.  All rights reserved.
Nov  1 12:28:36 jade /netbsd:
Nov  1 12:28:36 jade /netbsd: NetBSD 5.1_RC4 (JADE-1.12-20100917) #3: Sat Sep 18 02:56:13 ICT 2010

And this one is the second crash ...

Nov  2 04:18:02 jade /netbsd: panic: lock error
Nov  2 04:18:02 jade /netbsd:
Nov  2 04:18:02 jade /netbsd: dumping to dev 0,524303 offset 246718983
Nov  2 04:18:02 jade /netbsd: dump Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
Nov  2 04:18:02 jade /netbsd: 2006, 2007, 2008, 2009, 2010
Nov  2 04:18:02 jade /netbsd: The NetBSD Foundation, Inc.  All rights reserved.
Nov  2 04:18:02 jade /netbsd: Copyright (c) 1982, 1986, 1989, 1991, 1993
Nov  2 04:18:02 jade /netbsd: The Regents of the University of California.  All rights reserved.
Nov  2 04:18:02 jade /netbsd:
Nov  2 04:18:02 jade /netbsd: NetBSD 5.1_RC3 (JADE-1.12-20100614) #2: Mon Jun 14 08:35:26 ICT 2010

(Times are UTC+0700, if that matters, and are from when syslogd restarts
after the reboot/fsck; the actual crash would have been 10-20 minutes
earlier.)

You may notice there that, the first time, when I was present, I took the
opportunity to upgrade the kernel to NetBSD 5.1_RC4 (the kernel that had
been running was 5.1_RC3) - I had been waiting for the next time the system
was down to boot the new kernel.

You may also see that it went back to 5.1_RC3 when it rebooted in the early
hours of this morning - just because I hadn't altered what was /netbsd yet
(I like to run a new kernel for a while before I commit to booting it
every time).   Note: in this case that was just forgetfulness, I would have
switched had I remembered to do it...

In any case, now I have this kernel core dump from 5.1_RC4, the system
is currently running 5.1_RC3, and I need to somehow get the core dump from
the dumpdev partition (which is wd1p, though decoding that device number
was just a little challenging...) and into the filesystem so I can figure
out just what lock had the error ...
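
For anyone wanting to repeat the decoding of "dev 0,524303", here is a rough
sketch. It assumes the i386-style disk minor layout (8 "old" partitions per
unit, 16 partitions in total, as in that port's DISKUNIT()/DISKPART() macros)
and that block major 0 is wd; the function name is made up for illustration,
and other ports may pack the minor number differently.

```python
# Sketch only: decode a NetBSD/i386-style disk minor number into a unit
# number and partition letter.  The layout (8 "old" partitions, 16 total,
# 2**20 / 16 possible units) is an assumption taken from the i386 port's
# DISKUNIT()/DISKPART() macros; other ports may differ.
OLD_MAX_PARTITIONS = 8
MAX_PARTITIONS = 16
MAX_DISKS = (1 << 20) // MAX_PARTITIONS


def decode_disk_minor(minor):
    """Return (unit, partition_letter) for a disk minor number."""
    unit = (minor // OLD_MAX_PARTITIONS) % MAX_DISKS
    # Partitions 0-7 live in the low bits; 8-15 are flagged in the high bits.
    part = (minor % OLD_MAX_PARTITIONS) + (
        minor // (MAX_DISKS * OLD_MAX_PARTITIONS)
    ) * OLD_MAX_PARTITIONS
    return unit, chr(ord("a") + part)


# Minor number from the "dumping to dev 0,524303" message; the "wd" prefix
# assumes block major 0 is the wd driver.
unit, letter = decode_disk_minor(524303)
print("wd%d%s" % (unit, letter))
```

Under those assumptions the result agrees with the dp entry in fstab.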

The question right now is how I run savecore in these circumstances to
recover that crash dump?

jade# savecore -n -v -N /netbsd.3
dumplo = 965342658560 (1885434880 * 512)
savecore: can't find device 1248/787960

Oops ... (/netbsd.3 is the 5.1_RC4 kernel.)   Of course, there's no reason
that should work, from what I can tell looking in savecore.c: /netbsd.3
has absolutely nothing in it that would indicate what device dumps are
made to, that's set in fstab ...

jade# grep dp /etc/fstab
/dev/wd1p none swap dp 0 0

and as best I can see, savecore never goes near fstab to extract that
information from it, so in this scenario there's simply no way for
savecore to figure out what the dump device is, is there?
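
To be concrete about what "extracting that information from fstab" would
amount to, here is a minimal sketch. It is not what savecore does;
find_dump_devices is a hypothetical helper, and it only assumes the usual
whitespace-separated fstab fields with "dp" in the options of a swap entry.

```python
# Sketch only: list the dump devices named in fstab-format text, i.e.
# entries of type "swap" whose option list contains "dp".
# find_dump_devices is a hypothetical helper, not part of savecore.

def find_dump_devices(fstab_text):
    devices = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        fields = line.split()
        if len(fields) < 4:
            continue
        spec, _mount_point, fstype, options = fields[:4]
        if fstype == "swap" and "dp" in options.split(","):
            devices.append(spec)
    return devices


sample = "/dev/wd1p none swap dp 0 0\n"
print(find_dump_devices(sample))
```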

If I omit the -N ...

jade# savecore -n -v
dumplo = 126320119296 (246718983 * 512)
savecore: magic number mismatch (0x3a294449 != 0x8fca0101)
savecore: no core dump

Now it finds dumpdev OK (it will get that out of /dev/mem, where it
was correctly set by a swapon() (or something) sys call during the boot
process, using the info from fstab).   But here the dump magic number is
incorrect - which very possibly is because the kernel running and the
kernel that crashed are not the same thing; the symbol tables will
certainly be different.   (Or it may be that the dump was not successful,
it's kind of hard to tell at the minute.)
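
The numbers in that output can be checked with a few lines. This is only a
sketch: it confirms the block-to-byte arithmetic savecore prints for dumplo,
and shows the unexpected magic value as raw bytes, assuming a little-endian
machine (which is a guess; the message itself does not say). It proves
nothing about whether the dump itself is good.

```python
import struct

# Values copied from the savecore output.
DUMPLO_BLOCKS = 246718983        # dumplo, in 512-byte blocks
MAGIC_FOUND = 0x3a294449         # left-hand value in the mismatch message
MAGIC_WANTED = 0x8fca0101        # right-hand value in the mismatch message

# Byte offset of the dump on the dump device.
dumplo_bytes = DUMPLO_BLOCKS * 512
print(dumplo_bytes)

# The same four bytes as they would sit on disk, ASSUMING little-endian
# byte order.
raw = struct.pack("<I", MAGIC_FOUND)
print(raw)
```

With that byte order, the four bytes that were read are all printable ASCII.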

Now I certainly appreciate that savecore attempts to be smart, and simply
find the dump, but does it also have to be arrogant, and refuse to be told?
That is, it has no args to allow me to tell it that I know where the dump
is (or where it is supposed to be anyway) and have it avoid its detective
work and simply do what it is told.

I can't be the first person who's needed to extract a core file while
running a different kernel than the one that crashed (that must happen all
the time when developing new stuff and making unstable kernels), but is
it possible that no-one else in that situation has ever used anything but
the default dumpdev ?

Any suggestions, or do I need to go hack on savecore (which looks like it
might need to turn into major surgery from what I saw) to make it do what it
is told?

And lastly, I said above I'd describe what I observed when it crashed the
first time.  You can stop reading here if you like - nothing very relevant
follows ...

First, I had the system busier than it has been in a while, it was rebuilding
(updating) NetBSD 5 version binary packages after the past day or so's
upgrades (that's not at all unusual, and happens most days).   That is just
rebuilding the binary packages (using pkg_comp) not installing anything.

And in parallel, I'd just started doing a set of NetBSD 4 based pkgsrc
building (that was still in the process of working out the correct order in
which to rebuild everything - hundreds of packages - I hadn't touched them
since early September).  That was also using pkg_comp (different sandbox of
course.)

While all that was going on, I found an old .doc ("word") file I thought
I'd like to read, so I started openoffice.   That got as far as putting up
its silly advertising window (or the one to keep the audience occupied while
it does everything it needs to get started, or whatever you want to call it)
when X froze completely - no window updates of any kind, I don't recall if
the mouse cursor moved, but I think not.    At first, I thought that the
system had crashed then, but it was still fine (I think) - I had no
problem connecting via ssh from my laptop, and looking around.

The NetBSD 5 pkgsrc building was still proceeding, and successfully making
binary packages, and the NetBSD 4 one was still working through extracting
the dependency lists from everything that needed to be upgraded, so tsort
could put them in the right order.

At that stage I just decided "Ok, X is borked, no big deal, I can restart
that later" - the window system is not critical to that system, it
actually spends much of its time with the monitor powered off.   I didn't
want to kill it (X) right then, as the pkgsrc builds were happening in xterm
windows, and the NetBSD 4 one at least would have lost all its work (it had
probably been running the best part of an hour - and yes, it really does
take that long to dig out all the info when the set of packages is that
big - or at least it does using my script...)   I did try to kill the
running openoffice process, on the off chance that would unglue things,
but it didn't.

About 20 minutes or so after X hung, I noticed the monitor had reset
itself, and the system was rebooting (that's where I interrupted the boot
process and selected the 5.1_RC4 kernel from /netbsd.3).   Obviously, that
was after the crash happened...   Since the monitor (console) was in X mode
(and frozen solid) when the system crashed (and since I was using my laptop
and not gazing at the static monitor display anyway) I saw nothing of what
it might have attempted to tell me there - what's in the messages file,
and what is in the saved core file (which now will be from the later crash
of course), are all the evidence that exists.   Hence the desire to extract
that core file.

In the early hours, when it crashed again, it would have been rebuilding
NetBSD 4 packages (by the time this happened, it was well past the stage
of working out the order, which I started again after the first reboot, and
had been building stuff for many hours - 14 or so).   X would have been
running, but doing nothing interesting (monitor powered off).   No more
NetBSD 5 package building in parallel, that finished (I did restart that
after the reboot as well, turns out it was nearly done anyway).  Aside
from that, just regular network related stuff (e-mail, dns, ftp server).

kre


