current-users: Re: panic: pool_get(mcpl): free list modified...

Subject: Re: panic: pool_get(mcpl): free list modified...
To: Hal Murray <murray@pa.dec.com>
From: John Hawkinson <jhawk@MIT.EDU>
List: current-users
Date: 06/16/2000 11:41:53
| I've blundered into a repeatable panic, but I don't have much info.  
| 
| Sorry this is so long and fuzzy, but here goes... 

This is just a partial stab at some things.

| Should I send a PR?  Above is all the data I have. 

Yes. Any panic should get a PR if you can't fix it immediately.

| 1) I'm getting a sore head from trying to deal with ddb.  Are mangled 
| pools known to be an ugly case?  Does ddb have similar troubles on 
| i386? 

I don't understand the problem here.

| 2) I hit the first problem when I typed "reboot".  I was expecting 
| it to do the right thing and leave me with a dump to investigate.  

No, reboot should merely reboot without dumping. "sync" should
dump without rebooting. ddb(4) could be clearer on this. Would
you care to propose text? ;-)

| The first thing that reboot does is try to turn off the network drivers 
| so an arriving packet doesn't trash memory.  That's clearly a good 
| idea, but I think the cleanup code just calls the general driver 
| shutdown code which returns the buffers rather than simply resetting 
| the hardware.  The buffer pool panics again dumping me back into 
| ddb.

Both "sync" and "reboot" call cpu_reboot() so MD rebooting code runs.

| 3) What happens if I've got the go-to-ddb-on-panic flag turned off 
| or don't have ddb included in my CONFIG?   Will it loop?

It should loop for a while, eventually overflow the stack,
and presumably reboot. At least, that seems to happen on the i386.

| 4) Somebody suggested that I try "call cpu_reboot(0x104)".  That 
| takes a dump (yea!) but then it gets the recursive panic.  Is there 
| anything I can do at that point?  I've been power cycling. 

Well, if cpu_reboot(0x4) (4 == RB_NOSYNC) doesn't work, then
you're losing somewhere in cpu_reboot(). If it's not due to the dump,
presumably it is due to the shutdown hooks. You could try:

	   !prom_halt(0x8)

| 5) Would somebody familiar with this area please check the description  
| of the reboot command in the ddb man page.  I was working on an Alpha 
| and I read the "flags" text to mean the text I would have typed on 
| the boot line at the >>> prompt.  I think a line or two of hints 
| about what the flags do or a pointer to the header file would have 
| helped me a lot.  (There is a comment about not being able to specify 
| the "boot string", but I didn't understand that because the SRM code 
| calls that string "flags".) 

The flags mean the argument to cpu_reboot(). This request is essentially
kern/9544.
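
To make the flag values concrete, here is a minimal sketch of how a
howto value splits into RB_* bits. The three values are as I recall
them from <sys/reboot.h> (only RB_NOSYNC == 4 is confirmed above), so
check your own tree; decode_howto() is a hypothetical helper for
illustration, not anything in the kernel.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Assumed values, recalled from <sys/reboot.h>; verify against your
 * own tree.  Only a subset of the RB_* flags is listed here.
 */
#define RB_NOSYNC	0x004	/* do not sync disks before rebooting */
#define RB_HALT		0x008	/* halt rather than reboot */
#define RB_DUMP		0x100	/* write a crash dump first */

struct rb_name {
	int bit;
	const char *name;
};

static const struct rb_name rb_names[] = {
	{ RB_NOSYNC, "RB_NOSYNC" },
	{ RB_HALT,   "RB_HALT" },
	{ RB_DUMP,   "RB_DUMP" },
};

/*
 * Hypothetical helper: write the names of the known bits set in
 * howto into buf, separated by '|'.  Bits not in the table above are
 * appended in hex.  A howto of 0 yields "0".
 */
char *
decode_howto(int howto, char *buf, size_t len)
{
	size_t i;
	int rest = howto;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < sizeof(rb_names) / sizeof(rb_names[0]); i++) {
		if ((howto & rb_names[i].bit) == 0)
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, rb_names[i].name, len - strlen(buf) - 1);
		rest &= ~rb_names[i].bit;
	}
	if (rest != 0 || buf[0] == '\0') {
		size_t used = strlen(buf);

		/* Leftover (unlisted) bits, or a plain zero. */
		snprintf(buf + used, len - used, "%s0x%x",
		    used != 0 ? "|" : "", (unsigned int)rest);
		if (howto == 0)
			snprintf(buf, len, "0");
	}
	return buf;
}
```

Under those assumptions, decode_howto(0x104, buf, sizeof(buf)) gives
RB_NOSYNC|RB_DUMP, which is the howto=260 that shows up in the
cpu_reboot() frame of your gdb trace.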

| 6) Does "reboot" take a dump by default?  If not, should there be 
| a simple dump-and-reboot type command? 

No, see above ("sync"). Perhaps it is poorly named. I don't know
what to do about that.

| 7) With a mashed free list, is there any point in trying to get info 
| out of the core dump?  Did the real bug happen ages ago and only 
| now did the time bomb go off? 

Hard to know without checking.

| 8) I built a kernel with makeoptions    DEBUG="-g"
| 
| Below is as far as I can get with gdb.  What do I type to get more 
| info?  (I'm far from a gdb wizard.) 
| 
| 
| mckinley# gdb /usr/src/sys/arch/alpha/compile/MIATA/netbsd.gdb
| GNU gdb 4.17
| Copyright 1998 Free Software Foundation, Inc.
| GDB is free software, covered by the GNU General Public License, and you are
| welcome to change it and/or distribute copies of it under certain conditions.
| Type "show copying" to see the conditions.
| There is absolutely no warranty for GDB.  Type "show warranty" for details.
| This GDB was configured as "alpha--netbsd"...
| (gdb) target kcore netbsd.4.core
| panic: pool_get(%s): free list modified: magic=%x; page %p; item addr %p
| 
| #0  0xfffffc0000426eb8 in dumpsys ()
|     at ../../../../arch/alpha/alpha/machdep.c:1294
| 1294            savectx(&dumppcb);
| (gdb) where
| #0  0xfffffc0000426eb8 in dumpsys ()
|     at ../../../../arch/alpha/alpha/machdep.c:1294
| #1  0xfffffc0000426acc in cpu_reboot (howto=260, bootstr=0x0)
|     at ../../../../arch/alpha/alpha/machdep.c:1113
| #2  0xfffffc00003255dc in db_reboot_cmd ()
| #3  0xfffffc0000324fac in db_command ()
| #4  0xfffffc000032530c in db_command_loop ()
| #5  0xfffffc0000329b14 in db_trap ()
| #6  0xfffffc0000434624 in ddb_trap ()
| #7  0xfffffc0000300194 in alpha_debug ()
|     at ../../../../arch/alpha/alpha/debug.s:101
| #8  0xfffffc000042ebc4 in trap ()
| #9  0xfffffc00003003b0 in XentIF ()
|     at ../../../../arch/alpha/alpha/locore.s:525
| #10 0xfffffc000034f544 in panic ()
| warning: Hit beginning of text section without finding
| warning: enclosing function for address 0x4
| This warning occurs if you are debugging a function without any symbols
| (for example, in a stripped executable).  In that case, you may wish to
| increase the size of the search with the `set heuristic-fence-post' command.
| 
| Otherwise, you told GDB there was a function where there isn't one, or
| (more likely) you have encountered a bug in GDB.
| (gdb) 

Well, that's unfortunate. Does DDB give you a stack trace that crosses
the initial panic() when you hit "t"? (Or, if you turn off ddb_onpanic
and let ddb print the stack trace and go its merry way [new feature],
what happens?)

I would say that you should try backtracing the stack by hand starting
with the panic() frame. So perhaps "info frame 10", check your
registers, and start trying to decode the stack by hand, perhaps
following along as db_stack_trace_print() does
(sys/arch/alpha/alpha/db_trace.c).  On the other hand, if "t" from ddb
wasn't helpful, that probably won't be either.  This seems vaguely
reminiscent of port-i386/9367 (gdb tracebacks of i386 kernel crash
dumps fail on trap()), though perhaps it is totally unrelated.

I think you have clear fodder for a number of PRs here.

--jhawk