Subject: panic: pool_get(mcpl): free list modified...
To: None <current-users@netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: current-users
Date: 06/16/2000 03:47:54
I've blundered into a repeatable panic, but I don't have much info.  

Sorry this is so long and fuzzy, but here goes... 


I've got a big pile of network testing/bashing scripts based on a 
hacked version of Netperf.  The full set takes roughly 24 hours.  
I have a pair of Alphas and a pair of Celerons full of network cards. 

I try to run a lot of tests without getting into the N-squared game.  
If I had time, I would do Tulip-crossover-Tulip, Tulip-switch-Tulip, 
and Tulip-hub-Tulip; and 82558-crossover-82558, 82558-switch-82558, 
and 82558-hub-82558.  I run that sort of test on each pair of CPUs 
I can get.  I don't do anything like Tulip-xxx-82558 or Alpha-Celeron 
since the combinations explode.  

If I try to run these tests on Alphas (600au) running 1.4Z over 82558s 
(fxp driver), I get a panic after several hours.  (Usually it happens 
an hour after I go home. :)  I just got the 6th panic a few minutes 
ago.  That test had been running under an hour. 

I haven't found any simple way to provoke this.  I've never been 
able to run longer than 4 or 5 hours. 

I haven't had any trouble running these tests on FDDI, Alteon 
Gigabit Ethernet, or Tulips.  (That's on Alphas running 1.4Z.) 

On 400 MHz Celerons, I've run these tests on Tulips and 82558s with 
no troubles. 

I didn't have any troubles like this with 1.4.2 on Alpha or i386. 

Has anybody else seen this?


How serious do people consider this type of problem? 

I'm deliberately running stupid/nasty tests.  I'm trying lots of 
different message sizes and different TCP window sizes and various 
usage patterns.  I'm fishing for trouble.  Frequently I find performance 
quirks or other glitches like the lost clock interrupts.  Occasionally 
I find serious problems like this. 

This might not happen with a "normal" usage pattern.  But timing 
bugs are often easy to trigger after you understand how to do it.  
Maybe something as simple as disk activity or paging would make this 
happen more often.

The run that just crashed hadn't done anything nasty - just TCP traffic.  
(The wire was busy, but that's not unreasonable.)

Should I send a PR?  The above is all the data I have. 

I don't want to send in anything if it doesn't have enough info to 
be useful, but I don't want this to slip into 1.5 because nobody 
knew there was a problem. 

Is anybody else seeing "fxp2: device timeout" type errors?

I get them on both machines, but it's always the "server" that crashes.  
(It has 512MB of memory vs 256.) 
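
As background on the panic message: a "free list modified" check of 
this kind typically works by stamping a magic value into each item 
when it is freed and verifying that value when the item is handed 
out again.  The sketch below is a toy model, not the NetBSD pool 
code; toy_pool, ITEM_MAGIC, and the function names are all made up 
for illustration.  It shows why such a check can fire well after 
the bad write: a stale pointer scribbles on a freed item, and nothing 
notices until that item reaches the head of the free list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ITEM_MAGIC 0xdeadbeefu  /* hypothetical stamp, not the kernel's value */
#define ITEM_SIZE  64
#define NITEMS     4

/* Header that overlays an item only while it sits on the free list. */
struct free_item {
    uint32_t          magic;    /* written on free, checked on allocate */
    struct free_item *next;
};

union slot {
    struct free_item hdr;
    unsigned char    bytes[ITEM_SIZE];
};

struct toy_pool {
    struct free_item *head;
    union slot        slots[NITEMS];
};

/* Return an item to the pool: stamp it and push it on the free list. */
static void toy_pool_put(struct toy_pool *p, void *v)
{
    struct free_item *it = v;

    assert(it != NULL);
    it->magic = ITEM_MAGIC;
    it->next = p->head;
    p->head = it;
}

static void toy_pool_init(struct toy_pool *p)
{
    p->head = NULL;
    for (int i = 0; i < NITEMS; i++)
        toy_pool_put(p, &p->slots[i]);
}

/*
 * Take an item from the pool.  Sets *corrupt and returns NULL if the
 * stamp was overwritten while the item was free; a kernel would
 * panic at this point instead.
 */
static void *toy_pool_get(struct toy_pool *p, int *corrupt)
{
    struct free_item *it = p->head;

    *corrupt = 0;
    if (it == NULL)
        return NULL;
    if (it->magic != ITEM_MAGIC) {
        *corrupt = 1;
        return NULL;
    }
    p->head = it->next;
    return it;
}

/* Returns 1 if the corruption is detected only on the second get. */
static int demo_delayed_detection(void)
{
    struct toy_pool p;
    int bad;

    toy_pool_init(&p);
    void *a = toy_pool_get(&p, &bad);
    void *b = toy_pool_get(&p, &bad);

    toy_pool_put(&p, a);
    *(uint32_t *)a = 0;         /* the bug: write through a stale pointer */
    toy_pool_put(&p, b);        /* b now sits in front of a on the list */

    void *first = toy_pool_get(&p, &bad);   /* returns b, no complaint */
    int first_ok = (first == b && !bad);
    (void)toy_pool_get(&p, &bad);           /* reaches a, check fires */
    return first_ok && bad;
}
```

Calling demo_delayed_detection() from a small main() returns 1 in 
this toy: the first allocation after the stale write succeeds, and 
only the second one trips the check.  In a busy kernel the gap between 
the bad write and the detection could be far larger.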

------

Here are some second-order problems I've encountered while trying 
to track this down.

None of this is on my critical path.  Please be sure not to interpret 
any of this in a negative way.  I'm not trying to gripe but rather 
fishing for info and/or reporting the problems I encountered in hopes 
that they might get fixed.  I'll send PRs if anybody thinks something 
deserves one. 


1) I'm getting a sore head from trying to deal with ddb.  Are mangled 
pools known to be an ugly case?  Does ddb have similar troubles on 
i386? 

2) I hit the first problem when I typed "reboot".  I was expecting 
it to do the right thing and leave me with a dump to investigate.  
The first thing that reboot does is try to turn off the network drivers 
so an arriving packet doesn't trash memory.  That's clearly a good 
idea, but I think the cleanup code just calls the general driver 
shutdown code which returns the buffers rather than simply resetting 
the hardware.  The buffer pool panics again dumping me back into 
ddb.

3) What happens if I've got the go-to-ddb-on-panic flag turned off 
or don't have ddb included in my CONFIG?  Will it loop?

4) Somebody suggested that I try "call cpu_reboot(0x104)".  That 
takes a dump (yea!) but then it gets the recursive panic.  Is there 
anything I can do at that point?  I've been power cycling. 

5) Would somebody familiar with this area please check the description  
of the reboot command in the ddb man page.  I was working on an Alpha 
and I read the "flags" text to mean the text I would have typed on 
the boot line at the >>> prompt.  I think a line or two of hints 
about what the flags do or a pointer to the header file would have 
helped me a lot.  (There is a comment about not being able to specify 
the "boot string", but I didn't understand that because the SRM code 
calls that string "flags".) 

6) Does "reboot" take a dump by default?  If not, should there be 
a simple dump-and-reboot type command? 


7) With a mashed free list, is there any point in trying to get info 
out of the core dump?  Did the real bug happen ages ago, with the 
time bomb only going off now? 

8) I built a kernel with makeoptions    DEBUG="-g"

Below is as far as I can get with gdb.  What do I type to get more 
info?  (I'm far from a gdb wizard.) 


mckinley# gdb /usr/src/sys/arch/alpha/compile/MIATA/netbsd.gdb
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "alpha--netbsd"...
(gdb) target kcore netbsd.4.core
panic: pool_get(%s): free list modified: magic=%x; page %p; item addr %p

#0  0xfffffc0000426eb8 in dumpsys ()
    at ../../../../arch/alpha/alpha/machdep.c:1294
1294            savectx(&dumppcb);
(gdb) where
#0  0xfffffc0000426eb8 in dumpsys ()
    at ../../../../arch/alpha/alpha/machdep.c:1294
#1  0xfffffc0000426acc in cpu_reboot (howto=260, bootstr=0x0)
    at ../../../../arch/alpha/alpha/machdep.c:1113
#2  0xfffffc00003255dc in db_reboot_cmd ()
#3  0xfffffc0000324fac in db_command ()
#4  0xfffffc000032530c in db_command_loop ()
#5  0xfffffc0000329b14 in db_trap ()
#6  0xfffffc0000434624 in ddb_trap ()
#7  0xfffffc0000300194 in alpha_debug ()
    at ../../../../arch/alpha/alpha/debug.s:101
#8  0xfffffc000042ebc4 in trap ()
#9  0xfffffc00003003b0 in XentIF ()
    at ../../../../arch/alpha/alpha/locore.s:525
#10 0xfffffc000034f544 in panic ()
warning: Hit beginning of text section without finding
warning: enclosing function for address 0x4
This warning occurs if you are debugging a function without any symbols
(for example, in a stripped executable).  In that case, you may wish to
increase the size of the search with the `set heuristic-fence-post' command.

Otherwise, you told GDB there was a function where there isn't one, or
(more likely) you have encountered a bug in GDB.
(gdb)
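
The howto=260 in frame #1 above is the 0x104 passed to cpu_reboot 
in item 4.  The small sketch below decodes such a value into flag 
names.  The bit values in the table are what I understand 
<sys/reboot.h> to define; treat them as an assumption and check the 
header in your own tree.  The table and decode_howto are made-up 
names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    int         bit;
    const char *name;
};

/* Assumed values; verify against <sys/reboot.h> before relying on them. */
static const struct flag_name howto_flags[] = {
    { 0x001, "RB_ASKNAME" },
    { 0x002, "RB_SINGLE"  },
    { 0x004, "RB_NOSYNC"  },
    { 0x008, "RB_HALT"    },
    { 0x100, "RB_DUMP"    },
};

/* Write the names of the set bits into buf, separated by '|'. */
static void decode_howto(int howto, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(howto_flags) / sizeof(howto_flags[0]); i++) {
        if ((howto & howto_flags[i].bit) == 0)
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "|" : "", howto_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return;             /* output truncated; stop here */
        used += (size_t)n;
    }
}
```

Under those assumed values, decoding 0x104 gives "RB_NOSYNC|RB_DUMP", 
which fits what I saw: the kernel took a dump without trying to sync 
the disks first.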