Subject: kern/35461: pool cache group corruption
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Martin Husemann <martin@duskware.de>
List: netbsd-bugs
Date: 01/21/2007 20:00:00
>Number:         35461
>Category:       kern
>Synopsis:       pool cache group corruption
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jan 21 20:00:00 +0000 2007
>Originator:     Martin Husemann
>Release:        NetBSD 4.99.9
>Organization:
>Environment:
System: NetBSD 4.99.9 (CUBE) #1: Sat Jan 20 23:58:58 CET 2007	martin@night-porter.duskware.de:/usr/src/sys/arch/evbmips/compile/CUBE
Architecture: mips-el
Machine: evbmips
>Description:

I get a fairly reliable pool cache group corruption panic on a meshcube:

panic: kernel diagnostic assertion "pcg->pcg_objects[idx].pcgo_va != NULL" failed: file "../../../../kern/subr_pool.c", line 1991
Stopped in pid 338.1 (sshd) at  netbsd:cpu_Debugger+0x4:        jr      ra
                bdslot: nop                                               
db> bt                     
8025185c+898 (83fff000,b1100000,0,104) ra 801f0748 sz 0
panic+190 (83fff000,802c4fe4,802d8ab0,802d85c8) ra 802a501c sz 48
__assert+2c (83fff000,802c4fe4,7c7,802d85c8) ra 801ee39c sz 32   
801ee270+12c (83fff000,802c4fe4,7c7,802d85c8) ra 0 sz 0       
User-level: pid 338.1                         

This happens some time during boot to multiuser, with root on NFS. The panic
does not always occur at exactly the same point.

The traceback seems to be missing an interrupt frame, and while I could not 
exactly identify the affected pool cachelist entry, the ones close to the
corrupt one seem to belong to mbuf pools.

I'm not sure what to make of this - mbuf use after free? Some spl ordering
bug? I have been unable to reproduce it on other machines.
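
To make the spl ordering hypothesis concrete, here is a minimal user-space
sketch of how an unprotected interleaving could leave a NULL entry below the
fill count, which is the condition the failed assertion checks. It is a
simplified model, not the code from subr_pool.c: struct model_pcg, the
model_* functions and the group size of 16 are all assumptions made up for
illustration, and the "interrupt" is simply replayed inline.

```c
#include <stddef.h>

#define MODEL_NOBJECTS 16	/* assumed group size, illustration only */

/* Simplified model of a cache group; NOT the kernel's structure. */
struct model_pcg {
	unsigned int avail;		/* number of cached objects */
	void *objects[MODEL_NOBJECTS];	/* slots below avail must be non-NULL */
};

/* Return the index of the first bad slot below avail, or -1 if consistent. */
int
model_pcg_first_bad(const struct model_pcg *pcg)
{
	unsigned int i;

	if (pcg->avail > MODEL_NOBJECTS)
		return (int)MODEL_NOBJECTS;
	for (i = 0; i < pcg->avail; i++)
		if (pcg->objects[i] == NULL)
			return (int)i;
	return -1;
}

/* Store an object in the next free slot (caller ensures there is room). */
void
model_pcg_put(struct model_pcg *pcg, void *obj)
{
	pcg->objects[pcg->avail] = obj;
	pcg->avail++;
}

/*
 * Replay a "get" that is interrupted by a "put" between reading the
 * top slot and clearing it, as could happen if the two paths did not
 * run at the same interrupt priority level (hypothetical scenario).
 */
void *
model_interrupted_get(struct model_pcg *pcg, void *intr_obj)
{
	unsigned int idx = --pcg->avail;	/* get: claim the top slot */
	void *obj = pcg->objects[idx];		/* get: read the object */

	model_pcg_put(pcg, intr_obj);		/* "interrupt": put reuses idx */

	pcg->objects[idx] = NULL;		/* get resumes: wipes new entry */
	return obj;
}
```

After two calls to model_pcg_put() followed by one model_interrupted_get(),
the fill count is 2 again but slot 1 is NULL, so the next regular get would
trip a check equivalent to the assertion in the panic. Running a checker
like model_pcg_first_bad() over the groups from ddb or a crash dump might
help pin down which cache is affected.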

A kernel from early December works, but there seems to be no single commit
that caused this, only a sequence of commits that makes it more likely, so I
think this is an old bug that just happened not to show up often.

I haven't found time to track this down further, so I'm at least filing this
PR. I'm of course open to suggestions.

For the record, below is the dmesg.

Martin

Loaded initial symtab at 0x80307fb0, strtab at 0x8031e764, # entries 5653
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
    2006, 2007
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 4.99.9 (CUBE) #1: Sat Jan 20 23:58:58 CET 2007
	martin@night-porter.duskware.de:/usr/src/sys/arch/evbmips/compile/CUBE
4G Systems MTX-1
total memory = 65536 KB
avail memory = 60380 KB
timecounter: Timecounters tick every 1.953 msec
mainbus0 (root)
cpu0 at mainbus0: 324.00MHz (hz cycles = 632813, delay divisor = 324)
cpu0: Alchemy Au1500 (Rev 2 core) (0x1030202) Rev. 2 with software emulated floating point
cpu0: 16KB/32B 4-way set-associative L1 Instruction cache, 32 TLB entries
cpu0: 16KB/32B 4-way set-associative write-back L1 Data cache
obio0 at mainbus0
aubus0 at mainbus0
com0 at aubus0 addr 0x11100000 irq 0: Au1X00 UART, working fifo
com0: console
com1 at aubus0 addr 0x11400000 irq 3: Au1X00 UART, working fifo
aurtc0 at aubus0: Au1X00 programmable clock
aumac0 at aubus0 addr 0x11500000 irq 28: Au1X00 10/100 Ethernet
aumac0: Ethernet address 00:0e:56:00:02:2c
sqphy0 at aumac0 phy 31: Seeq 84220 10/100 media interface, rev. 0
sqphy0: using Seeq 84220 isolate/reset hack
sqphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
aumac1 at aubus0 addr 0x11510000 irq 29: Au1X00 10/100 Ethernet
aumac1: Ethernet address 00:0e:56:00:12:2c
ohci0 at aubus0 addr 0x10100000 irq 26: Alchemy OHCI
ohci0: OHCI version 1.0
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: vendor 0x0000 OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
aupci0 at aubus0 addr 0x14005000: Alchemy Host-PCI Bridge, 66MHz
pci0 at aupci0 bus 0
pci0: i/o space, memory space enabled
wi0 at pci0 dev 0 function 0: Intersil PRISM2.5 Mini-PCI WLAN (rev. 0x01)
wi0: interrupting at irq 1
wi0: 802.11 address 00:90:4b:0a:de:81
wi0: using RF:PRISM2.5 MAC:ISL3874A(Mini-PCI)
wi0: Intersil Firmware: Primary (1.1.1), Station (1.7.4)
wi0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps
augpio0 at aubus0 addr 0x11900100: Alchemy GPIO, primary block
gpio0 at augpio0: 23 pins
augpio1 at aubus0 addr 0x11700000: Alchemy GPIO, secondary block
gpio1 at augpio1: 16 pins
timecounter: Timecounter "clockinterrupt" frequency 512 Hz quality 0
timecounter: Timecounter "mips3_cp0_counter" frequency 324000000 Hz quality 100
uhub0: device problem, disabling port 1
root on aumac0
nfs_boot: trying DHCP/BOOTP
nfs_boot: DHCP next-server: 192.168.150.7
nfs_boot: my_domain=duskware.de
nfs_boot: my_addr=192.168.150.148
nfs_boot: my_mask=255.255.255.0
nfs_boot: gateway=192.168.150.10
root on 192.168.150.7:/usr/exp/hosts/evbmips
root time: 0x45b370d5
WARNING: preposterous TOD clock time
WARNING: using filesystem time
WARNING: CHECK AND RESET THE DATE!
init: copying out path `/sbin/init' 11


>How-To-Repeat:
>Fix: