Subject: kern/25717: panic: lockmgr: no context during network load on MP sparc
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <bsieker@rvs.uni-bielefeld.de>
List: netbsd-bugs
Date: 05/26/2004 13:33:57
>Number:         25717
>Category:       kern
>Synopsis:       panic: lockmgr: no context during network load on MP sparc
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed May 26 13:34:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Bernd Sieker
>Release:        NetBSD-2.0_BETA (2004-05-22)
>Organization:
>Environment:
NetBSD portier.home.loc 2.0_BETA NetBSD 2.0_BETA (PORTIER.MP) #6: Wed May 26 11:02:42 CEST 2004  bernd@boa.home.loc:/usr/source/2.0/src/sys/arch/sparc/compile/obj/PORTIER.MP sparc

>Description:
The machine panics under moderate network load with "lockmgr: no context". Here is a traceback of a typical crash:

cpu_Debugger(0xf01f5170, 0xf01f4e58, 0x100, 0x0, 0x0, 0xf022bc00) at netbsd:_lockmgr+0x2a4
_lockmgr(0xf024c39c, 0x1, 0x0, 0xf0200400, 0x93, 0xf0249a18) at netbsd:uvmfault_lookup+0x21c
uvmfault_lookup(0xf0221bc8, 0x0, 0x356, 0xf022b000, 0x2710, 0xf026fc00) at netbsd:uvm_fault+0x58
uvm_fault(0xf024c398, 0xf2d2c000, 0x0, 0x2, 0x538, 0x8) at netbsd:mem_access_fault4m+0x3d8
mem_access_fault4m(0x9, 0x3a6, 0xf2d2c000, 0xf0221d30, 0xf0a22ac8, 0xf01d8eb4) at 0xf000625c
0xf000625c(0xf2d2bff4, 0xf05465f8, 0x16, 0x0, 0x578, 0xf0a22acc) at netbsd:lance_copytobuf_contig+0x18
lance_copytobuf_contig(0xf04ae400, 0xf05465e8, 0x6bff4, 0x16, 0x5, 0x5) at netbsd:lance_put+0x1a4
lance_put(0x6899a, 0x6c00a, 0xf0546500, 0x8, 0xf0269dcc, 0x0) at netbsd:am7990_start+0xac
am7990_start(0xf04ae438, 0xf0221f2c, 0x2, 0x5a, 0x0, 0xf0228000) at netbsd:ether_output+0x380
ether_output(0x0, 0xf0546500, 0xf0221fa8, 0x8864, 0x88, 0xdc) at netbsd:pppoe_output+0xb8
pppoe_output(0xf0537000, 0xf0546500, 0x0, 0x1, 0xf0222084, 0xf04a9b44) at netbsd:pppoe_start+0x10c
pppoe_start(0xf0546500, 0x2, 0x1, 0x0, 0xf0561334, 0x2) at netbsd:sppp_output+0x2d8
sppp_output(0xf0537000, 0xf0546500, 0xf0267c94, 0xf05443b8, 0x0, 0xf0561348) at netbsd:ip_output+0x66c
ip_output(0x0, 0x14, 0xf0537000, 0xf0267c94, 0x578, 0xf0561334) at netbsd:ip_forward+0x1fc
ip_forward(0xf0561300, 0x0, 0x78, 0x1d, 0xacb31d78, 0xacb31d00) at netbsd:ip_input+0x488
ip_input(0xf0561300, 0xf01f4e58, 0x356, 0x0, 0x40100, 0xf024b588) at netbsd:ipintr+0x88
ipintr(0xf02222c8, 0xf01b0cc4, 0x100, 0x408000e7, 0x538, 0x100) at 0xf00066c0
0xf00066c0(0xf022b658, 0xf01f7500, 0x292, 0x0, 0x200, 0x1c) at netbsd:switchexit+0xf0
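
For anyone reading the trace: I am assuming the usual ddb layout, where each line shows a function with its arguments and, after "at", the return address in its caller. Under that assumption the call chain can be pulled out with a small script. This is only a sketch: the helper name call_chain is made up for illustration, and the sample input is the first three lines of the trace with the arguments shortened.

```python
import re

# First three frames of the trace above, arguments shortened to "...".
TRACE = """\
cpu_Debugger(...) at netbsd:_lockmgr+0x2a4
_lockmgr(...) at netbsd:uvmfault_lookup+0x21c
uvmfault_lookup(...) at netbsd:uvm_fault+0x58
"""

# Assumed line layout: callee(args) at [netbsd:]caller[+0xoffset]
FRAME = re.compile(r"^(\S+?)\(.*\) at (?:netbsd:)?(\S+?)(?:\+0x[0-9a-f]+)?$")


def call_chain(trace):
    """Return function names ordered from outermost caller to innermost callee."""
    matches = [FRAME.match(line.strip()) for line in trace.splitlines()]
    frames = [m for m in matches if m]
    if not frames:
        return []
    # Each line names the callee; the last line's "at" part names
    # the outermost caller that is visible in this excerpt.
    chain = [m.group(1) for m in frames]
    chain.append(frames[-1].group(2))
    chain.reverse()
    return chain


if __name__ == "__main__":
    print(" -> ".join(call_chain(TRACE)))
```

Run over the full trace, the same reading puts ipintr near the outer end of the chain and the le transmit path (am7990_start, lance_put, lance_copytobuf_contig) immediately before the memory fault that ends in _lockmgr.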


Here is the complete dmesg output from the affected machine:

NetBSD 2.0_BETA (PORTIER.MP) #2: Sat May 22 19:02:21 CEST 2004
        bernd@boa.home.loc:/usr/source/2.0/src/sys/arch/sparc/compile/obj/PORTIER.MP
total memory = 64816 KB
avail memory = 59668 KB
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@1,0
mainbus0 (root): SUNW,SPARCstation-20: hostid 7271bfaf
cpu0 at mainbus0: mid 8: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l): cache enabled
cpu1 at mainbus0: mid 10: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu1: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l): cache enabled
obio0 at mainbus0
clock0 at obio0 slot 0 offset 0x200000: mk48t08
timer0 at obio0 slot 0 offset 0x300000: delay constant 35
zs0 at obio0 slot 0 offset 0x100000 level 12 softpri 6
zstty0 at zs0 channel 0 (console i/o)
zstty1 at zs0 channel 1
zs1 at obio0 slot 0 offset 0x0 level 12 softpri 6
kbd0 at zs1 channel 0: baud rate 1200
ms0 at zs1 channel 1: baud rate 1200
fdc0 at obio0 slot 0 offset 0x700000 level 11 softpri 4: chip 82077
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
auxreg0 at obio0 slot 0 offset 0x800000
power0 at obio0 slot 0 offset 0xa01000 level 2
iommu0 at mainbus0 ioaddr 0xe0000000: version 0x3/0x1, page-size 4096, range 64MB
sbus0 at iommu0: clock = 25 MHz
dma0 at sbus0 slot 15 offset 0x400000: DMA rev 2
esp0 at dma0 slot 15 offset 0x800000 level 4: ESP200, 40MHz, SCSI ID 7
scsibus0 at esp0: 8 targets, 8 luns per target
ledma0 at sbus0 slot 15 offset 0x400010: DMA rev 2
le0 at ledma0 slot 15 offset 0xc00000 level 6: address 08:00:20:71:bf:af
le0: 8 receive buffers, 2 transmit buffers
bpp0 at sbus0 slot 15 offset 0x4800000 level 2 (ipl 3): DMA rev 2
bpp: hcr 0 ocr 200a tcr 8 or 0
SUNW,DBRIe at sbus0 slot 14 offset 0x10000 level 9 not configured
hme0 at sbus0 slot 2 offset 0x8c00000 level 4 (ipl 7): Sun Happy Meal Ethernet (SUNW,hme)
hme0: Ethernet address 08:00:20:71:bf:af
nsphy0 at hme0 phy 1: DP83840 10/100 media interface, rev. 1
nsphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
eccmemctl0 at mainbus0 ioaddr 0x0: version 0x0/0x2
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 1 lun 0: <WDIGTL, ENTERPRISE, 1.91> disk fixed
sd0: 4157 MB, 5720 cyl, 8 head, 186 sec, 512 bytes/sect x 8515173 sectors
sd0: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 6 lun 0: <TOSHIBA, XM-4101TASUNSLCD, 3424> cdrom removable
cd0: async, 8-bit transfers
root on sd0a dumps on sd0b
mountroot: trying cd9660...
mountroot: trying nfs...
mountroot: trying ffs...
root file system type: ffs
cpu0: booting secondary processors: cpu1
init: copying out path `/sbin/init' 11
panic: lockmgr: no context


The CPUs are SM71 modules, one with SuperCache 3.x, the other with
SuperCache 4.x.

The machine is the NAT router for my LAN. The LAN is connected via the
SBus hme interface; the Internet connection is ADSL, with the DSL modem
connected to the on-board le0 interface.

Besides doing IP-NAT, the machine also uses some ipf filtering rules,
and runs portsentry and a squid web proxy with the squidGuard filter.
I have recently started experimenting with altq, but had the same
crashes with the same frequency before that.

Under moderate network load the machine panics frequently, typically
after less than 24 hours of uptime.

The network load is mostly p2p traffic (overnet), some proxied
web surfing and a few ssh sessions.

The DSL upstream (128 kbps) is typically almost fully used, the
downstream (768 kbps) about halfway.

>How-To-Repeat:
Run an IP-NAT gateway on a multiprocessor sparc and use the network.

I cannot pinpoint any specific action that triggers an immediate
crash, but it happens repeatedly.

>Fix:
Unknown, except that locking seems to need some more work before
2.0 is released to the general public.

>Release-Note:
>Audit-Trail:
>Unformatted: