Subject: port-alpha/35448: memory management fault trap during heavy network I/O
To: None <port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: None <agrier@poofygoof.com>
List: netbsd-bugs
Date: 01/20/2007 03:55:00
>Number:         35448
>Category:       port-alpha
>Synopsis:       memory management fault trap during heavy network I/O
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    port-alpha-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 20 03:55:00 +0000 2007
>Originator:     agrier@poofygoof.com
>Release:        NetBSD 4.99.8
>Organization:
  Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
>Environment:
System: NetBSD arwen.poofy.goof.com 4.99.8 NetBSD 4.99.8 (ARWEN) #0: Thu Jan 18 23:03:09 PST 2007 agrier@arwen.poofy.goof.com:/var/obj/ARWEN alpha
Architecture: alpha
Machine: alpha

ARWEN is an AlphaServer 1000A 5/400.

the ARWEN kernel is GENERIC plus a hardcoded line to attach root at ld0.
>Description:

- the trap:

CPU 0: fatal kernel trap:

CPU 0    trap entry = 0x2 (memory management fault)
CPU 0    a0         = 0xfffffe0108266000
CPU 0    a1         = 0x1
CPU 0    a2         = 0x0
CPU 0    pc         = 0xfffffc00007ecde0
CPU 0    ra         = 0xfffffc000035f9ac
CPU 0    pv         = 0x0
CPU 0    curlwp    = 0xfffffc000fcd2660
CPU 0        pid = 335, comm = nfsio

panic: trap
Begin traceback...
alpha trace requires known PC =eject=
End traceback...
syncing disks... 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 giving up
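
for anyone reading the trap dump above: a rough sketch of how the a0/a1/a2 values can be decoded. this assumes the usual Alpha convention for memory management faults (a0 = faulting virtual address, a1 = MMCSR code, a2 = access type) and the MMCSR values and K0SEG/K1SEG bases as I recall them from the NetBSD alpha headers, so treat the constants as assumptions and check them against alpha_cpu.h. the function names are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed MMCSR codes (a1), as recalled from NetBSD's alpha_cpu.h. */
enum {
    MMCSR_INVALTRANS = 0,   /* translation not valid */
    MMCSR_ACCESS     = 1,   /* access violation */
    MMCSR_FOR        = 2,   /* fault on read */
    MMCSR_FOE        = 3,   /* fault on execute */
    MMCSR_FOW        = 4    /* fault on write */
};

#define PAGE_SIZE_BYTES 8192ULL                 /* "8192 byte page size" in dmesg */
#define K0SEG_BASE      0xfffffc0000000000ULL   /* assumed direct-mapped segment base */
#define K1SEG_BASE      0xfffffe0000000000ULL   /* assumed mapped kernel segment base */

/* Hypothetical helper: name the MMCSR code passed in a1. */
const char *mmcsr_name(uint64_t a1)
{
    switch (a1) {
    case MMCSR_INVALTRANS: return "invalid translation";
    case MMCSR_ACCESS:     return "access violation";
    case MMCSR_FOR:        return "fault on read";
    case MMCSR_FOE:        return "fault on execute";
    case MMCSR_FOW:        return "fault on write";
    default:               return "unknown";
    }
}

/* Hypothetical helper: name the access type passed in a2
 * (assumed: -1 = instruction fetch, 0 = load, 1 = store). */
const char *cause_name(int64_t a2)
{
    switch (a2) {
    case -1: return "instruction fetch";
    case 0:  return "load";
    case 1:  return "store";
    default: return "unknown";
    }
}

/* Classify a kernel virtual address by segment. */
const char *segment_name(uint64_t va)
{
    if (va >= K1SEG_BASE)
        return "K1SEG";
    if (va >= K0SEG_BASE)
        return "K0SEG";
    return "user/other";
}

/* Nonzero if va is the first byte of a page. */
int is_page_aligned(uint64_t va)
{
    return (va & (PAGE_SIZE_BYTES - 1)) == 0;
}

/* Print a one-line summary of a memory management fault. */
void describe_fault(uint64_t a0, uint64_t a1, int64_t a2)
{
    printf("%s on %s at 0x%016llx (%s%s)\n",
        mmcsr_name(a1), cause_name(a2), (unsigned long long)a0,
        segment_name(a0), is_page_aligned(a0) ? ", page aligned" : "");
}
```

under those assumptions, describe_fault(0xfffffe0108266000ULL, 1, 0) describes this trap as an access violation on a load from a page-aligned K1SEG address, which would be consistent with a read running off the end of one page into the next.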

- the backtrace:

(gdb) bt
#0  0xfffffc00007df888 in dumpsys ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/machdep.c:1229
#1  0xfffffc00007dfdb0 in cpu_reboot ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/machdep.c:1048
#2  0xfffffc0000644a50 in panic ()
    at /projects/NetBSD/src/sys/kern/subr_prf.c:246
#3  0xfffffc00007e7248 in trap ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/trap.c:601
#4  0xfffffc00003003e8 in XentMM ()
    at /projects/NetBSD/src/sys/arch/alpha/alpha/locore.s:492
#5  0xfffffc000035f9ac in in_delayed_cksum ()
    at /projects/NetBSD/src/sys/netinet/ip_output.c:1123
can not access 0xfffffffd, invalid translation (invalid L1 PTE)
can not access 0xfffffffd, invalid translation (invalid L1 PTE)
Cannot access memory at address 0xfffffffffffffffd

- some poking:

(gdb) frame 5
#5  0xfffffc000035f9ac in in_delayed_cksum ()
    at /projects/NetBSD/src/sys/netinet/ip_output.c:1123
1123            csum = in4_cksum(m, 0, offset, ntohs(ip->ip_len) - offset);
(gdb) proc 0xfffffc000fcd2660 # curlwp from the trap
(gdb) bt
#0  0xfffffc000062a730 in mi_switch ()
    at /projects/NetBSD/src/sys/kern/kern_synch.c:997
(gdb) list *0xfffffc00007ecde0 # pc from the trap
0xfffffc00007ecde0 is in in4_cksum
(/projects/NetBSD/src/sys/netinet/in4_cksum.c:175).

- dmesg

NetBSD 4.99.8 (ARWEN) #0: Thu Jan 18 23:03:09 PST 2007
	agrier@arwen.poofy.goof.com:/var/obj/ARWEN
AlphaServer 1000A 5/400, 400MHz, s/n 
8192 byte page size, 1 processor.
total memory = 256 MB
(2016 KB reserved for PROM, 254 MB used by NetBSD)
avail memory = 241 MB
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21164A-2
cpu0: Architecture extensions: 1<BWX>
cia0 at mainbus0: DECchip 2117x Core Logic Chipset (ALCOR/ALCOR2), pass 3
cia0: extended capabilities: 21<DWEN,BWEN>
cia0: using BWX for PCI config access
pci0 at cia0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pceb0 at pci0 dev 7 function 0: Intel 82375EB/SB PCI-EISA Bridge (rev. 0x05)
ppb0 at pci0 dev 8 function 0: Digital Equipment DC21050 PCI-PCI Bridge (rev. 0x02)
pci1 at ppb0 bus 2
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
isp0 at pci1 dev 0 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at dec_1000a irq 0
scsibus0 at isp0: 16 targets, 8 luns per target
tlp0 at pci0 dev 11 function 0: DECchip 21140 Ethernet, pass 1.2
tlp0: interrupting at dec_1000a irq 1
tlp0: DEC DE500-XA, Ethernet address 00:00:f8:02:06:a5
tlp0: 10baseT, 100baseTX, 100baseTX-FDX, 10baseT-FDX
mlx0 at pci0 dev 12 function 0: Mylex RAID (v2 interface)
mlx0: interrupting at dec_1000a irq 3
mlx0: DAC960P/PD, 3 channels, firmware 2.70-0-00, 32MB RAM
ld0 at mlx0 unit 0: RAID5, online
ld0: 16380 MB, 8320 cyl, 64 head, 63 sec, 512 bytes/sect x 33546240 sectors
ld1 at mlx0 unit 1: RAID5, online
ld1: 32768 MB, 8322 cyl, 128 head, 63 sec, 512 bytes/sect x 67108864 sectors
ld2 at mlx0 unit 2: RAID5, online
ld2: 32768 MB, 8322 cyl, 128 head, 63 sec, 512 bytes/sect x 67108864 sectors
ld3 at mlx0 unit 3: RAID5, online
ld3: 4536 MB, 2304 cyl, 64 head, 63 sec, 512 bytes/sect x 9289728 sectors
eisa0 at pceb0
eisa0: can't map I/O space for slot 9
isa0 at pceb0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
attimer0 at isa0 port 0x40-0x43: AT Timer
vga0 at isa0 port 0x3b0-0x3df iomem 0xa0000-0xbffff
wsdisplay0 at vga0 kbdmux 1
wsmux1: connecting to wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker (CPU-intensive output)
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
pcppi0: attached to attimer0
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <DEC, RZ28M    (C) DEC, 0568> disk fixed
sd0: async, 8-bit transfers
sd0: 2007 MB, 3045 cyl, 16 head, 84 sec, 512 bytes/sect x 4110480 sectors
sd0: sync (100.00ns offset 12), 8-bit (10.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 4 lun 0: <DEC, RRD45   (C) DEC, 1645> cdrom removable
cd0: async, 8-bit transfers
WARNING: can't figure what device matches "RAID 0 12 0 0 0 0 0"
root on ld0a dumps on sd0b

- other misc foo

ps won't grok the coredump:

arwen$ ps -N netbsd.gdb -M /var/crash/netbsd.0.core
ps: can't read proc credentials at 0xfffffc000ade3480: Undefined error: 0

>How-To-Repeat:
the trap seems to be triggered by syncing a remotely mounted mailbox
from within pine or mutt (the faulting process in the trap above is
nfsio).
>Fix:
unknown; first figure out what is causing the trap.  it may be a kernel
stack smash, based on previous port-alpha mailing list entries.  perhaps
building a kernel with

options KSTACK_CHECK_MAGIC

is in order?
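
for reference, the general idea behind a stack magic check can be sketched in userland C: fill the stack area with a known pattern up front, then later count how much of the pattern survives at the far end. this is only an illustration of the technique, not the kernel's actual code; the pattern value, function names, and word size here are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_MAGIC 0xdeadbeefU   /* hypothetical fill pattern */

/* Fill a not-yet-used stack area with the magic pattern. */
void stack_fill_magic(uint32_t *stack, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        stack[i] = STACK_MAGIC;
}

/*
 * Count the words at the low end that still hold the pattern.
 * The stack is assumed to grow downward, so untouched words sit
 * at the lowest addresses (lowest indices).
 */
size_t stack_unused_words(const uint32_t *stack, size_t nwords)
{
    size_t i = 0;

    while (i < nwords && stack[i] == STACK_MAGIC)
        i++;
    return i;
}

/* Nonzero if the stack was used all the way down to its limit. */
int stack_overflow_suspected(const uint32_t *stack, size_t nwords)
{
    return stack_unused_words(stack, nwords) == 0;
}
```

a check like this only reports that the pattern at the stack limit was overwritten; it cannot say which code path did it, so it would narrow down a stack smash rather than pinpoint it.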

>Unformatted:
 sources CVSed 2007-01-18