Subject: kern/8988: 'fatal page fault' in network stack on i386
To: None <gnats-bugs@gnats.netbsd.org>
From: Manuel Bouyer <Manuel.Bouyer@asim.lip6.fr>
List: netbsd-bugs
Date: 12/13/1999 07:43:21
>Number: 8988
>Category: kern
>Synopsis: 'fatal page fault' in network stack on i386
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people (Kernel Bug People)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Dec 13 07:42:01 1999
>Last-Modified:
>Originator:
>Organization:
LIP6
>Release: -release as of Dec 1
>Environment:
System: NetBSD jazz 1.4.1 NetBSD 1.4.1 (JAZZ) #0: Wed Dec 1 16:27:59 MET 1999 bouyer@jazz:/home/src/sys/arch/i386/compile/JAZZ i386
>Description:
This machine (Celeron 400 with 64MB RAM) is an SMB server and router
between 2 LANs, with a quad-port DEC 21143 board (100Mb/s, only 2 ports
used). Most Samba shares are NFS-mounted, so network traffic is quite
high, but there are a few shares from local IDE disks too.
The two network interfaces in use are both on IRQ 9 (suboptimal, but the
BIOS set it up this way and I can't change it). Complete dmesg appended.
The machine has been stable since it went into use (2 or 3 months ago now),
but today I got 2 crashes about 2 minutes apart in the network code,
with a:
trap type 6 code 0 eip f012d257 cs 8 eflags 10286 cr2 deadbeff cpl e000d6e2
I couldn't find any occurrence of beff in the sources, so I guess that
the last byte was overwritten with 0xff. The kernel is compiled with
DIAGNOSTIC.
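The guess about cr2 can be checked with simple arithmetic. The sketch
below assumes the original value was the 0xdeadbeef pattern (an
assumption; the report only says that "beff" does not appear in the
sources). clobber_low_byte() is a hypothetical helper written for this
illustration, not a kernel function.

```c
#include <stdint.h>

/*
 * Replace the least significant byte of a 32-bit value with 0xff.
 * On little-endian i386 this is the byte at the lowest address, so a
 * stray one-byte store of 0xff to the start of the word has this effect.
 */
uint32_t
clobber_low_byte(uint32_t value)
{
	return (value & ~(uint32_t)0xff) | 0xff;
}
```

Under that assumption, clobber_low_byte(0xdeadbeef) yields 0xdeadbeff,
which matches the cr2 value in the trap message.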
trace for the first crash:
(gdb) where
#0 0xf01bef5e in sys_sysarch ()
#1 0xf01b8bdf in cpu_reboot ()
#2 0xf0121d20 in log ()
#3 0xf01bf1ed in trap ()
#4 0xf0100cc9 in calltrap ()
#5 0xf01d15fd in tulip_intr_handler ()
#6 0xf01d17b1 in tulip_intr_normal ()
#7 0xf0101530 in Xintr9 ()
#8 0xf0148da9 in arpresolve ()
#9 0xf0143c6b in ether_output ()
#10 0xf014efe3 in ip_output ()
#11 0xf014e6de in ip_forward ()
#12 0xf014d66e in ipintr ()
#13 0xf0101cd8 in Xsoftnet ()
#14 0xf011c01d in tsleep ()
#15 0xf01244cf in sys_select ()
#16 0xf01bf7fa in syscall ()
#17 0xf0100d75 in syscall1 ()
(gdb) list *0xf01d15fd
0xf01d15fd is in tulip_intr_handler (../../../../dev/pci/if_de.c:3997).
3992 ;
3993 TULIP_CSR_WRITE(sc, csr_status, TULIP_STS_RXSTOPPED);
3994 sc->tulip_flags |= TULIP_RXIGNORE;
3995 }
3996 tulip_rx_intr(sc);
3997 if (sc->tulip_flags & TULIP_RXIGNORE) {
3998 /*
3999 * Restart the receiver.
4000 */
4001 sc->tulip_flags &= ~TULIP_RXIGNORE;
second crash:
#0 0xf01bef5e in sys_sysarch ()
#1 0xf01b8bdf in cpu_reboot ()
#2 0xf0121d20 in log ()
#3 0xf01bf1ed in trap ()
#4 0xf0100cc9 in calltrap ()
#5 0xf0148da9 in arpresolve ()
#6 0xf0143c6b in ether_output ()
#7 0xf014c29c in ipflow_fastforward ()
#8 0xf01440dd in ether_input ()
#9 0xf01d0ce5 in tulip_rx_intr ()
#10 0xf01d15fd in tulip_intr_handler ()
#11 0xf01d17b1 in tulip_intr_normal ()
#12 0xf0101530 in Xintr9 ()
#13 0xf0148da9 in arpresolve ()
#14 0xf0143c6b in ether_output ()
#15 0xf014efe3 in ip_output ()
#16 0xf014e6de in ip_forward ()
#17 0xf014d66e in ipintr ()
#18 0xf0101cd8 in Xsoftnet ()
#19 0xf011c01d in tsleep ()
#20 0xf011743f in physio ()
#21 0xf01c0b69 in wdread ()
#22 0xf0140817 in spec_read ()
#23 0xf01a5de9 in ufsspec_read ()
#24 0xf013ed41 in vn_read ()
#25 0xf01238dc in dofileread ()
#26 0xf0123853 in sys_read ()
#27 0xf01bf7fa in syscall ()
#28 0xf0100d75 in syscall1 ()
(gdb) list *0xf0148da9
0xf0148da9 is in arpresolve (../../../../netinet/if_arp.c:435).
430 * There is an arptab entry, but no ethernet address
431 * response yet. Replace the held mbuf with this
432 * latest one.
433 */
434 if (la->la_hold)
435 m_freem(la->la_hold);
436 la->la_hold = m;
437 /*
438 * Re-send the ARP request when appropriate.
439 */
I'm not sure the call to arpresolve() from ipflow_fastforward() is
correct (shouldn't the route already be known when a packet goes through
ipflow_fastforward()?)
The first trace is curious too: why would
'sc->tulip_flags & TULIP_RXIGNORE' panic? Hmm, it's just after
tulip_rx_intr(sc), so maybe the debugger got confused and both panics
really involve a call to arpresolve() from ipflow_fastforward().
Could there be a race condition between the fast-forward cache and
ARP expiration?
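One possible reading of the second trace is that arpresolve() was
interrupted around line 435 and then entered again from the interrupt
path for the same ARP entry. The user-space model below is only a sketch
of that hypothesis, not a confirmed diagnosis. Every name in it
(toy_mbuf, toy_llinfo, toy_freem, toy_run and so on) is hypothetical,
and the "interrupt" is a function pointer that fires once, right after
the held packet has been freed and before the pointer is replaced.

```c
#include <stddef.h>

/* Toy stand-ins for an mbuf and an ARP entry; all names are hypothetical. */
struct toy_mbuf {
	int free_count;		/* how many times this buffer was "freed" */
};

struct toy_llinfo {
	struct toy_mbuf *hold;	/* models la->la_hold */
};

typedef void (*toy_hold_fn)(struct toy_llinfo *, struct toy_mbuf *);

/* Hook that models an interrupt arriving just after a free. */
static void (*toy_interrupt)(void);

static void
toy_freem(struct toy_mbuf *m)
{
	void (*intr)(void) = toy_interrupt;

	m->free_count++;
	toy_interrupt = NULL;	/* the modelled interrupt fires only once */
	if (intr != NULL)
		intr();
}

/* Same order as the listing: free the held packet, then replace it. */
static void
toy_hold_unsafe(struct toy_llinfo *la, struct toy_mbuf *m)
{
	if (la->hold)
		toy_freem(la->hold);
	la->hold = m;
}

/* Alternative order: detach the held packet first, then free it. */
static void
toy_hold_detach_first(struct toy_llinfo *la, struct toy_mbuf *m)
{
	struct toy_mbuf *old = la->hold;

	la->hold = m;
	if (old)
		toy_freem(old);
}

/* State used by the modelled interrupt to re-enter the hold routine. */
static struct toy_llinfo *intr_la;
static struct toy_mbuf *intr_m;
static toy_hold_fn intr_hold;

static void
toy_nested_call(void)
{
	intr_hold(intr_la, intr_m);
}

/*
 * Run one outer call that is interrupted by one nested call on the same
 * entry, and return how many times the originally held packet was freed.
 */
int
toy_run(toy_hold_fn hold)
{
	struct toy_mbuf old = { 0 }, outer = { 0 }, inner = { 0 };
	struct toy_llinfo la = { &old };

	intr_la = &la;
	intr_m = &inner;
	intr_hold = hold;
	toy_interrupt = toy_nested_call;
	hold(&la, &outer);
	return old.free_count;
}
```

In this model toy_run(toy_hold_unsafe) frees the originally held packet
twice, while toy_run(toy_hold_detach_first) frees it once. The model
only fires the interrupt at one fixed point; in a real kernel an
interrupt can arrive anywhere, so reordering alone would probably not be
enough and the update would also need to run with network interrupts
blocked.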
NetBSD 1.4.1 (JAZZ) #0: Wed Dec 1 16:27:59 MET 1999
bouyer@jazz:/home/src/sys/arch/i386/compile/JAZZ
cpu0: family 6 model 6 step 0
cpu0: Intel Pentium II (Celeron) (686-class)
real mem = 66703360
avail mem = 60280832
using 839 buffers containing 3436544 bytes of memory
mainbus0 (root)
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o enabled, memory enabled
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82443BX Host Bridge/Controller (rev. 0x03)
ppb0 at pci0 dev 1 function 0: Intel 82443BX AGP Interface (rev. 0x03)
pci1 at ppb0 bus 1
pci1: i/o enabled, memory enabled
vga0 at pci1 dev 0 function 0: ATI Technologies product 0x4742 (rev. 0x5c)
wsdisplay0 at vga0: console (80x25, vt100 emulation)
pcib0 at pci0 dev 4 function 0
pcib0: Intel 82371AB PCI-to-ISA Bridge (PIIX4) (rev. 0x02)
pciide0 at pci0 dev 4 function 1: Intel 82371AB IDE controller (PIIX4)
pciide0: bus-master DMA support present
pciide0: primary channel wired to compatibility mode
wd0 at pciide0 channel 0 drive 0: <QUANTUM FIREBALL CR4.3A>
wd0: drive supports 16-sector pio transfers, lba addressing
wd0: 4110MB, 14848 cyl, 9 head, 63 sec, 512 bytes/sect x 8418816 sectors
wd0: 32-bits data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4, Ultra-DMA mode 3, Ultra-DMA mode 2
wd1 at pciide0 channel 0 drive 1: <IBM-DJNA-372200>
wd1: drive supports 16-sector pio transfers, lba addressing
wd1: 21557MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 44150400 sectors
wd1: 32-bits data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4, Ultra-DMA mode 3, Ultra-DMA mode 2
pciide0: primary channel interrupting at irq 14
pciide0: secondary channel wired to compatibility mode
wd2 at pciide0 channel 1 drive 0: <QUANTUM FIREBALL CR8.4A>
wd2: drive supports 16-sector pio transfers, lba addressing
wd2: 8063MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 16514064 sectors
wd2: 32-bits data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4, Ultra-DMA mode 3, Ultra-DMA mode 2
wd3 at pciide0 channel 1 drive 1: <IBM-DJNA-372200>
wd3: drive supports 16-sector pio transfers, lba addressing
wd3: 21557MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 44150400 sectors
wd3: 32-bits data port
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4, Ultra-DMA mode 3, Ultra-DMA mode 2
pciide0: secondary channel interrupting at irq 15
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfers)
wd1(pciide0:0:1): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfers)
wd2(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfers)
wd3(pciide0:1:1): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfers)
Intel 82371AB USB Host Controller (PIIX4) (USB serial bus, revision 0x01) at pci0 dev 4 function 2 not configured
Intel 82371AB Power Management Controller (PIIX4) (miscellaneous bridge, revision 0x02) at pci0 dev 4 function 3 not configured
ppb1 at pci0 dev 10 function 0: Digital Equipment DECchip 21152 PCI-PCI Bridge (rev. 0x03)
pci2 at ppb1 bus 2
pci2: i/o enabled, memory enabled
de0 at pci2 dev 4 function 0
de0: interrupting at irq 9
de0: 21143 [10-100Mb/s] pass 4.1
de0: address 00:80:c8:4e:f4:14
de1 at pci2 dev 5 function 0
de1: interrupting at irq 9
de1: 21143 [10-100Mb/s] pass 4.1
de1: address 00:80:c8:4e:f4:15
de2 at pci2 dev 6 function 0
de2: interrupting at irq 10
de2: 21143 [10-100Mb/s] pass 4.1
de2: address 00:80:c8:4e:f4:16
de3 at pci2 dev 7 function 0
de3: interrupting at irq 5
de3: 21143 [10-100Mb/s] pass 4.1
de3: address 00:80:c8:4e:f4:17
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
boca0 at isa0 port 0x100-0x13f irq 11
com2 at boca0 slave 0: ns16550a, working fifo
com3 at boca0 slave 1: ns16550a, working fifo
com4 at boca0 slave 2: ns16550a, working fifo
com5 at boca0 slave 3: ns16550a, working fifo
com6 at boca0 slave 4: ns16550a, working fifo
com7 at boca0 slave 5: ns16550a, working fifo
com8 at boca0 slave 6: ns16550a, working fifo
com9 at boca0 slave 7: ns16550a, working fifo
lpt0 at isa0 port 0x378-0x37b irq 7
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0
pcppi0 at isa0 port 0x61
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
isapnp0: no ISA Plug 'n Play devices found
biomask c040 netmask c660 ttymask d6e2
wscons: wskbd0 glued to wsdisplay0 (console)
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
de0: enabling Full Duplex 100baseTX port
de1: enabling 100baseTX port
de2: autosense failed: cable problem?
de3: autosense failed: cable problem?
de0: enabling 100baseTX port
de1: enabling 100baseTX port
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
de1: abnormal interrupt: transmit underflow (raising TX threshold to 96|256)
de1: abnormal interrupt: transmit underflow (raising TX threshold to 128|512)
de1: abnormal interrupt: transmit underflow (raising TX threshold to 160|1024)
>How-To-Repeat:
Doesn't seem to be easy to repeat. It seems to be related to ARP
activity; that is, if the machine can fill in its ARP table it
will then be stable :)
>Fix:
Unknown. I didn't see it with the previous kernel, built from older sources
(which I upgraded because of IDE problems). Check the recent changes
to the network code?
>Audit-Trail:
>Unformatted: