Subject: Debugging nic related kernel panic
To: None <tech-kern@netbsd.org>
From: Rafael Almeida <almeidaraf@gmail.com>
List: tech-kern
Date: 03/02/2007 01:43:04
Hello,

I'm running NetBSD 3.0.1 on a Pentium MMX 233 with 32 MB of RAM. It has
two NICs, a Myson MTD803 and a VIA Rhine. The machine works as a router
(I also use it as an ssh, http and darcs server), and 3 computers are
connected to it behind NAT. My pf.conf contains the following:

  ext_if="pppoe0"
  int_if="mtd0"

  scrub out all max-mss 1440

  nat on $ext_if from !($ext_if) -> ($ext_if:0) static-port

  rdr pass on $ext_if proto { tcp, udp } to port 6112:6119 -> 192.168.0.3
  rdr pass on $ext_if proto tcp to port 4662 -> 192.168.0.2
  rdr pass on $ext_if proto udp to port 4672 -> 192.168.0.2
  rdr pass on $ext_if proto { tcp, udp } to port 7846 -> 192.168.0.2

  pass in all keep state
  pass out all keep state

The vr0 device is the one configured to connect to the internet using
PPPoE. Ports 4662 and 4672 are used by the amule program (a clone of
emule, a p2p program) on the 192.168.0.2 machine.

After running amule for a while I get a kernel panic, not on the machine
running amule, but on the router running NetBSD. It seems to be related
to the mtd NIC; this is the backtrace I get from gdb on the crash dump:

  #0  0x01f00000 in ?? ()
  #1  0xc02d483f in cpu_reboot (howto=256, bootstr=0x0)
      at ../../../../arch/i386/i386/machdep.c:751
  #2  0xc02557d3 in panic (fmt=0xc0364715 "trap")
      at ../../../../kern/subr_prf.c:242
  #3  0xc02dc33d in trap (frame=0xc0447ca8)
      at ../../../../arch/i386/i386/trap.c:336
  #4  0xc0102d47 in calltrap ()
  #5  0xc0255082 in pool_cache_get_paddr (pc=0xc03ed480, flags=0,
      pap=0xc06bb554) at ../../../../kern/subr_pool.c:1920
  #6  0xc01acdb7 in mtd_get (sc=0xc0587000, index=50, totlen=1514)
      at ../../../../dev/ic/mtd803.c:647
  #7  0xc01ad0cb in mtd_rxirq (sc=0xc0587000)
      at ../../../../dev/ic/mtd803.c:723
  #8  0xc01ad2e3 in mtd_irq_h (args=0xc0587000)
      at ../../../../dev/ic/mtd803.c:865
  #9  0xc01016e1 in Xintr_legacy5 ()

I don't fully understand how calltrap gets called here, but the problem
looks like an access to an invalid memory address. After all, this is
line 1920 of kern/subr_pool.c:

  object = pool_get(pc->pc_pool, flags);

So my guess is that pc points to some invalid address and the kernel
panics when it's dereferenced. Typing:

  dmesg -N /var/crash/netbsd.1 -M /var/crash/netbsd.1.core

Gives me this output:

  vr0: unable to load Tx buffer, error = 22 (this message repeats several times)
  fatal page fault in supervisor mode
  trap type 6 code 2 eip c025416b cs 8 eflags 10206 cr2 4172c2aa ilevel a
  panic: trap
  syncing disks... done

  dumping to dev 0,1 offset 132744
  dump 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
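
To make the "trap type 6 code 2 ... cr2 4172c2aa" line easier to read,
here is a minimal sketch that decodes the fault code. It assumes the
"code" field is the standard x86 page-fault error code (bit 0: protection
violation vs. page not present, bit 1: write vs. read, bit 2: user vs.
supervisor) and that cr2 holds the faulting address. The helper name
describe_pagefault and the PF_* macros are my own, not anything from the
kernel sources.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code (assumed layout, see above). */
#define PF_PROT  0x1u  /* set: protection violation, clear: page not present */
#define PF_WRITE 0x2u  /* set: write access, clear: read access */
#define PF_USER  0x4u  /* set: fault in user mode, clear: supervisor mode */

/*
 * Hypothetical helper (not a kernel function): formats a one-line
 * description of a page fault into buf and returns snprintf's result.
 */
static int
describe_pagefault(unsigned int code, unsigned long cr2, char *buf, size_t len)
{
	return snprintf(buf, len, "%s %s access to 0x%08lx, %s",
	    (code & PF_USER) ? "user" : "supervisor",
	    (code & PF_WRITE) ? "write" : "read",
	    cr2,
	    (code & PF_PROT) ? "protection violation" : "page not present");
}
```

If that assumption holds, calling describe_pagefault(0x2, 0x4172c2aaUL, ...)
with the values from my dump describes a supervisor-mode write to an
unmapped address, 0x4172c2aa.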

Although my other NIC (vr0) reports those "error = 22" messages, they
don't seem fatal, and they don't seem related to the kernel panic, since
the backtrace only shows functions related to mtd.
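
For what it's worth, here is a small sketch for turning that "error = 22"
into a name. It assumes the number vr0 prints is an ordinary errno value,
which I have not verified in the driver source; errno_text is a
hypothetical helper of mine.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: maps a numeric error to its message text.
 * Assumes the driver's "error = N" is a standard errno value.
 */
static const char *
errno_text(int err)
{
	return strerror(err);
}

/* Prints the decoded form of a driver error, e.g. for err = 22. */
static void
print_driver_error(int err)
{
	printf("error = %d (%s)%s\n", err, errno_text(err),
	    err == EINVAL ? " [EINVAL]" : "");
}
```

Under that assumption, 22 is EINVAL (invalid argument).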

I'm not sure where to go from here to figure out what is actually
wrong.