Subject: NetBSD/xen network problems (need help)
To: None <port-xen@NetBSD.org>
From: Mike M. Volokhov <mishka@NetBSD.org>
List: port-xen
Date: 01/23/2006 10:34:01
Hello!

I have a Xen 2.0.7 and NetBSD 3.0_STABLE setup (tested with sources
from 24.12.05 and 20.01.06) with four domUs, all configured according
to Ports/xen/howto. So far, so good. Now that we have got another
Internet connection, I want to set up one of the domUs as a
router/ipf/nat/ipsec/altq IPv4-only box. And this is where I've got a
lot of PITA :-O

Because the system has two interfaces (see details below), I've used
the following scheme (two more domains are attached to bridge0, but
they are not shown here):


[LAN] === <bge0 ----- dom0 ---- bge1> ===== [WAN]
            |                    |
          bridge0             bridge1
          |     |                |
     xvif1.0   xvif2.0        xvif2.1
          |     |                |
     xennet0   xennet0        xennet1
      dom1      dom2           dom2


It worked. The bge1/WAN configuration was added recently (dual NIC
mobo), whereas bge0 had been working for weeks. But now, often (once
every few minutes), all network interfaces just hang for a few
minutes. I've also noticed that the hangups somehow coincide with a
lot of duplicated packets produced by all domU machines. For example,
ping statistics show the following results:

  4650 packets transmitted, 2949 packets received, +27752 duplicates, 36.6% packet loss
  round-trip min/avg/max/stddev = 16.536/43066.884/94286.876/29929.792 ms
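
To put these numbers in perspective, here is a minimal Python sketch of
the arithmetic behind the summary line. The function name
packet_loss_percent is my own, not something from ping, and I'm
assuming the usual BSD ping convention that duplicates are counted
separately from received replies:

```python
def packet_loss_percent(transmitted: int, received: int) -> float:
    """Share of probes that never got a first reply, in percent."""
    if transmitted <= 0:
        raise ValueError("transmitted must be positive")
    return 100.0 * (transmitted - received) / transmitted


# Counters copied from the ping summary above.
transmitted, received, duplicates = 4650, 2949, 27752

loss = packet_loss_percent(transmitted, received)
dups_per_reply = duplicates / received

print(f"{loss:.1f}% packet loss")
print(f"{dups_per_reply:.1f} duplicates per received reply")
```

So besides the loss, every reply that did arrive was followed by
roughly nine duplicates on average.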

Here is the 'netstat -i | grep -e Name -e Link' output for dom0 and for
dom3 (dom5 is actually just a restarted dom2; please see the scheme
above):

Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
bge0  1500  <Link>        00:30:48:84:cf:98   172243   862   406381     0     0
bge1  1500  <Link>        00:30:48:84:cf:99        0     0       20     0     0
lo0   33192 <Link>                              1524     0     1524     0     0
bridg 1500  <Link>                            625468     0  1175737 181978     0
bridg 1500  <Link>                                17     0       27     0     0
xvif1 1500  <Link>        aa:00:00:21:be:8b   121192 87493   188969     0     0
xvif3 1500  <Link>        aa:00:00:05:e7:86   111157 88327   179549     0     0
xvif4 1500  <Link>        aa:00:00:27:74:18    69277 87680   144414     0     0
xvif5 1500  <Link>        aa:00:00:51:08:e4   166169 30719   252683     0     0
xvif5 1500  <Link>        aa:00:00:51:08:e5       11     0        2     0     0

Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
lo0   33192 <Link>                                28     0       28     0     0
xenne 1500  <Link>        aa:00:00:04:e7:86   179545     0   111152 88327     0

Please note that the errors on dom3/xennet0 match those on
dom0/xvif3.0. Also note that the errors come in bursts: everything
works fine, then I get a hangup, and then the stats show a lot of
errors. After that everything works well again.
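
The comparison between the two sides can be done mechanically. Below is
a rough Python sketch that parses a few of the <Link> rows quoted above
and compares the dom0 backend counters with the dom3 frontend ones. The
helper names (parse_netstat_links, first) are made up for this
illustration, and the parser only assumes that the last five columns
are Ipkts, Ierrs, Opkts, Oerrs and Colls, as in the header shown:

```python
DOM0 = """\
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
bge0  1500  <Link>        00:30:48:84:cf:98   172243   862   406381     0     0
bridg 1500  <Link>                            625468     0  1175737 181978     0
xvif3 1500  <Link>        aa:00:00:05:e7:86   111157 88327   179549     0     0
"""

DOM3 = """\
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
xenne 1500  <Link>        aa:00:00:04:e7:86   179545     0   111152 88327     0
"""

KEYS = ("ipkts", "ierrs", "opkts", "oerrs", "colls")


def parse_netstat_links(text):
    """Return a list of (name, counters) for the <Link> rows of netstat -i."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if "<Link>" not in fields:
            continue  # header or non-link row
        # The Address column can be empty (lo0, bridge), so the counters
        # are taken from the right-hand end of the row.
        counters = dict(zip(KEYS, (int(f) for f in fields[-5:])))
        rows.append((fields[0], counters))
    return rows


def first(rows, name):
    """First row with the given (possibly truncated) interface name."""
    return next(c for n, c in rows if n == name)


dom0 = parse_netstat_links(DOM0)
dom3 = parse_netstat_links(DOM3)

# List every interface that has accumulated errors.
for name, c in dom0 + dom3:
    if c["ierrs"] or c["oerrs"]:
        print(f"{name:6} ierrs={c['ierrs']:7} oerrs={c['oerrs']:7}")

backend = first(dom0, "xvif3")
frontend = first(dom3, "xenne")
print("backend Ierrs == frontend Oerrs:",
      backend["ierrs"] == frontend["oerrs"])
```

A list of rows is used rather than a dictionary because netstat
truncates the names here, so several interfaces (bridg, xvif5) show up
under the same name.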

Previously (9 days uptime, daily output):

Name          Ipkts  Ierrs      Opkts  Oerrs  Colls
bge0       19158819      0   20321849      0      0
bge1           4813      0         53      0      0
lo0           42589      0      42589      0      0
bridge0    39464623      0   43880371      0      0
xvif1.0      278513      0    1390149      0      0
xvif3.0      329503      0    1419529      0      0
xvif4.0      529427      0    1291286      0      0
bridge1        4792      0       4762      0      0
xvif13.0   15701265      0   14390989      0      0
xvif13.1          0      0       4599      0      0
xvif15.0          0      0     407593      0      0

Also, I've run into kernel panics on a domU machine (see below; btw,
how do I save a core dump? "sync" isn't working - dump device bad;
/netbsd is a copy of the kernel that was actually booted).

WTF is going on here?! What am I doing wrong? Any help or advice on how
to debug this would be very much appreciated.

--
Mishka.


P.S. Here are some details about the physical interfaces:

bge0 at pci2 dev 0 function 0: Broadcom BCM5721 Gigabit Ethernet
bge0: interrupting at irq 16, event channel 7
bge0: ASIC BCM5751 A1 (0x4101), Ethernet address 00:30:48:84:cf:98
brgphy0 at bge0 phy 1: BCM5750 1000BASE-T media interface, rev. 0
bge1 at pci3 dev 0 function 0: Broadcom BCM5721 Gigabit Ethernet
bge1: interrupting at irq 17, event channel 12
bge1: ASIC BCM5751 A1 (0x4101), Ethernet address 00:30:48:84:cf:99
brgphy1 at bge1 phy 1: BCM5750 1000BASE-T media interface, rev. 0

Panic message:

panic: m_makewritable: length changed
Stopped at      netbsd:cpu_Debugger+0x4:        leave
cpu_Debugger(c03f8d38,38,c03f8ce8,c03f8d38,0) at netbsd:cpu_Debugger+0x4
panic(c0331900,1,0,0,0) at netbsd:panic+0x121
m_makewritable(c03f8d38,0,3b9aca00,1,c0871500) at netbsd:m_makewritable+0x6b
fr_check_wrapper(0,c03f8d38,c072d038,1,c0871800) at netbsd:fr_check_wrapper+0x1b
pfil_run_hooks(c036e9e0,c03f8da0,c072d038,1,c03f8dc8) at netbsd:pfil_run_hooks+0x6e
ip_input(c0871800,c01142b2,9,202,0) at netbsd:ip_input+0x93b
ipintr(fffffffe,20,4,1,c03f8e10) at netbsd:ipintr+0xad
DDB lost frame for netbsd:Xsoftnet+0x4f, trying 0xc03f8dd0
Xsoftnet() at netbsd:Xsoftnet+0x4f
--- interrupt ---
emul_freebsd_object(c03f8e4c,0,3b9a0000,ca00) at 0xc03fe000
Bad frame pointer: 0xc02ad848
ds          0x11
es          0x11
fs          0x31
gs          0x11
edi         0x1
esi         0x100
ebp         0xc03f8c98  emul_freebsd_object+0x6f9c4
ebx         0x1
edx         0xc03fe000  emul_freebsd_object+0x74d2c
ecx         0xfffffff8
eax         0x9fd
eip         0xc02ab8fc  cpu_Debugger+0x4
cs          0x9
eflags      0x202
esp         0xc03f8c98  emul_freebsd_object+0x6f9c4
ss          0x11
netbsd:cpu_Debugger+0x4:        leave
Stopped at      netbsd:cpu_Debugger+0x4:        leave
db>