Subject: Re: NetBSD/xen network problems (need help)
To: Mike M. Volokhov <mishka@NetBSD.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-xen
Date: 01/23/2006 12:41:54
On Mon, Jan 23, 2006 at 10:34:01AM +0200, Mike M. Volokhov wrote:
> Hello!
> 
> I have a Xen 2.0.7 and NetBSD 3.0_STABLE (tested on 24.12.05 and
> 20.01.06 sources) setup with four domUs, all configured according to
> Ports/xen/howto. So far, so good. Now that we have got another
> Internet connection, I want to set up one of the domUs as a router/ipf/nat/
> ipsec/altq ipv4-only box. And this is where I've got a lot of PITA :-O
> 
> Because the system has two interfaces (see details below), I've used the
> following scheme (two more domains are attached to bridge0, but
> not shown here):
> 
> 
> [LAN] === <bge0 ----- dom0 ---- bge1> ===== [WAN]
>             |                    |
>           bridge0             bridge1
>           |     |                |
>      xvif1.0   xvif2.0        xvif2.1
>           |     |                |
>      xennet0   xennet0        xennet1
>       dom1      dom2           dom2

OK, I have similar setups (one of my domUs has 6 interfaces, connected to
6 bridges).

> 
> 
> It worked. The bge1/WAN configuration was added recently (dual NIC
> mobo), after bge0 had been working for weeks. But now, often (once every
> few minutes), all network interfaces just hang for a few
> minutes. I've also noticed that the hangs somehow coincide with a
> lot of duplicated packets produced by all domU machines. For example,
> ping statistics show the following results:
> 
>   4650 packets transmitted, 2949 packets received, +27752 duplicates, 36.6% packet loss
>   round-trip min/avg/max/stddev = 16.536/43066.884/94286.876/29929.792 ms

From where to where is this ping? Also, it would be interesting
to run tcpdump on the other end, to see in which direction the
packets are duplicated.
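
As a rough sanity check on the ping summary quoted above, the figures can be reduced to a loss percentage and a duplicates-per-reply ratio. This is only a sketch: the function name ping_ratios is made up for illustration, and the regular expression assumes the summary line has exactly the format quoted in this thread.

```python
import re

# Summary line as quoted above from the ping run.
SUMMARY = ("4650 packets transmitted, 2949 packets received, "
           "+27752 duplicates, 36.6% packet loss")


def ping_ratios(line):
    """Return (loss_percent, duplicates_per_reply) from a ping summary line."""
    m = re.search(r"(\d+) packets transmitted, (\d+) packets received, "
                  r"\+(\d+) duplicates", line)
    if m is None:
        raise ValueError("unrecognized ping summary")
    sent, received, dups = (int(g) for g in m.groups())
    loss = 100.0 * (sent - received) / sent
    # Guard against a run in which no reply was received at all.
    dup_per_reply = dups / received if received else float("inf")
    return loss, dup_per_reply


loss, dup_per_reply = ping_ratios(SUMMARY)
print(f"loss {loss:.1f}%, {dup_per_reply:.1f} duplicates per reply received")
```

With the quoted numbers this gives the same 36.6% loss that ping reported and about 9.4 duplicates for every reply received, which is why knowing the direction of the duplication matters.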

> 
> Here is the 'netstat -i | grep -e Name -e Link' output for dom0 and for
> dom3 (dom5 is actually just a restarted dom2; please see the scheme above):
> 
> Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
> bge0  1500  <Link>        00:30:48:84:cf:98   172243   862   406381     0     0
> bge1  1500  <Link>        00:30:48:84:cf:99        0     0       20     0     0
> lo0   33192 <Link>                              1524     0     1524     0     0
> bridg 1500  <Link>                            625468     0  1175737 181978     0
> bridg 1500  <Link>                                17     0       27     0     0
> xvif1 1500  <Link>        aa:00:00:21:be:8b   121192 87493   188969     0     0
> xvif3 1500  <Link>        aa:00:00:05:e7:86   111157 88327   179549     0     0
> xvif4 1500  <Link>        aa:00:00:27:74:18    69277 87680   144414     0     0
> xvif5 1500  <Link>        aa:00:00:51:08:e4   166169 30719   252683     0     0
> xvif5 1500  <Link>        aa:00:00:51:08:e5       11     0        2     0     0

Lots of errors on the xvifs. Do you have any messages in dmesg? In the
driver, there are several places where a printf comes before the input
error counter is incremented.
However, one place where it's silently incremented is when it can't get
an mbuf (e.g. if you get "mclpool limit reached"). This would also explain
the network hangs.


> 
> Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
> lo0   33192 <Link>                                28     0       28     0     0
> xenne 1500  <Link>        aa:00:00:04:e7:86   179545     0   111152 88327     0

The output errors on this side are probably related to the input errors
on the dom0.
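
The relation between the two listings can be checked mechanically. The following is a minimal sketch, not a real tool: the helper names parse_netstat and match_errors are made up for illustration, and pairing interfaces by equal error counts is only a heuristic applied to the numbers quoted in this thread.

```python
# Lines copied from the two netstat listings quoted in this thread.
DOM0 = """\
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
xvif1 1500  <Link>        aa:00:00:21:be:8b   121192 87493   188969     0     0
xvif3 1500  <Link>        aa:00:00:05:e7:86   111157 88327   179549     0     0
xvif4 1500  <Link>        aa:00:00:27:74:18    69277 87680   144414     0     0
xvif5 1500  <Link>        aa:00:00:51:08:e4   166169 30719   252683     0     0
"""
DOMU = """\
xenne 1500  <Link>        aa:00:00:04:e7:86   179545     0   111152 88327     0
"""


def parse_netstat(text):
    """Parse 'netstat -i' link rows; counters are read from the right
    because the Address column may be empty."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or not all(f.isdigit() for f in fields[-5:]):
            continue  # header or unrelated line
        ipkts, ierrs, opkts, oerrs, _colls = map(int, fields[-5:])
        rows.append({"name": fields[0], "ipkts": ipkts, "ierrs": ierrs,
                     "opkts": opkts, "oerrs": oerrs})
    return rows


def match_errors(dom0_rows, domu_rows):
    """Heuristic: pair a domU interface with a dom0 interface when the
    domU's nonzero Oerrs equals the dom0 side's Ierrs."""
    return [(u["name"], d["name"])
            for u in domu_rows for d in dom0_rows
            if u["oerrs"] and u["oerrs"] == d["ierrs"]]


pairs = match_errors(parse_netstat(DOM0), parse_netstat(DOMU))
for domu_if, dom0_if in pairs:
    print(f"{domu_if} Oerrs matches {dom0_if} Ierrs")
```

On the quoted data the only pair found is the domU xennet interface and xvif3 in dom0, both showing 88327 errors.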

> [...]
> 
> Also, I've run into kernel panics on a domU machine (see below; btw,
> how can I save a core dump? "sync" doesn't work - dump device bad; /netbsd
> is a copy of the kernel actually booted).

Kernel core dumps don't work yet on Xen; it's on my todo list.

> 
> Panic message:
> 
> panic: m_makewritable: length changed
> Stopped at      netbsd:cpu_Debugger+0x4:        leave
> cpu_Debugger(c03f8d38,38,c03f8ce8,c03f8d38,0) at netbsd:cpu_Debugger+0x4
> panic(c0331900,1,0,0,0) at netbsd:panic+0x121
> m_makewritable(c03f8d38,0,3b9aca00,1,c0871500) at netbsd:m_makewritable+0x6b
> fr_check_wrapper(0,c03f8d38,c072d038,1,c0871800) at netbsd:fr_check_wrapper+0x1b

This is an internal diagnostic in m_makewritable(). Now, I see that the
code doesn't check the error code returned by m_copyback0(), so it's
possible the panic is triggered by a resource shortage in the mbuf pool.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--