Subject: Re: NetBSD/xen network problems (need help)
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Mike M. Volokhov <mishka@intostroy.com>
List: port-xen
Date: 01/25/2006 17:44:44
On Tue, 24 Jan 2006 20:51:19 +0100
Manuel Bouyer <bouyer@antioche.eu.org> wrote:

> On Tue, Jan 24, 2006 at 04:38:48PM +0200, Mike M. Volokhov wrote:
> > 
> > Well, behaviour has been changed. I've rebuilt all kernels and
> > restarted the system. After that all worked fine for some time, and
> > after a hour or so the whole system (including dom0) has been frozen
> > again. Network stats:
> > 
> > Name     Address              Ipkts Ierrs    Opkts Oerrs Colls
> > bge0     00:30:48:84:cf:98    31867  3714    24943     0     0
> > bge1     00:30:48:84:cf:99    23117   989    28967     0     0
> > lo0                             206     0      206     0     0
> > bridge0                       56841     0    58398  1312     0
> > bridge1                       52077     0    52074     0     0
> > xvif1.0  aa:00:00:17:e9:7f      532    28     1444     0     0
> > xvif2.0  aa:00:00:51:08:e4    24353     0    29746     0     0
> > xvif2.1  aa:00:00:51:08:e5    28960     0    22879     0     0
> > xvif3.0  aa:00:00:08:e9:d2       90     0     1055     0     0
> > xvif4.0  aa:00:00:49:06:01        4     0      954     0     0
> 
> Hum, nothing looks wrong here. Still no messages in the consoles ?

None :-( After three sequential kernel panics in domUs (twice for dom2
and once for dom1) I've reverted my kernels. Right now I'm testing
your new improvements.

On the other hand, there are no more hangups and no errors on
interfaces. I do have a small number of output errors (no drops) on
bridge0, but all of them appeared after the kernel panics (see below).

> 
> > 
> > 
> > Heh, and with this patch I've another panic for dom2 (that one with two
> > xennet interfaces):
> > 
> > panic: kernel diagnostic assertion "((pa ^ (pa + m->m_pkthdr.len)) & PG_FRAME) == 0" failed: file "../../../../arch/xen/xen/if_xennet.c", line 1036
> 
> I can't see how my patch could be related to this, nor how this can
> happen. This looks like memory corruption, 

Do you mean physical memory corruption? Well, I haven't run any
memtest on it, but the server works very well with all other kernels. The
memory installed is dual Kingston KVR533D2E4/512 (DDR2, ECC) on a
Supermicro P8SCT mobo (E7221 chipset).

> or m_pkthdr.len being way
> too large. If this happens again we'll have to put more printfs to see what
> really happens.
> 

It happened again: I've got it twice already within about one hour (it's
the same, but here is the output for reference):

panic: kernel diagnostic assertion "((pa ^ (pa + m->m_pkthdr.len)) & PG_FRAME) == 0" failed: file "../../../../arch/xen/xen/if_xennet.c", line 1037
Stopped at      netbsd:cpu_Debugger+0x4:        leave
cpu_Debugger(c03f8aa8,ffffffff,c06bd600,c06a0f38,c06a0f00) at netbsd:cpu_Debugger+0x4
panic(c033c780,c03097e7,c0338e80,c0338c60,40d) at netbsd:panic+0x121
__main(c03097e7,c0338c60,40d,c0338e80,1) at netbsd:__main
xennet_start(c072f038,c03f89cc,c072f038,2,c03f8a18) at netbsd:xennet_start+0x55a
ether_output(c072f038,c06bd600,c068c710,c06a8294,c06bd600) at netbsd:ether_output+0x38b
ip_output(c06bd600,0,c036e2f4,1,0) at netbsd:ip_output+0x547
ip_forward(c06bd600,0,c072d038,1,c072d038) at netbsd:ip_forward+0x176
ip_input(c06bd600,c02c2f26,c072d038,c06a0500,0) at netbsd:ip_input+0x29a
ipintr(fffffffe,20,4,1,c03f8e10) at netbsd:ipintr+0xad
DDB lost frame for netbsd:Xsoftnet+0x4f, trying 0xc03f8dd0
Xsoftnet() at netbsd:Xsoftnet+0x4f
--- interrupt ---
emul_freebsd_object(c03f8e4c,0,3b9a0000,ca00) at 0xc03fe000
Bad frame pointer: 0xc02ad0fc
ds          0x11
es          0x11
fs          0x31
gs          0x11
edi         0x1
esi         0x100
ebp         0xc03f88d8  emul_freebsd_object+0x6fd24
ebx         0x1
edx         0xc03fe000  emul_freebsd_object+0x7544c
ecx         0xffffffc0
eax         0xa6b
eip         0xc02ab1b0  cpu_Debugger+0x4
cs          0x9
eflags      0x202
esp         0xc03f88d8  emul_freebsd_object+0x6fd24
ss          0x11
netbsd:cpu_Debugger+0x4:        leave
Stopped at      netbsd:cpu_Debugger+0x4:        leave
db> reboot
syncing disks... panic: m_makewritable: length changed
Stopped at      netbsd:cpu_Debugger+0x4:        leave
db>
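
For reference, the condition in the failed assertion can be tried out in
user space. The sketch below is not code from if_xennet.c; it only
re-evaluates the same expression on plain integers. The mask value
0xfffff000 (4 KiB pages, non-PAE i386) and the names PG_FRAME_DEMO,
same_page_frame and check_examples are assumptions made for illustration.
The expression is true only when pa and pa + len fall into the same page
frame, so it fails either for a buffer that straddles a page boundary or
for a length that is far too large.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, frame mask as on non-PAE i386. */
#define PG_FRAME_DEMO UINT32_C(0xfffff000)

/*
 * Returns nonzero when pa and pa + len lie in the same page frame,
 * i.e. when the condition from the failed KASSERT would hold.
 */
int same_page_frame(uint32_t pa, uint32_t len)
{
    return ((pa ^ (pa + len)) & PG_FRAME_DEMO) == 0;
}

/* A few hand-picked cases; the addresses are made up. */
void check_examples(void)
{
    /* 1500 bytes starting at a page boundary: stays in one frame. */
    assert(same_page_frame(UINT32_C(0x00100000), 1500));
    /* Same length starting near the end of a page: crosses into the next. */
    assert(!same_page_frame(UINT32_C(0x00100f00), 1500));
    /* A bogus, oversized length fails even with an aligned start. */
    assert(!same_page_frame(UINT32_C(0x00100000), 70000));
}
```

Printing pa and m->m_pkthdr.len just before the assertion would tell
which of the two cases is being hit.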

> 
> On Tue, Jan 24, 2006 at 05:00:30PM +0200, Mike M. Volokhov wrote:
> > 
> > And yet another obscure with this patch. Right now I've lost connection
> > with all my single-attached domains (i.e. with only one xennet, dom[134]),
> > and dom0 showing the output errors gradually increased on bridge0.
> 
> OK, this I found why. when dropping packets with the wrong ether address
> in if_xennet.c, the receive buffer was not recycled. After some time,
> there were no RX buffer at xennet, and the transmit on the xvif stalls.
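
The failure mode described above can be modelled with a toy buffer
counter. This is only a sketch of the idea, not the real xennet code:
the ring size, struct rx_ring, rx_one and feed_foreign are hypothetical
names, and the real driver hands buffers to the stack and reposts new
ones rather than keeping a simple count. Each incoming packet consumes
one posted buffer; if a packet is dropped without the buffer being
handed back, the pool shrinks until nothing can be received any more.

```c
#include <assert.h>
#include <stdbool.h>

#define RX_RING_SIZE 8  /* hypothetical ring size for the demo */

struct rx_ring {
    int free_bufs;      /* buffers currently posted for the backend to fill */
};

/*
 * Handle one incoming packet. for_us is false when the destination
 * Ethernet address is not ours. recycle_on_drop selects the buggy
 * (false) or the fixed (true) behaviour.
 * Returns false when no buffer was posted, i.e. the sender stalls.
 */
bool rx_one(struct rx_ring *r, bool for_us, bool recycle_on_drop)
{
    assert(r->free_bufs >= 0 && r->free_bufs <= RX_RING_SIZE);
    if (r->free_bufs == 0)
        return false;           /* nothing posted: packet cannot be delivered */
    r->free_bufs--;             /* the backend filled one posted buffer */
    if (for_us || recycle_on_drop)
        r->free_bufs++;         /* buffer goes back to the ring */
    /* otherwise the packet was dropped and its buffer is leaked */
    return true;
}

/* Feed n packets with a foreign address; return how many the ring took. */
int feed_foreign(struct rx_ring *r, int n, bool recycle_on_drop)
{
    int accepted = 0;
    for (int i = 0; i < n; i++)
        if (rx_one(r, false, recycle_on_drop))
            accepted++;
    return accepted;
}
```

With recycle_on_drop set to false, feed_foreign() stops accepting packets
after RX_RING_SIZE of them and every later rx_one() call fails, even for
packets addressed to the interface. With it set to true the pool stays
full no matter how many foreign packets arrive.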

It seems there are no more drops here :-) So I'll leave this xenofarm with
the new kernels overnight and check the results tomorrow.
BTW, is there any way to reproduce the error by hand? (Uh, oh, please
excuse my limited knowledge of the network stack :-( ).

And I'm thinking about yet another problem. AFAIU the problem appears
when some packet from the net to xennet is incorrectly passed back to
the net through the bridge because routing is enabled in the domU, right?
In any case, I've seen total network hangups, including both bge[01]
interfaces (i.e. dom0 isn't reachable either). So I'm wondering whether
there is a problem with bridge(4) as well. It seems possible to create
the very same situation using bridges on hardware NICs only, and a few
malicious hosts on the LAN... Or am I missing something?

--
Mishka.