Subject: Re: NetBSD/xen network problems (need help)
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Mike M. Volokhov <mishka@intostroy.com>
List: port-xen
Date: 01/25/2006 17:44:44
On Tue, 24 Jan 2006 20:51:19 +0100
Manuel Bouyer <bouyer@antioche.eu.org> wrote:
> On Tue, Jan 24, 2006 at 04:38:48PM +0200, Mike M. Volokhov wrote:
> >
> > Well, behaviour has been changed. I've rebuilt all kernels and
> > restarted the system. After that all worked fine for some time, and
> > after an hour or so the whole system (including dom0) has been frozen
> > again. Network stats:
> >
> > Name Address Ipkts Ierrs Opkts Oerrs Colls
> > bge0 00:30:48:84:cf:98 31867 3714 24943 0 0
> > bge1 00:30:48:84:cf:99 23117 989 28967 0 0
> > lo0 206 0 206 0 0
> > bridge0 56841 0 58398 1312 0
> > bridge1 52077 0 52074 0 0
> > xvif1.0 aa:00:00:17:e9:7f 532 28 1444 0 0
> > xvif2.0 aa:00:00:51:08:e4 24353 0 29746 0 0
> > xvif2.1 aa:00:00:51:08:e5 28960 0 22879 0 0
> > xvif3.0 aa:00:00:08:e9:d2 90 0 1055 0 0
> > xvif4.0 aa:00:00:49:06:01 4 0 954 0 0
>
> Hum, nothing looks wrong here. Still no messages in the consoles ?
None :-( After three sequential kernel panics in domUs (twice for dom2
and once for dom1) I reverted my kernels. Right now I'm testing
your new improvements.
On the other hand, there are no more hangups and no errors on the
interfaces. However, I have a small number of output errors (no drops)
on bridge0, but all of them appeared after the kernel panics (see below).
>
> >
> >
> > Heh, and with this patch I've another panic for dom2 (that one with two
> > xennet interfaces):
> >
> > panic: kernel diagnostic assertion "((pa ^ (pa + m->m_pkthdr.len)) & PG_FRAME) == 0" failed: file "../../../../arch/xen/xen/if_xennet.c", line 1036
>
> I can't see how my patch could be related to this, nor how this can
> happen. This looks like memory corruption,
Do you mean physical memory corruption? Well, I haven't run any
memtest on it, but the server works very well with all other kernels.
The memory installed is dual Kingston KVR533D2E4/512 (DDR2, ECC) on a
Supermicro P8SCT mobo (E7221 chipset).
> or m_pkthdr.len being way
> too large. If this happens again we'll have to put more printfs to see what
> really happens.
>
It happened again, twice already within about an hour (it's the same
panic, but here is the output for reference):
panic: kernel diagnostic assertion "((pa ^ (pa + m->m_pkthdr.len)) & PG_FRAME) == 0" failed: file "../../../../arch/xen/xen/if_xennet.c", line 1037
Stopped at netbsd:cpu_Debugger+0x4: leave
cpu_Debugger(c03f8aa8,ffffffff,c06bd600,c06a0f38,c06a0f00) at netbsd:cpu_Debugger+0x4
panic(c033c780,c03097e7,c0338e80,c0338c60,40d) at netbsd:panic+0x121
__main(c03097e7,c0338c60,40d,c0338e80,1) at netbsd:__main
xennet_start(c072f038,c03f89cc,c072f038,2,c03f8a18) at netbsd:xennet_start+0x55a
ether_output(c072f038,c06bd600,c068c710,c06a8294,c06bd600) at netbsd:ether_output+0x38b
ip_output(c06bd600,0,c036e2f4,1,0) at netbsd:ip_output+0x547
ip_forward(c06bd600,0,c072d038,1,c072d038) at netbsd:ip_forward+0x176
ip_input(c06bd600,c02c2f26,c072d038,c06a0500,0) at netbsd:ip_input+0x29a
ipintr(fffffffe,20,4,1,c03f8e10) at netbsd:ipintr+0xad
DDB lost frame for netbsd:Xsoftnet+0x4f, trying 0xc03f8dd0
Xsoftnet() at netbsd:Xsoftnet+0x4f
--- interrupt ---
emul_freebsd_object(c03f8e4c,0,3b9a0000,ca00) at 0xc03fe000
Bad frame pointer: 0xc02ad0fc
ds 0x11
es 0x11
fs 0x31
gs 0x11
edi 0x1
esi 0x100
ebp 0xc03f88d8 emul_freebsd_object+0x6fd24
ebx 0x1
edx 0xc03fe000 emul_freebsd_object+0x7544c
ecx 0xffffffc0
eax 0xa6b
eip 0xc02ab1b0 cpu_Debugger+0x4
cs 0x9
eflags 0x202
esp 0xc03f88d8 emul_freebsd_object+0x6fd24
ss 0x11
netbsd:cpu_Debugger+0x4: leave
Stopped at netbsd:cpu_Debugger+0x4: leave
db> reboot
syncing disks... panic: m_makewritable: length changed
Stopped at netbsd:cpu_Debugger+0x4: leave
db>
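
For reference, here is how I read the condition in that assertion. This
is only a user-space sketch, not the code from if_xennet.c: the names
same_page() and PG_FRAME_SKETCH are made up for illustration, and I
assume 4 KB pages with a 32-bit physical address, so the frame mask is
0xfffff000. The expression is true only when pa and pa + len fall into
the same page frame, so a packet that crosses a page boundary (or a
bogus, too large length) trips it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KB pages, 32-bit addresses (mask 0xfffff000). */
#define PAGE_SIZE_SKETCH 4096u
#define PG_FRAME_SKETCH  (~(uint32_t)(PAGE_SIZE_SKETCH - 1))

/*
 * Same expression as in the panic message: XOR leaves only the bits in
 * which pa and pa + len differ; masking with the frame mask keeps the
 * page-number bits. A zero result means both addresses share a page.
 */
static bool
same_page(uint32_t pa, uint32_t len)
{
	return ((pa ^ (pa + len)) & PG_FRAME_SKETCH) == 0;
}
```

With this, same_page(0x1000, 100) holds, while same_page(0x1f00, 0x200)
does not, because the end address 0x2100 lies in the next page. Note
that a buffer ending exactly on the boundary, such as
same_page(0x1000, 0x1000), is rejected too, since pa + len is the first
byte of the next page.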
>
> On Tue, Jan 24, 2006 at 05:00:30PM +0200, Mike M. Volokhov wrote:
> >
> > And yet another obscure problem with this patch. Right now I've lost connection
> > with all my single-attached domains (i.e. with only one xennet, dom[134]),
> > and dom0 showing the output errors gradually increased on bridge0.
>
> OK, I found why this happens. When dropping packets with the wrong ether
> address in if_xennet.c, the receive buffer was not recycled. After some
> time, there were no RX buffers left at xennet, and the transmit on the
> xvif stalled.
It seems there are no more drops here :-) So I'll leave this xenofarm
running with the new kernels overnight and check the results tomorrow.
BTW, is there any way to reproduce the error by hand? (Uh, oh, please
excuse my limited knowledge of the network stack :-( ).
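
To check my understanding of the receive buffer problem described
above, here is a toy model of it. It is a sketch under my own
assumptions, not the real driver: struct rx_ring, rx_one(),
frames_until_stall() and the ring size of 8 are all hypothetical. Every
incoming frame consumes one posted RX buffer; if a frame with the wrong
ether address is dropped without putting the buffer back, the ring
slowly drains and the sender stalls.

```c
#include <stdbool.h>
#include <stddef.h>

#define RX_RING_SLOTS 8	/* hypothetical ring size, chosen for the example */

/* Toy receive ring: it only counts the buffers that are still posted. */
struct rx_ring {
	size_t posted;
};

/*
 * Handle one incoming frame. Returns false when no RX buffer is left,
 * i.e. the point where the sending side would stall.
 */
static bool
rx_one(struct rx_ring *r, bool addr_ok, bool recycle_on_drop)
{
	if (r->posted == 0)
		return false;
	r->posted--;		/* the frame consumed one buffer */
	if (!addr_ok && !recycle_on_drop)
		return true;	/* leak: dropped, buffer never reposted */
	r->posted++;		/* buffer is handed back to the ring */
	return true;
}

/* Feed up to `limit` wrongly addressed frames; count how many are handled. */
static size_t
frames_until_stall(bool recycle_on_drop, size_t limit)
{
	struct rx_ring r = { .posted = RX_RING_SLOTS };
	size_t n = 0;

	while (n < limit && rx_one(&r, false, recycle_on_drop))
		n++;
	return n;
}
```

In this model frames_until_stall(false, 100) stops after 8 frames, one
per ring slot, while frames_until_stall(true, 100) handles all 100. If
the real driver behaved like this, sending enough frames with a foreign
destination MAC towards a domU should have been enough to trigger the
stall with the old kernels.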
I'm also thinking about another problem. AFAIU the problem appears
when a packet arriving from the net at xennet is incorrectly passed back
to the net through the bridge because routing is enabled in the domU,
right? In any case, I've seen total network hangups, including both
bge[01] interfaces (i.e. dom0 isn't reachable either). So I'm wondering
whether there is a problem in bridge(4) itself, because it seems
possible to create the same situation using bridges on hardware NICs
only and a few malicious hosts on the LAN... Or am I missing something?
--
Mishka.