port-xen: xen network issues

Subject: xen network issues
To: None <port-xen@netbsd.org>
From: Johan Ihren <johani@johani.org>
List: port-xen
Date: 02/26/2006 07:46:16
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have two physical servers, each running about ten domUs. Everything
(dom0s and domUs) runs NetBSD 3.0REL. To keep down the size of the
fs images for the domUs I export /usr/share+/usr/X11R6+/usr/pkg via
NFS from the dom0s to the (local) domUs. I then thread together all
the domUs via a large number of VLANs.

In testing this has worked like a charm. I've run parallel compiles
on all the domUs, done all the various (mostly DNS related) stuff I
need to (mixed v4/v6 transport, lots of internal topology, packet
filters, parallel DHCP environments, etc, etc).

However, when doing this for real, with live students, I recently had
some trouble. The students sit on individual desktops and ssh into
their "own" domUs. On occasion the physical server bogged down
entirely: the interrupt rate on the dom0 reached 100% and
the network interface started to have device timeouts.

From there on things went downhill, and it became almost impossible
to get shell access to respond at all. In the end I typically
unplugged the physical network (to separate the physical servers) to
try to recover. Usually it did recover, although it took several
minutes.

Because this was a training environment I unfortunately did not have
much opportunity to debug this (students to take care of), so I just
left it on its own and hoped for recovery. Therefore there's not much
hard data other than:

* 100% interrupt rate
* oodles of "sip0: FIFO ring overrun" on one server
* "fxp0: device timeout" on the other server
* oodles of "nfs_timer: ignoring error 64" on all domUs
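
For anyone trying to put numbers on "oodles", a rough sketch of how
the messages above could be tallied from a saved log. This is only an
illustration: the function name tally() and the idea of feeding it
lines from a file such as /var/log/messages are assumptions, not
something that was actually run on these machines.

```python
from collections import Counter

# Message fragments copied from the list above. The surrounding log
# format (timestamps, hostnames) is assumed and simply ignored.
MESSAGES = {
    "fifo_overrun": "sip0: FIFO ring overrun",
    "device_timeout": "fxp0: device timeout",
    "nfs_timer": "nfs_timer: ignoring error 64",
}


def tally(lines):
    """Count how many lines contain each of the known messages."""
    counts = Counter()
    for line in lines:
        for name, fragment in MESSAGES.items():
            if fragment in line:
                counts[name] += 1
    return counts
```

Something like tally(open("/var/log/messages")) on each dom0 and domU
would then give per-host counts to compare against the time of the
interrupt spike.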

What I did *not* find was any massive network traffic. I.e. no raging
storms that I could see.

One other thing I did notice was that on occasion the remote access
to the domUs failed (i.e. the connection appeared to hang), while the
actual machines were just fine. An "ifconfig fxp0 down / ifconfig
fxp0 up" usually seemed to clear that if I was quick about it. If I
was busy with other stuff, I think it evolved into the general
catatonic state.

I'm really sorry I don't have more detailed information.

Johan Ihrén


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)

iD8DBQFEAU68KJmr+nqSTbYRAqS/AKCnHPavXboJluASs+zASzbzNB/6JgCggm1Z
uoBkA3JEUXYiff3kCo92Tj8=
=NX2Y
-----END PGP SIGNATURE-----