Subject: xen network issues
To: None <firstname.lastname@example.org>
From: Johan Ihren <email@example.com>
Date: 02/26/2006 07:46:16
-----BEGIN PGP SIGNED MESSAGE-----
I have two physical servers, each running about ten domUs. Everything
(dom0s and domUs) runs NetBSD 3.0REL. To keep down the size of the
fs images for the domUs I export /usr/share+/usr/X11R6+/usr/pkg via
NFS from the dom0s to the (local) domUs. I then thread together all
the domUs via a large number of VLANs.
In testing this has worked like a charm. I've run parallel compiles
on all the domUs, done all the various (mostly DNS related) stuff I
need to (mixed v4/v6 transport, lots of internal topology, packet
filters, parallel DHCP environments, etc, etc).
However, when doing this for real, with live students, recently I had
some trouble. The students sit on individual desktops and ssh into
their "own" domUs. On occasion the physical server bogged down
entirely: the interrupt rate on the dom0 reached 100% and
the network interface started to have device timeouts.
From there on things went downhill; it was almost impossible to get
shell access to respond at all. In the end I typically unplugged the
physical network (to separate the physical servers) to try to
recover. Usually it did recover, although it took several minutes.
Because this was a training environment I unfortunately did not have
much opportunity to debug this (students to take care of), so I just
left it on its own and hoped for recovery. Therefore there's not much
hard data other than:
* 100% interrupt rate
* oodles of "sip0: FIFO ring overrun" on one server
* "fxp0: device timeout" on the other server
* oodles of "nfs_timer: ignoring error 64" on all domUs
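
A minimal sketch of how messages like these could be tallied from a
saved kernel log if the problem recurs. The pattern names and the
tally() helper are hypothetical and only for illustration; the message
strings are the ones listed above.

```python
import re
from collections import Counter

# Message strings taken from the symptom list above.
# The short labels on the left are made up for this sketch.
PATTERNS = {
    "sip0 overrun": re.compile(r"sip0: FIFO ring overrun"),
    "fxp0 timeout": re.compile(r"fxp0: device timeout"),
    "nfs error 64": re.compile(r"nfs_timer: ignoring error 64"),
}


def tally(lines):
    """Count how many log lines match each known symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feeding it the lines of a log file (for example an open file object)
gives a per-symptom count, which would at least show which message
started first and how fast it grew.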
What I did *not* find was any massive network traffic, i.e. no raging
storms that I could see.
One other thing I did notice was that on occasion the remote access
to the domUs failed (i.e. the connection appeared to hang), while the
actual machines were just fine. An "ifconfig fxp0 down / ifconfig
fxp0 up" usually seemed to clear that if I was quick about it. If I
was busy with other stuff I think that evolved into the general
catatonic state.
I'm really sorry I don't have more detailed information.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)
-----END PGP SIGNATURE-----