Subject: Hard to track problem
To: None <port-xen@netbsd.org>
From: Konrad Neuwirth <konrad@mailathome.or.at>
List: port-xen
Date: 10/25/2006 10:53:10
One of our Xen systems reliably stops working in an interesting, but not
entirely easy-to-understand way.

First, a sketch of our general configuration:

We're running two mostly identical systems in a remote data center.  The
systems run 4.0_BETA on Xen 2.  Hardware-wise, they're single-core AMD64
boxes running NetBSD/i386 because of Xen, with an onboard Realtek 8169B
ethernet chip handled by re(4).  The card is driven by the dom0 but
doesn't have a publicly reachable IP address there -- the first
interface responding to outside traffic is a single-purpose domU that
does firewalling and routing for the other domUs.
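
For reference, the router domUs are started from Xen 2 config files
roughly along these lines (names, MAC addresses and devices below are
placeholders, not our actual values):

    # rough sketch of the router domU config -- placeholders only
    kernel = "/netbsd-XEN2_DOMU"
    memory = 64
    name   = "router"
    # one vif bridged to re0 in the dom0, one on an internal bridge
    # for the other domUs
    vif    = [ 'mac=aa:00:00:00:00:01, bridge=bridge0',
               'mac=aa:00:00:00:00:02, bridge=bridge1' ]
    disk   = [ 'phy:/dev/wd0e,0x1,w' ]
    root   = "/dev/xbd0a"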

Because we don't have a private network at the colocation center, the
router domUs also have an IPsec-based VPN between them.  The domUs run
ucarp over that VPN so that we can fail over individual domUs if
something happens.
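
Roughly, the ucarp invocation looks like this (interface, vhid,
addresses, password and script paths below are placeholders):

    # sketch of how ucarp runs on the router domUs -- placeholders only
    ucarp -i xennet1 -v 10 -p somepassword \
          -s 10.0.1.1 -a 10.0.1.100 \
          -u /etc/ucarp-up.sh -d /etc/ucarp-down.sh
    # the up/down scripts just add or remove the shared address
    # on the interface with ifconfig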

One of the domUs is our 'database engine', running pgpool as the HA
layer atop postgresql so we can easily keep the databases on both
machines in sync.  But as soon as we feed a larger database dump
through pgpool, the entire machine stops responding to outside network
traffic.  The dom0 console logs a "re0: watchdog timeout", and pinging
outside machines from one of the domUs produces a "no buffers
available" message on the xennet interface.  First tests indicate that
the problem is too much network traffic being generated at once --
feeding the dump to just one postgresql server works fine; as soon as
pgpool starts to distribute the transactions onto both machines, the
machine sending the feed is dead again.
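
For what it's worth, feeding the dump is nothing fancy -- essentially
just this (address, port and names below are placeholders):

    # feed the dump through pgpool rather than straight into postgres
    psql -h 10.0.1.100 -p 9999 -f dump.sql ourdb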

We've already increased NMBCLUSTERS to 20480 -- so I suppose that should
suffice for a while.  Alas, I have no idea how to debug this.  Which
buffers could we still increase?  And how can we track down this
problem well enough to stop it from happening?
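
For completeness, this is roughly how we check and bump the mbuf
cluster limit (assuming the kern.mbuf sysctl node behaves the same on
4.0_BETA as on earlier releases):

    netstat -m                               # mbuf/cluster usage, denied requests
    sysctl kern.mbuf.nmbclusters             # current cluster limit
    sysctl -w kern.mbuf.nmbclusters=20480    # raise it at runtime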

Cheers,
 Konrad