Subject: Re: Hard to track problem
To: Konrad Neuwirth <konrad@mailathome.or.at>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-xen
Date: 10/25/2006 13:11:44
On Wed, Oct 25, 2006 at 10:53:10AM +0200, Konrad Neuwirth wrote:
> One of our Xen systems reliably stops working in an interesting, but not
> entirely easy-to-understand way.
> 
> First, a sketch of our general configuration:
> 
> We're running two mostly identical systems in a remote data center.  The
> systems run 4.0_BETA on Xen 2. Hardware-wise, they're AMD64 (single core)
> running NetBSD/i386 because of Xen; they have re(4)-driven Realtek 8169B
> ethernet on board.  The card is driven by the dom0 but doesn't have a
> publicly reachable IP address there -- the first interface responding to
> outside traffic is a single-purpose domU that does firewalling and
> routing to the other domUs.
> 
> Because we don't have a private network at the colocation center, the
> router domUs also have an IPsec-based VPN amongst each other.  The domUs
> run ucarp over that VPN so that we can fail over single domUs if
> something happens.
> 
> One of the domUs is our 'database engine', running pgpool as the HA
> layer atop postgresql so we can keep the databases in sync on both
> machines easily.  But as soon as we feed a larger database dump onto
> pgpool, the entire machine stops responding to outside network traffic.
> The dom0 console logs a "re0: watchdog timeout", and pinging outside
> machines from one of the domUs leads to a "no buffers available" message
> on the xennet interface. First tests indicate that the problem is that
> too much network traffic is generated at once -- feeding the dump to
> just one postgresql server is fine, but as soon as pgpool starts to
> distribute the transactions onto both machines, the one sending the feed
> is dead again.
> 
> We've already increased NMBCLUSTERS to 20480 -- so I suppose that should
> suffice for a while.  Alas, I have no idea how to debug this.  What
> buffers could we still increase? And -- how can we track down this
> problem well enough to stop it from happening?

My guess is that it's a bug in the re(4) driver. Could you try using
the rtk driver instead (if it supports your card), or using -current?
There have been some fixes to re(4) in current which may solve your
problem.

-- 
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
     NetBSD: 26 years of experience will always make the difference