Port-xen archive


Re: Problems with many DOMUs on a single DOM0.



At the risk of being completely and utterly wrong in a public forum, I would
suggest you look at your open file descriptor limits to at least rule out the
possibility that xenconsoled is running out of file descriptors for the ptys
it's managing. I haven't looked into this too deeply or examined the source,
but lsof seems to indicate that there are 2 fds used per domU (which makes
sense), plus a few used for overhead. It wouldn't take long to run out if you
didn't take some steps to increase things from the defaults.
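
To put rough numbers on that, here is a minimal sketch of the arithmetic in
Python. It assumes the 2 fds per domU seen in lsof; the overhead figure of 10
is only a placeholder guess, not something measured or taken from the
xenconsoled source. Note that it reports the limits of the process running
the script, so it only tells you something about xenconsoled if you run it
from the same environment (same user, same shell limits) that starts
xenconsoled.

```python
import resource

FDS_PER_DOMU = 2    # from the lsof observation; not verified against the source
OVERHEAD_FDS = 10   # assumed placeholder for "a few used for overhead"


def max_domus(soft_limit, fds_per_domu=FDS_PER_DOMU, overhead=OVERHEAD_FDS):
    """Estimate how many domU consoles fit under an open-file limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return None  # no limit configured, so fds are not the constraint
    return max(0, (soft_limit - overhead) // fds_per_domu)


if __name__ == "__main__":
    # Limits of *this* process, inherited from the invoking shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open files: soft limit", soft, "hard limit", hard)
    print("estimated domU consoles:", max_domus(soft))
```

If the estimate comes out close to the point where your consoles stop
working, that would be a hint that the fd limit is worth raising before
looking elsewhere.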

I've been using xen for a long time now -- nearly a decade -- and with each 
major version, the console support seems to be improving. Once 4.2 hits pkgsrc 
and has a chance to gel a bit, you may want to consider upgrading. 

Also, with that many domUs, even with SSD, that's a lot of backend I/O, so
you'll also want to take the normal steps to make sure dom0 gets the
resources it needs.

Best of luck!

Harry Waddell 


On Mon, 7 Jan 2013 18:53:06 +0100
Johan Ihrén <johani%johani.org@localhost> wrote:

> Hi,
> 
> All of this is NetBSD-6.0, XEN 3.3.2, with ptyfs mounted, all VND-devices 
> created, etc. However, the results are basically the same for 5.2. I have 
> looked at the XEN logs, but haven't found any clues there.
> 
> I run many DOMUs on the same DOM0. No need for optimal performance, but 
> strong need for many separate DOMUs. They are all file-backed, using VND and 
> PV (not HVM). The DOM0 is always amd64, while the DOMUs used to be i386pae, 
> but I'm migrating them to also be amd64.
> 
> Previously over the years I've been limited by CPU, by disk IO, by available 
> memory, etc, to make the reasonable limit around 30 DOMUs on a quad core box 
> with 8GB memory and four SSDs, and that works like a charm. I.e. I've been 
> constrained by the hardware, not the OS.
> 
> But I would like to get to around 50-60 DOMUs and current hardware has enough 
> cores and memory to provide that without too much fuss. I.e. if there are 
> constraints now, they are likely OS or XEN constraints.
> 
> And I'm running into problems. Several problems actually.
> 
> As I start more DOMUs eventually I reach a point where the consoles no longer 
> work:
> ------
> witch:labconfig# xm console domu38
> NetBSD/amd64 (domu38) (console)
> login:                                 # login prompt, this DOMU is fine
> 
> witch:labconfig# xm console domu39     # this one, however, is not:
> 
> xenconsole: Could not read tty from store: No such file or directory
> ------
> It is interesting to note that the limit is "soft" in the sense that if I 
> kill a couple of machines it is possible to start a few other ones that will 
> then get working consoles. I.e. it is not a permanent resource exhaustion.
> 
> What's also interesting, though, is that sometimes (but not always) "domu39" 
> is fine, except for the lack of a console. I.e. as long as I don't screw up 
> my networking, I can add some more DOMUs... until I hit the next problem. 
> This time, all machines up to and including "domu44" were ok. But "dom45" is 
> not working ("not working" defined as "doesn't respond to ping").
> 
> There's another problem with non-working DOMUs, and that is that they tend to 
> go to 100% CPU and stay there. It is not exactly clear to me when this 
> happens. Sometimes it is immediately when the DOMU is created, sometimes I've 
> been able to use a DOMU for hours with no problems (except lack of console) 
> and then it goes to 100% CPU when I try to kill it off with "xm shutdown" 
> (which doesn't work). "xm destroy" does kill them off, though.
> 
> And now it gets really strange. If I kill off the non-working DOMUs with "xm 
> destroy" and then start them again then sometimes they work (still no 
> console, but networking ok, so it is possible to get to them). This way, by 
> booting DOMUs, and destroying and rebooting them until they work, I've been 
> able to get to 52 working DOMUs, which is enough for me. But the last few 
> machines are really skittish and may require several restarts before they 
> work at all.
> 
> And sometimes (but not always) I get problems with xend:
> ------
> Unable to connect to xend: Connection refused. Is xend running?
> ------
> xend IS running. But not functioning for some reason.
> 
> When this happens, it is not possible to restart xend with "/etc/rc.d/xend 
> restart". Only way to kill xend is with "kill -9" (it is in state "Il"). But 
> once xend is restarted it is possible to recover without rebooting.
> 
> The first problem (no console for machines ~40 and up) is likely some sort of 
> PTY resource exhaustion, although I don't understand why or where. When it 
> happens I've run a small python script to check whether (the python) openpty 
> function is able to allocate a PTY and that seems to work ok. I used python 
> only because xen is written in python. Other suggestions for what to try 
> would be appreciated.
> 
> The second problem (some DOMUs going to 100% CPU and in general not 
> functioning) is probably more difficult. But without a console it is 
> difficult to diagnose.
> 
> The third problem (xend becoming catatonic) happens less frequently, and 
> sometimes not at all. And as it is possible to recover by killing xend and 
> restarting it, it is less of a pain than the others. But there's still a 
> problem in there somewhere.
> 
> Suggestions anyone?
> 
> Regards,
> 
> Johan Ihrén
> 
> 


