Port-xen archive


Problems with many DOMUs on a single DOM0.


All of this is NetBSD-6.0, XEN 3.3.2, with ptyfs mounted, all VND-devices 
created, etc. However, the results are basically the same for 5.2. I have 
looked at the XEN logs, but haven't found any clues there.

I run many DOMUs on the same DOM0. No need for optimal performance, but strong 
need for many separate DOMUs. They are all file-backed, using VND and PV (not 
HVM). The DOM0 is always amd64, while the DOMUs used to be i386pae, but I'm 
migrating them to also be amd64.

Over the years I've been limited by CPU, by disk I/O, by available memory, 
etc., which put the reasonable limit at around 30 DOMUs on a quad-core box 
with 8GB memory and four SSDs, and that works like a charm. I.e. I've been 
constrained by the hardware, not the OS.

But I would like to get to around 50-60 DOMUs and current hardware has enough 
cores and memory to provide that without too much fuss. I.e. if there are 
constraints now, they are likely OS or XEN constraints.

And I'm running into problems. Several problems actually.

As I start more DOMUs, I eventually reach a point where the consoles no longer 
work:

witch:labconfig# xm console domu38
NetBSD/amd64 (domu38) (console)
login:                                 # login prompt, this DOMU is fine

witch:labconfig# xm console domu39     # this one, however, is not:

xenconsole: Could not read tty from store: No such file or directory

It is interesting to note that the limit is "soft" in the sense that if I kill 
a couple of machines it is possible to start a few other ones that will then 
get working consoles. I.e. it is not a permanent resource exhaustion.

What's also interesting, though, is that sometimes (but not always) "domu39" is 
fine, except for the lack of a console. I.e. as long as I don't screw up my 
networking, I can add some more DOMUs... until I hit the next problem. This 
time, all machines up to and including "domu44" were OK. But "domu45" is not 
working ("not working" defined as "doesn't respond to ping").

There's another problem with non-working DOMUs, and that is that they tend to 
go to 100% CPU and stay there. It is not exactly clear to me when this happens. 
Sometimes it is immediately when the DOMU is created, sometimes I've been able 
to use a DOMU for hours with no problems (except lack of console) and then it 
goes to 100% CPU when I try to kill it off with "xm shutdown" (which doesn't 
work). "xm destroy" does kill them off, though.

And now it gets really strange. If I kill off the non-working DOMUs with "xm 
destroy" and then start them again then sometimes they work (still no console, 
but networking ok, so it is possible to get to them). This way, by booting 
DOMUs, and destroying and rebooting them until they work, I've been able to get 
to 52 working DOMUs, which is enough for me. But the last few machines are 
really skittish and may require several restarts before they work at all.
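
A minimal sketch of the destroy-and-retry procedure described above, assuming 
that "reachable" means "answers one ping" and that the DOMU is started from a 
config file with "xm create". The function name, the config path, the address, 
the attempt count and the boot wait are all hypothetical; only the xm 
subcommands come from the description above.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return True if it exited with status 0."""
    return subprocess.call(cmd) == 0


def start_until_reachable(name, config, address, attempts=5, boot_wait=60,
                          run=run, sleep=time.sleep):
    """Create a DOMU, ping it, and destroy/recreate it until it answers.

    Returns the attempt number that succeeded, or None if none did.
    `run` and `sleep` are parameters so the loop can be exercised without Xen.
    """
    for attempt in range(1, attempts + 1):
        run(["xm", "create", config])
        sleep(boot_wait)                      # assumed time for the DOMU to boot
        if run(["ping", "-c", "1", address]):
            return attempt
        # "xm shutdown" does not work on a stuck DOMU, so destroy it instead.
        run(["xm", "destroy", name])
    return None


if __name__ == "__main__":
    # Hypothetical name, config path and address, for illustration only.
    result = start_until_reachable("domu45", "/usr/pkg/etc/xen/domu45",
                                   "192.0.2.45")
    print("domu45: %s" % ("ok after %d attempt(s)" % result if result
                          else "still not reachable"))
```

This only automates the workaround; it does nothing about the underlying 
cause, and a DOMU that answers ping may still lack a console.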

And sometimes (but not always) I get problems with xend:

Unable to connect to xend: Connection refused. Is xend running?

xend IS running. But not functioning for some reason.

When this happens, it is not possible to restart xend with "/etc/rc.d/xend 
restart". The only way to kill xend is with "kill -9" (it is in state "Il"). 
But once xend is restarted it is possible to recover without rebooting.

The first problem (no console for machines ~40 and up) is likely some sort of 
PTY resource exhaustion, although I don't understand why or where. When it 
happens I've run a small Python script to check whether (the Python) openpty 
function is able to allocate a PTY, and that seems to work OK. I used Python 
only because xend is written in Python. Other suggestions for what to try would 
be appreciated.
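
For reference, a check of this kind can be sketched as below. This is not the 
exact script that was run: it is a variation that tries to hold several PTY 
pairs open at once, on the assumption that a limit might only show up when 
many are in use simultaneously. The function name and the default count of 64 
are arbitrary.

```python
import os


def probe_ptys(count=64):
    """Try to hold `count` PTY pairs open at the same time.

    Returns (number allocated, first error or None, list of slave names).
    All descriptors are closed again before returning.
    """
    held = []
    names = []
    error = None
    try:
        for _ in range(count):
            try:
                master, slave = os.openpty()
            except OSError as exc:
                error = exc                 # allocation failed: stop probing
                break
            held.append((master, slave))
            try:
                names.append(os.ttyname(slave))
            except OSError:
                names.append("?")           # allocated, but name not resolvable
        return len(held), error, names
    finally:
        for master, slave in held:
            os.close(master)
            os.close(slave)


if __name__ == "__main__":
    allocated, error, names = probe_ptys()
    print("allocated %d PTYs" % allocated)
    if names:
        print("first: %s  last: %s" % (names[0], names[-1]))
    if error is not None:
        print("stopped on: %s" % error)
```

If this allocates the full count while a new DOMU still gets no console, the 
PTY layer itself is probably not what runs out, and the place to look would be 
whatever is supposed to write the tty name into the store.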

The second problem (some DOMUs going to 100% CPU and in general not 
functioning) is probably more difficult. But without a console it is difficult 
to diagnose.

The third problem (xend becoming catatonic) happens less frequently, and 
sometimes not at all. And as it is possible to recover by killing xend and 
restarting it, it is less of a pain than the others. But there's still a 
problem in there somewhere.

Suggestions anyone?


Johan Ihrén
