Problems with many DOMUs on a single DOM0.
All of this is NetBSD-6.0, XEN 3.3.2, with ptyfs mounted, all VND-devices
created, etc. However, the results are basically the same for 5.2. I have
looked at the XEN logs, but haven't found any clues there.
I run many DOMUs on the same DOM0. No need for optimal performance, but strong
need for many separate DOMUs. They are all file-backed, using VND and PV (not
HVM). The DOM0 is always amd64, while the DOMUs used to be i386pae, but I'm
migrating them to also be amd64.
Previously over the years I've been limited by CPU, by disk IO, by available
memory, etc, to make the reasonable limit around 30 DOMUs on a quad core box
with 8GB memory and four SSDs, and that works like a charm. I.e. I've been
constrained by the hardware, not the OS.
But I would like to get to around 50-60 DOMUs and current hardware has enough
cores and memory to provide that without too much fuss. I.e. if there are
constraints now, they are likely OS or XEN constraints.
And I'm running into problems. Several problems actually.
As I start more DOMUs eventually I reach a point where the consoles no longer
work:
witch:labconfig# xm console domu38
NetBSD/amd64 (domu38) (console)
login: # login prompt, this DOMU is fine
witch:labconfig# xm console domu39 # this one, however, is not:
xenconsole: Could not read tty from store: No such file or directory
It is interesting to note that the limit is "soft" in the sense that if I kill
a couple of machines it is possible to start a few other ones that will then
get working consoles. I.e. it is not a permanent resource exhaustion.
What's also interesting, though, is that sometimes (but not always) "domu39" is
fine, except for the lack of a console. I.e. as long as I don't screw up my
networking, I can add some more DOMUs... until I hit the next problem. This
time, all machines up to and including "domu44" were ok. But "domu45" is not
working ("not working" defined as "doesn't respond to ping").
There's another problem with non-working DOMUs, and that is that they tend to
go to 100% CPU and stay there. It is not exactly clear to me when this happens.
Sometimes it is immediately when the DOMU is created, sometimes I've been able
to use a DOMU for hours with no problems (except lack of console) and then it
goes to 100% CPU when I try to kill it off with "xm shutdown" (which doesn't
work). "xm destroy" does kill them off, though.
And now it gets really strange. If I kill off the non-working DOMUs with "xm
destroy" and then start them again then sometimes they work (still no console,
but networking ok, so it is possible to get to them). This way, by booting
DOMUs, and destroying and rebooting them until they work, I've been able to get
to 52 working DOMUs, which is enough for me. But the last few machines are
really skittish and may require several restarts before they work at all.
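
The destroy-and-restart routine described above could be automated along the
following lines. This is only a sketch of that manual procedure, not a fix for
the underlying problem. It uses "doesn't respond to ping" as the definition of
"not working", as above. The config path, address, attempt count and boot wait
in the usage note are made-up placeholders, and the `run`, `ping` and `sleep`
parameters exist only so the loop can be exercised without a real xend.

```python
import subprocess
import time


def responds_to_ping(host, timeout_s=5):
    """Return True if a single ICMP echo to `host` is answered."""
    try:
        rc = subprocess.call(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return rc == 0


def start_until_reachable(name, config, host, attempts=5, boot_wait_s=60,
                          run=subprocess.call, ping=responds_to_ping,
                          sleep=time.sleep):
    """Create a DOMU, and destroy and recreate it until it answers ping.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed. A failed DOMU is always destroyed before the next try.
    """
    for attempt in range(1, attempts + 1):
        run(["xm", "create", config])
        sleep(boot_wait_s)          # assumed time for the guest to boot
        if ping(host):
            return attempt
        run(["xm", "destroy", name])
    return None
```

A hypothetical call would be
`start_until_reachable("domu45", "/path/to/domu45.conf", "192.0.2.45")`.
It returns the attempt number on success, which also gives a rough measure of
how "skittish" a given DOMU is.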
And sometimes (but not always) I get problems with xend:
Unable to connect to xend: Connection refused. Is xend running?
xend IS running. But not functioning for some reason.
When this happens, it is not possible to restart xend with "/etc/rc.d/xend
restart". Only way to kill xend is with "kill -9" (it is in state "Il"). But
once xend is restarted it is possible to recover without rebooting.
The first problem (no console for machines ~40 and up) is likely some sort of
PTY resource exhaustion, although I don't understand why or where. When it
happens I've run a small python script to check whether (the python) openpty
function is able to allocate a PTY and that seems to work ok. I used python
only because xend is written in python. Other suggestions for what to try would
be welcome.
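
For reference, a PTY check of the kind described above might look like the
sketch below. It is not the exact script that was run. The function name and
the default limit of 64 are arbitrary choices, and it assumes Python 3. It
opens PTY pairs with `os.openpty` until one fails or the limit is reached,
reports how many it got, and closes them all again.

```python
import os


def count_allocatable_ptys(limit=64):
    """Try to open up to `limit` PTY pairs.

    Returns (count, error): `count` is how many pairs were opened and
    `error` is the OSError that stopped the loop, or None if the limit
    was reached. All descriptors are closed before returning.
    """
    pairs = []
    error = None
    try:
        for _ in range(limit):
            try:
                pairs.append(os.openpty())
            except OSError as exc:
                error = exc
                break
        return len(pairs), error
    finally:
        for master, slave in pairs:
            os.close(master)
            os.close(slave)


if __name__ == "__main__":
    count, error = count_allocatable_ptys()
    print("allocated %d PTY pairs" % count)
    if error is not None:
        print("stopped by: %s" % error)
```

Note that a probe like this runs in a fresh process with its own descriptor
table. A successful run only shows that the system as a whole can still hand
out PTYs. It cannot see a limit that applies inside a daemon that is already
running.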
The second problem (some DOMUs going to 100% CPU and in general not
functioning) is probably more difficult. But without a console it is difficult
to debug.
The third problem (xend becoming catatonic) happens less frequently, and
sometimes not at all. And as it is possible to recover by killing xend and
restarting it, it is less of a pain than the others. But there's still a problem
in there somewhere.