Port-xen archive


Re: Xen/Xentools 3.3 Domain-Unnamed



Christoph Egger wrote:
On Friday 15 August 2008 12:00:51 bsd-xen%roguewrt.org@localhost wrote:
Christoph Egger wrote:
On Friday 15 August 2008 08:55:55 Sarton O'Brien wrote:
When shutting down or rebooting a domu I'm left with:

Domain-Unnamed                               1   467     1     ---s--     46.0

In 'xm list' ... but not in 'xm top' (it displays for a little while in
'xm top', then disappears). If rebooting, the domu will not start up by
itself.

I can't kill it and when doing multiple domu reboots, only one ever
exists.
Every time I try to destroy it, I see that it frees some memory. So repeating
'xm destroy <domain>; xm list' eventually kills it.

There's something asynchronous within the hypervisor which obviously
needs to be debugged.

This happens for both PV and HVM guests.

I tracked this issue down. The root cause is a discrepancy in the
error *value* codes between AT&T Unix Version 6 and Unix System V.

Linux, Xen and Solaris use the Unix System V error codes.
*BSD uses the AT&T Unix Version 6 error codes.

After shutting down (or rebooting) a domU, the guest container gets destroyed.
This implies freeing resources used by the guest (RAM, internal management structures, etc.).

The destroy process is asynchronous so that it does not block the Dom0
(and other DomUs).

The destroy process works this way:

The XEN_DOMCTL_destroydomain hypercall is invoked from the xentools (Python and libxc code).
In the hypervisor:
XEN_DOMCTL_destroydomain hypercall calls domain_kill().
domain_kill() calls domain_relinquish_resources().
domain_relinquish_resources() calls relinquish_memory().
relinquish_memory() calls hypercall_preempt_check().

hypercall_preempt_check() makes all this asynchronous.
It fails if there is another hypercall pending.
In that case relinquish_memory() returns EAGAIN, which
simply means: retry to continue the destroy process.
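
To make the preemption step concrete, here is a minimal toy model in Python.
It is not the hypervisor code (which is C): the Domain class, the batch size
and the preempt_pending callable are assumptions made for illustration. It
only shows the pattern described above: free some memory, and if preemption
is requested, keep the progress and return EAGAIN so the caller retries.

```python
EAGAIN = 11  # System V numbering, the value the post says Xen uses


class Domain:
    """Toy stand-in for a guest that still owns some memory pages."""

    def __init__(self, pages):
        self.pages = pages


def relinquish_memory(domain, preempt_pending, batch=4):
    """Free pages in batches until done or until preemption is requested.

    preempt_pending is a hypothetical callable standing in for
    hypercall_preempt_check(); it returns True when the work must yield.
    """
    while domain.pages > 0:
        domain.pages -= min(batch, domain.pages)
        if domain.pages > 0 and preempt_pending():
            # Progress made so far is kept; the caller has to call again.
            return -EAGAIN
    return 0
```

Each interrupted call frees part of the memory, which would match the
observation that repeated destroy attempts free some memory each time.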

EAGAIN is passed through the return path back into the Python code
(= userspace). The Python code checks for EAGAIN and *should*
retry, but it doesn't.
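
The retry the toolstack is expected to do could look roughly like the sketch
below. This is not the actual xentools code: destroy_domain, do_destroy and
max_tries are hypothetical names, and do_destroy is assumed to return 0 on
success or a negative error code.

```python
EAGAIN_XEN = 11  # the value Xen returns for EAGAIN, per the post


def destroy_domain(do_destroy, max_tries=100):
    """Call do_destroy() until it stops reporting EAGAIN.

    do_destroy is a hypothetical callable standing in for the destroy
    hypercall wrapper. Returns (last return code, number of attempts).
    """
    for attempt in range(1, max_tries + 1):
        rc = do_destroy()
        if rc != -EAGAIN_XEN:
            return rc, attempt  # finished, or failed for another reason
    return -EAGAIN_XEN, max_tries  # gave up while still being preempted
```

Note that the comparison uses Xen's numeric value directly. Comparing against
the host's own EAGAIN constant is exactly what goes wrong when the host
numbers its error codes differently.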


In Unix System V, EAGAIN has the error code value 11.
In AT&T Unix Version 6, EDEADLK has the error code value 11.

As noted above, Xen uses Unix System V error code values, while
*BSD uses AT&T Unix Version 6 error code values.

So when Xen returns EAGAIN, the Python code sees EDEADLK.
This leads to the confusing
"domain destroy failed due to 'Resource deadlock avoided'"
error message.
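
The mismatch can be shown with two small lookup tables. The value 11 on each
side is from the explanation above; the value 35 for the opposite name on
each side is taken from the Linux and NetBSD errno.h headers, not from this
thread. The function names are made up for illustration, and nothing here is
the fix XenSource may choose.

```python
# Number -> name, hard-coded so the example behaves the same on any host.
XEN_ERRNO = {11: "EAGAIN", 35: "EDEADLK"}  # System V style (Linux, Xen)
BSD_ERRNO = {11: "EDEADLK", 35: "EAGAIN"}  # *BSD numbering

MESSAGES = {
    "EAGAIN": "Resource temporarily unavailable",
    "EDEADLK": "Resource deadlock avoided",
}


def describe_on_bsd(code):
    """What a *BSD userland reports for a raw error number."""
    return MESSAGES[BSD_ERRNO[code]]


def xen_to_bsd(code):
    """Translate a Xen error number to the *BSD number with the same name."""
    name = XEN_ERRNO[code]
    bsd_by_name = {n: num for num, n in BSD_ERRNO.items()}
    return bsd_by_name[name]


print(describe_on_bsd(11))              # untranslated: the misleading text
print(describe_on_bsd(xen_to_bsd(11)))  # translated by name first
```

Translating by name before interpreting the number is one possible way to
keep the EAGAIN check in userspace working. Whether that belongs in libxc,
in the Python layer or in the kernel interface is left open here.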

I informed XenSource about this to find a solution.

Nice catch ... and thanks for all the information! Greatly appreciated.

Sarton

