Subject: Domains stuck in "shutdown" state
To: None <port-xen@NetBSD.org>
From: Jed Davis <jdev@panix.com>
List: port-xen
Date: 06/24/2005 03:34:06
I've discovered that, by shutting down a lot of domains at once, I can
get some stuck in a state like this:

  Name              Id  Mem(MB)  CPU  State  Time(s)  Console
  Domain-0           0      127    0  r----     29.0        
  unreal02           3        0    0  ---s-      4.1    9603

Attempts to "xm destroy" the domain do nothing.  That one zombie was
obtained by shutting down 12 guests on my earlier-mentioned P4 with
hyperthreading disabled; if I do that with HT on, I generally wind up
with 11 zombies.  With HT on and a "sleep 25" between successive "xm
shutdown" commands, I get maybe 6 of them.
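
For counting the zombies after each experiment, something like the
following could be used.  It is only a rough sketch: it assumes the
column layout of the listing above, and it assumes that an "s" in the
State column is what marks a domain as shut down.  The function names
are made up for illustration and are not part of xm or xend.

```python
def parse_xm_list(text):
    """Parse a listing in the format shown above into a list of dicts."""
    domains = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 6 or fields[0] == "Name":
            continue
        name, dom_id, mem, cpu, state, time_s = fields[:6]
        domains.append({
            "name": name,
            "id": int(dom_id),
            "mem_mb": int(mem),
            "cpu": int(cpu),
            "state": state,
            "time_s": float(time_s),
            # Domain-0 has no console port in the listing.
            "console": fields[6] if len(fields) > 6 else None,
        })
    return domains


def stuck_in_shutdown(domains):
    """Return the domains whose state flags include 's' (assumed: shutdown)."""
    return [d for d in domains if "s" in d["state"]]


SAMPLE = """\
  Name              Id  Mem(MB)  CPU  State  Time(s)  Console
  Domain-0           0      127    0  r----     29.0
  unreal02           3        0    0  ---s-      4.1    9603
"""

if __name__ == "__main__":
    for d in stuck_in_shutdown(parse_xm_list(SAMPLE)):
        print(d["name"], d["id"], d["state"])  # prints: unreal02 3 ---s-
```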

Meanwhile, every two seconds, xend logs this:

  [2005-06-23 19:54:09 xend] DEBUG (XendDomain:244) XendDomain>reap> domain died name=unreal02 id=3
  [2005-06-23 19:54:09 xend] DEBUG (XendDomain:247) XendDomain>reap> shutdown id=3 reason=poweroff
  [2005-06-23 19:54:09 xend] DEBUG (XendDomain:487) domain_restart_schedule> 3 poweroff 0
  [2005-06-23 19:54:09 xend] INFO (XendDomain:564) Destroying domain: name=unreal02
  [2005-06-23 19:54:09 xend] DEBUG (XendDomainInfo:634) Closing console, domain 3
  [2005-06-23 19:54:09 xend] INFO (XendRoot:112) EVENT> xend.domain.exit ['unreal02', '3', 'poweroff']
  [2005-06-23 19:54:09 xend] INFO (XendRoot:112) EVENT> xend.domain.destroy ['unreal02', '3']
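
To see how often xend retries the same domain, the log can be tallied
by name and id.  This is a minimal sketch that assumes the "domain
died" lines look exactly like the excerpt above; the regular
expression and the function name are illustrative, not anything xend
provides.

```python
import re
from collections import Counter

# Matches the "domain died" reap line as it appears in the excerpt above.
REAP_RE = re.compile(r"XendDomain>reap> domain died name=(\S+) id=(\d+)")


def count_reaps(log_text):
    """Count reap attempts per (name, id) pair found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = REAP_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


SAMPLE_LOG = """\
[2005-06-23 19:54:09 xend] DEBUG (XendDomain:244) XendDomain>reap> domain died name=unreal02 id=3
[2005-06-23 19:54:09 xend] DEBUG (XendDomain:247) XendDomain>reap> shutdown id=3 reason=poweroff
[2005-06-23 19:54:09 xend] INFO (XendDomain:564) Destroying domain: name=unreal02
"""

if __name__ == "__main__":
    for (name, dom_id), n in count_reaps(SAMPLE_LOG).items():
        print(name, dom_id, n)  # prints: unreal02 3 1
```

A count that keeps growing for a single (name, id) pair would match
the repeated reap attempts described here.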

With ktrace/kdump I can see it doing a bunch of dom0_op hypercalls (but
of course I can't follow the pointer to see the details), and some
of them fail with ESRCH.  So... it almost looks like xend is getting
confused as to which domains are actually up.  (Which, incidentally, is
how I ran into the xend-restart-panic bug: I tried to restart xend to
see if that would clear things up.)

Now, if I try to restart that domain, I get this:

  Name              Id  Mem(MB)  CPU  State  Time(s)  Console
  Domain-0           0      127    0  r----     35.7        
  Domain-13         13       64    0  --p--      0.0        
  unreal02          13       63    0  -b---      3.6    9613

And the new copy of the host works fine (that is, it isn't blatantly
broken), although xend continues to attempt to destroy domain 3 (not 13)
every few minutes, and of course the domain list is a little screwed up:
there is a stray "Domain-13" entry sharing id 13 with the new guest.
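
A quick way to spot this kind of inconsistency is to group the names
in the listing by id and report any id that appears more than once.
Again this is just a sketch: it assumes the listing format shown
above, and the helper names are hypothetical.

```python
from collections import defaultdict


def names_by_id(listing):
    """Map each domain id in the listing to the names that claim it."""
    ids = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything without a numeric id.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        ids[int(fields[1])].append(fields[0])
    return ids


def duplicate_ids(listing):
    """Return only the ids that are listed under more than one name."""
    return {i: names for i, names in names_by_id(listing).items()
            if len(names) > 1}


SAMPLE = """\
  Name              Id  Mem(MB)  CPU  State  Time(s)  Console
  Domain-0           0      127    0  r----     35.7
  Domain-13         13       64    0  --p--      0.0
  unreal02          13       63    0  -b---      3.6    9613
"""

if __name__ == "__main__":
    print(duplicate_ids(SAMPLE))  # prints: {13: ['Domain-13', 'unreal02']}
```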

So, if anyone who knows more about the innards of this stuff can suggest
where to look next, at least to see if it's NetBSD or xend that might be
responsible for this, that would be nice.  (Though one would think that,
if it were OS-independent, someone would have noticed and fixed it.)


-- 
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map
((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l))))))  (lambda
(f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l))
(C k)))))))    '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))