Subject: Re: System hangs during daily jobs
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Jeff Rizzo <riz@tastylime.net>
List: port-xen
Date: 06/10/2006 09:20:55

Manuel Bouyer wrote:
> On Fri, Jun 09, 2006 at 07:24:23AM -0700, Jeff Rizzo wrote:
>
> Please note (in case you didn't) that the magic string is +++++ on a
> Xen kernel, and not break (the serial console is managed by xen, so we
> can't see the break).

Doh!  I knew it was probably something like that; I should have looked
harder.  :}  I managed to get a backtrace from the dom0 kernel this time.


>
> Could you try running in UP mode (I think it's 'nosmp' on the xen command
> line, or something like that) and see if it helps? Next thing to try
> is to run an SMP Xen, but with both domains forced on cpu 0.

I may try that, if I can set up a situation where I can force the crash
at will (since I don't really want to wait 24h each time I tweak
something if I can help it).  Since it seems to happen consistently during
the daily jobs, I will see if running them from the command line will
trigger the hang.
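
A minimal sketch of such a reproduction loop, assuming the daily jobs are
the stock /etc/daily script run through /bin/sh as root. The helper name
run_repeatedly and its parameters are hypothetical, chosen only for
illustration. Each run is reported with a flushed print so that the last
completed run is still visible on the console if the machine hangs.

```python
import subprocess
import time


def run_repeatedly(cmd, runs, pause_s=0.0, timeout_s=None):
    """Run cmd `runs` times; return a list of (run_index, returncode, seconds).

    A returncode of None means the command did not finish within timeout_s.
    """
    results = []
    for i in range(1, runs + 1):
        start = time.monotonic()
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            code = proc.returncode
        except subprocess.TimeoutExpired:
            code = None
        elapsed = time.monotonic() - start
        results.append((i, code, elapsed))
        # Flush immediately: if the system hangs, buffered output is lost.
        print(f"run {i}: exit={code} elapsed={elapsed:.1f}s", flush=True)
        time.sleep(pause_s)
    return results
```

On the affected machine this might be called as
run_repeatedly(["/bin/sh", "/etc/daily"], runs=10, pause_s=5.0); the path
and the run count are assumptions, not something confirmed in this thread.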

> Also, you could try using 'q' after ^A^A^A, to see the state of
> domains, and other useful info (the NetBSD dom0 kernel should print
> a few things too, it can be an indication of how hard it's hung)


Below is the backtrace from ddb, and the output from the Xen kernel.  (I
don't know anything about the Xen output - I assume the
apparently-interlaced-with-other-stuff output is due to both dom0 and
Xen outputting at the same time.)

Stopped at      netbsd:cpu_Debugger+0x4:        leave
db> bt
cpu_Debugger(6dcc1c80,c09e2000,c09e2150,1,10) at netbsd:cpu_Debugger+0x4
xencons_tty_input(c0a6dc00,c055a930,1,10,7) at netbsd:xencons_tty_input+0xa9
xencons_intr(c0a6dc00,c062ab1c,0,c0aec100,0) at netbsd:xencons_intr+0x47
evtchn_do_event(4,c062ab1c,0,ab24,0) at netbsd:evtchn_do_event+0x9f
do_hypervisor_callback(c062ab1c,0,3b9a0011,31,11) at netbsd:do_hypervisor_callback+0xad
hypervisor_callback(c0574c80,0,0,c02f472d,c0575000) at netbsd:hypervisor_callback+0x64
cpu_switch(c0575000,0,cbcd7000,c02a99fe,c054fbc0) at netbsd:cpu_switch+0xd7
ltsleep(c0574c80,4,c04d464f,0,0) at netbsd:ltsleep+0x427
uvm_scheduler(c0573288,0,c0572b18,c04b80d4,c037351c) at netbsd:uvm_scheduler+0xaa
main(c0100177,c010017f,0,0,0) at netbsd:main+0x4f1
db> (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch input to DOM0).
(XEN) 'q' pressed -> dumping task queues (now=0x547C:6E183AC1)
(XEN) Xen: DOM 0, CPU 0 [has=T] flags=106d refcnt=2 nr_pages=49135 xenheap_pages=2
(XEN) Shared_info@00be6000: caf=80000003, taf=f0000003
(XEN) Guest: upcall_pend = 00, upcall_mask = 00
(XEN) Notifying guest...
MdXeEbNu) Xegn :eve nDOt
3i_i, CPlUe v1el  0[hxac sci_i=peT]n difngla 0x83g0s= 10ci_0fide prtehf
c1nt=
kr_paegevstchn_u=pc6al5l_pen536ding xe n0 heevatpc_hpan_upcgaell_msa=2s
  (X1 EevN)t chSn_hpared_einfo@0n0dinbdgd0_se00l: caf= 080x0
evtchn00_m003, atsakf =f00ff00ff90503
b(2X ffENf) Gufffefsft f:f uffpfcffaf fllfffff_pffe ndf f= 0f0,f ufpfcff
allffff_mafskfff = ffff 0f0fff ff
f(XEfffN)f f Noftifyifng fguffefsfft .fff..
fffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
evtchn_pending 1410 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
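
For reading the ddb backtrace above, a small sketch that pulls the function
name and offset out of each frame and prints the call chain with the
outermost caller first. It assumes each frame sits on one line in the form
"name(args) at netbsd:name+0xOFF" (the wrapped lines joined), and the names
FRAME_RE, parse_frames and call_chain are hypothetical helpers, not part of
ddb.

```python
import re

# One frame: name(args) at netbsd:name+0xOFFSET
FRAME_RE = re.compile(
    r"^(?P<func>\w+)\((?P<args>[^)]*)\) at netbsd:(?P<sym>\w+)\+0x(?P<off>[0-9a-f]+)$"
)


def parse_frames(text):
    """Return [(function, offset)] in the order ddb printed them (innermost first)."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m.group("func"), int(m.group("off"), 16)))
    return frames


def call_chain(frames):
    """Join frame names from the outermost caller to the innermost callee."""
    return " -> ".join(func for func, _ in reversed(frames))


BACKTRACE = """\
cpu_Debugger(6dcc1c80,c09e2000,c09e2150,1,10) at netbsd:cpu_Debugger+0x4
xencons_tty_input(c0a6dc00,c055a930,1,10,7) at netbsd:xencons_tty_input+0xa9
xencons_intr(c0a6dc00,c062ab1c,0,c0aec100,0) at netbsd:xencons_intr+0x47
evtchn_do_event(4,c062ab1c,0,ab24,0) at netbsd:evtchn_do_event+0x9f
do_hypervisor_callback(c062ab1c,0,3b9a0011,31,11) at netbsd:do_hypervisor_callback+0xad
hypervisor_callback(c0574c80,0,0,c02f472d,c0575000) at netbsd:hypervisor_callback+0x64
cpu_switch(c0575000,0,cbcd7000,c02a99fe,c054fbc0) at netbsd:cpu_switch+0xd7
ltsleep(c0574c80,4,c04d464f,0,0) at netbsd:ltsleep+0x427
uvm_scheduler(c0573288,0,c0572b18,c04b80d4,c037351c) at netbsd:uvm_scheduler+0xaa
main(c0100177,c010017f,0,0,0) at netbsd:main+0x4f1
"""

print(call_chain(parse_frames(BACKTRACE)))
```

Read this way, the chain runs from main through uvm_scheduler, ltsleep and
cpu_switch, and the frames from hypervisor_callback down to cpu_Debugger are
the console event that dropped into ddb.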


and for good measure (not sure if it's useful here), here's the dump of
the run queues:
(XEN) Scheduler: Borrowed Virtual Time (bvt)
(XEN) BVT: mcu=0x000186A0ns ctx_allow=0x004C4B40ns NOW=0x000054FDC442960F
(XEN) CPU[00] svt=0x3D1C6C6C QUEUE rq fcffd120   n: fcffc084, p: fcffc278
(XEN)   0: 32767 has=F mcua=10 ev=0xFFFFFFFF av=0xFFFFFFFF c=0x4E2DEE8EE4F9
(XEN)          l: fcffc084 n: fcffc278  p: fcffd120
(XEN)   1: 0 has=T mcua=10 ev=0x3D1C6C6C av=0x3D1C6C6C c=0x6CFDD13C31C
(XEN)          l: fcffc278 n: fcffd120  p: fcffc084
(XEN) CPU[01] svt=0x88F35DCC QUEUE rq fcffd140   n: fcffc214, p: fcffc214
(XEN)   0: 32767 has=T mcua=10 ev=0xFFFFFFFF av=0xFFFFFFFF c=0x3FE7740AFDE1
(XEN)          l: fcffc214 n: fcffd140  p: fcffd140
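
To compare the run-queue entries side by side, a rough sketch that splits
one entry line of the dump into its key=value fields, with the mail
encoding's "=3D" read as "=". It assumes the first number is the position in
the queue and the second is a domain id; that reading, and the names
ENTRY_RE and parse_entry, are assumptions made for illustration and are not
taken from the Xen source.

```python
import re

# Entry line: "(XEN)   <index>: <id> key=value key=value ..."
ENTRY_RE = re.compile(r"^\(XEN\)\s+(?P<index>\d+): (?P<id>\d+) (?P<fields>.+)$")


def parse_entry(line):
    """Return a dict for a run-queue entry line, or None for any other line."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    # Values are kept as strings; nothing here interprets their meaning.
    fields = dict(kv.split("=", 1) for kv in m.group("fields").split())
    return {"index": int(m.group("index")), "id": int(m.group("id")), **fields}


line = "(XEN)   1: 0 has=T mcua=10 ev=0x3D1C6C6C av=0x3D1C6C6C c=0x6CFDD13C31C"
entry = parse_entry(line)
print(entry["id"], entry["has"], entry["ev"])
```

Header lines such as "(XEN) CPU[00] ..." and the "l: ... n: ... p: ..."
pointer lines do not match the pattern and come back as None.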


Unfortunately, I never set up a dump device on this machine, so I can't
get a crash dump.  (Would that even help?)

thanks,

+j


