Subject: Re: System hangs during daily jobs
To: Jeff Rizzo <riz@tastylime.net>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-xen
Date: 06/10/2006 19:15:46
On Sat, Jun 10, 2006 at 09:20:55AM -0700, Jeff Rizzo wrote:
> Manuel Bouyer wrote:
> > On Fri, Jun 09, 2006 at 07:24:23AM -0700, Jeff Rizzo wrote:
> >
> >
> > Please note (in case you didn't) that the magic string is +++++ on a
> > Xen kernel, and not break (the serial console is managed by xen, so we
> > can't see the break).
> >
>
> Doh! I knew it was probably something like that; I should have looked
> harder. :} I managed to get a backtrace from the dom0 kernel this time.
>
> >
> > Could you try running in UP mode (I think it's 'nosmp' on the xen command
> > line, or something like that) and see if it helps? The next thing to try
> > is to run an SMP Xen, but with both domains forced onto cpu 0.
> >
>
> I may try that, if I can set up a situation where I can force the crash
> at will (since I don't really want to wait 24h each time I tweak
> something if I can help it). Since it seems to happen during the daily
> job consistently, I will see if running them from the commandline will
> trigger the hang.
>
> > Also, you could try using 'q' after ^A^A^A, to see the state of
> > domains, and other useful information (the NetBSD dom0 kernel should print
> > a few things too, which can indicate how hard it's hung).
> >
> >
>
>
> Below is the backtrace from ddb, and the output from the Xen kernel. (I
> don't know anything about the Xen output - I assume the
> apparently-interlaced-with-other-stuff output is due to both dom0 and
> Xen outputting.)
>
> Stopped at netbsd:cpu_Debugger+0x4: leave
> db> bt
> cpu_Debugger(6dcc1c80,c09e2000,c09e2150,1,10) at netbsd:cpu_Debugger+0x4
> xencons_tty_input(c0a6dc00,c055a930,1,10,7) at netbsd:xencons_tty_input+0xa9
> xencons_intr(c0a6dc00,c062ab1c,0,c0aec100,0) at netbsd:xencons_intr+0x47
> evtchn_do_event(4,c062ab1c,0,ab24,0) at netbsd:evtchn_do_event+0x9f
> do_hypervisor_callback(c062ab1c,0,3b9a0011,31,11) at netbsd:do_hypervisor_callback+0xad
> hypervisor_callback(c0574c80,0,0,c02f472d,c0575000) at netbsd:hypervisor_callback+0x64
> cpu_switch(c0575000,0,cbcd7000,c02a99fe,c054fbc0) at netbsd:cpu_switch+0xd7
> ltsleep(c0574c80,4,c04d464f,0,0) at netbsd:ltsleep+0x427
> uvm_scheduler(c0573288,0,c0572b18,c04b80d4,c037351c) at netbsd:uvm_scheduler+0xaa
> main(c0100177,c010017f,0,0,0) at netbsd:main+0x4f1
Hum. This reminds me of the "sleep forever" bug on sparc64. Can you try
setting a breakpoint on softclock() and see if it gets called?
If not, try a breakpoint on hardclock().
> db> (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch input to DOM0).
> (XEN) 'q' pressed -> dumping task queues (now=0x547C:6E183AC1)
> (XEN) Xen: DOM 0, CPU 0 [has=T] flags=106d refcnt=2 nr_pages=49135 xenheap_pages=2
> (XEN) Shared_info@00be6000: caf=80000003, taf=f0000003
> (XEN) Guest: upcall_pend = 00, upcall_mask = 00
> (XEN) Notifying guest...
> MdXeEbNu) Xegn :eve nDOt
> 3i_i, CPlUe v1el 0[hxac sci_i=peT]n difngla 0x83g0s= 10ci_0fide prtehf
> c1nt=
> kr_paegevstchn_u=pc6al5l_pen536ding xe n0 heevatpc_hpan_upcgaell_msa=2s
> (X1 EevN)t chSn_hpared_einfo@0n0dinbdgd0_se00l: caf= 080x0
> evtchn00_m003, atsakf =f00ff00ff90503
> b(2X ffENf) Gufffefsft f:f uffpfcffaf fllfffff_pffe ndf f= 0f0,f ufpfcff
> allffff_mafskfff = ffff 0f0fff ff
> f(XEfffN)f f Noftifyifng fguffefsfft .fff..
> fffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> evtchn_pending 1410 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0
Hum, I've never seen this on my systems; xen should serialize the outputs
(as it owns the console). You don't have an entry for the serial port
used as console in your dom0 kernel, do you?
There seem to be pending events, though. What does "dmesg | grep event" say?
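
For reference, here is a minimal sketch of how the evtchn_pending line quoted
above can be read. It assumes each value is a 32-bit word printed in
hexadecimal (the ffffffff words nearby suggest hex, but that is an
assumption), and that bit N of word W stands for event channel W*32+N. The
function name pending_channels is made up for this illustration.

```python
def pending_channels(words, bits_per_word=32):
    """Return the event-channel numbers whose bit is set in a pending bitmap.

    `words` is a list of strings as printed on the console. They are
    assumed (not confirmed by the kernel source here) to be hexadecimal.
    """
    channels = []
    for index, word in enumerate(words):
        value = int(word, 16)  # assumption: hex output
        for bit in range(bits_per_word):
            if value & (1 << bit):
                channels.append(index * bits_per_word + bit)
    return channels


# First few words of the evtchn_pending line from the dump above.
dump = "1410 0 0 0"
print(pending_channels(dump.split()))
```

Under those assumptions, the word 1410 corresponds to channels 4, 10 and 12
being marked pending.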
>
> and for good measure (not sure if it's useful here), here's the dump of
> the run queues:
> (XEN) Scheduler: Borrowed Virtual Time (bvt)
> (XEN) BVT: mcu=0x000186A0ns ctx_allow=0x004C4B40ns NOW=0x000054FDC442960F
> (XEN) CPU[00] svt=0x3D1C6C6C QUEUE rq fcffd120 n: fcffc084, p: fcffc278
> (XEN) 0: 32767 has=F mcua=10 ev=0xFFFFFFFF av=0xFFFFFFFF c=0x4E2DEE8EE4F9
> (XEN) l: fcffc084 n: fcffc278 p: fcffd120
> (XEN) 1: 0 has=T mcua=10 ev=0x3D1C6C6C av=0x3D1C6C6C c=0x6CFDD13C31C
> (XEN) l: fcffc278 n: fcffd120 p: fcffc084
> (XEN) CPU[01] svt=0x88F35DCC QUEUE rq fcffd140 n: fcffc214, p: fcffc214
> (XEN) 0: 32767 has=T mcua=10 ev=0xFFFFFFFF av=0xFFFFFFFF c=0x3FE7740AFDE1
> (XEN) l: fcffc214 n: fcffd140 p: fcffd140
>
>
> Unfortunately, I never set up a dump device on this machine, so I can't
> get a crash dump. (Would that even help?)
Not sure, and I'm not sure a Xen kernel can do a crash dump anyway ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference