tech-kern archive


Re: Xen 3.3: Problem HVM guest



On Friday 15 August 2008 10:49:16 Christoph Egger wrote:
> On Friday 15 August 2008 10:18:31 Christoph Egger wrote:
> > On Thursday 14 August 2008 23:39:23 Christoph Egger wrote:
> > > Manuel Bouyer wrote:
> > > > On Thu, Aug 14, 2008 at 08:25:14PM +0200, Christoph Egger wrote:
> > > >>> Not really, as the write which is failing is also in dom0 (so on
> > > >>> the same CPU). I think the tlb should be properly invalidated. Just
> > > >>> to make sure you can try adding
> > > >>> pmap_tlb_shootdown(pmap, va, 0, opte);
> > > >>> just after xpq_update_foreign() in pmap_enter_ma(). But as we're
> > > >>> switching pmaps on return to userland, this shouldn't be needed.
> > > >>
> > > >> This has no impact.
> > > >
> > > > As expected ... I'm running out of ideas. I'll try to reproduce this
> > > > on my test box, but it won't be before next week.
> > >
> > > I found the bug:
> > >  >>>>> - instrument privpgop_fault() to see if it gets called at all
> > >  >>>>> for this mapping, and if it's doing the right thing.
> > >  >>>>>   There should be only one page in this object, and the machine
> > >  >>>>>   address should be 0 (pobj->maddr[maddr_i])
> > >  >>>>
> > >  >>>> Yes, privpgop_fault() is called. It looks like it's called in a
> > >  >>>> loop. npages = 1 and machine address is 0.
> > >  >>>
> > >  >>> OK, it has the right data. I guess it's called in a loop because
> > >  >>> writing at the page keeps failing.
> > >
> > > Writing at the page keeps failing because privpgop_fault()
> > > does not handle this case:
> > >
> > >           if (pobj->maddr[maddr_i] == 0)
> > >                continue; /* this has already been flagged as error */
> > >
> > > Removing this makes privpgop_fault() call pmap_enter_ma(),
> > > which makes the write access finally succeed, and the HVM guest
> > > starts.
> > >
> > > May I commit this change?
> >
> > The story is not over yet. When running a HVM guest, the machine
> > suddenly freezes with this message:
> >
> > Mutex error: mutex_spin_retry: locking against myself
> >
> > lock address : 0xffffffff80b86a80
> > current cpu  :                  0
> > current lwp  : 0xffffa000257e47e0
> > owner field  : 0x0000000000010700 wait/spin:                0/1
> >
> > The machine freezes completely: no keyboard interrupt, no serial console,
> > and no network. The machine can't be pinged from outside.
> >
> >
> > What I figured out so far:
> >
> > a) I can only reproduce this with / on nfs. (So is this NetBSD/Xen
> >    specific?)
> > b) The values are always the same.
>
> A LOCKDEBUG  Dom0 kernel panics with this:
>
> Mutex error: mutex_vector_enter: locking against myself
>
> lock address : 0xffffa00023206f48 type     :               spin
> initialized  : 0xffffffff803f5dee
> shared holds :                  0 exclusive:                  0
> shares wanted:                  0 exclusive:                  1
> current cpu  :                  0 last held:                  0
> current lwp  : 0xffffa000260247c0 last held: 000000000000000000
> last locked  : 0xffffffff80407e00 unlocked : 0xffffffff80407e7e
> owner field  : 0x0000000000010700 wait/spin:                0/1
>
> panic: LOCKDEBUG
> fatal breakpoint trap in supervisor mode
> trap type 1 code 0 rip ffffffff804b936d cs e030 rflags 246 cr2
> ffffa000256d5000 cpl 8 rsp ffffa000260b7088
> Stopped in pid 457.1 (qemu-dm) at       netbsd:breakpoint+0x5:  leave
> breakpoint() at netbsd:breakpoint+0x5
> panic() at netbsd:panic+0x255
> lockdebug_abort1() at netbsd:lockdebug_abort1+0xd3
> mutex_vector_enter() at netbsd:mutex_vector_enter+0x1f0
> sleepq_remove() at netbsd:sleepq_remove+0x107
> cv_wakeup_all() at netbsd:cv_wakeup_all+0x81
> knote_activate() at netbsd:knote_activate+0x84
> knote() at netbsd:knote+0x36
> selnotify() at netbsd:selnotify+0x25
> logwakeup() at netbsd:logwakeup+0x3f
> printf() at netbsd:printf+0xfc
> xen_correctable_handler() at netbsd:xen_correctable_handler+0x25
> Xresume_xenev8() at netbsd:Xresume_xenev8+0x55
> --- interrupt ---
> Xspllower() at netbsd:Xspllower+0xe
> mi_switch() at netbsd:mi_switch+0x12e
> sleepq_block() at netbsd:sleepq_block+0xa0
> selcommon() at netbsd:selcommon+0x738
> sys_select() at netbsd:sys_select+0x6a
> syscall() at netbsd:syscall+0x98
> ds          0x7414
> es          0x7098
> fs          0x7414
> gs          0x8
> rdi         0x8
> rsi         0xdeadbeef
> rbp         0xffffa000260b7088
> rbx         0xffffa000260b7098
> rdx         0
> rcx         0
> rax         0x1
> r8          0xffffa000260b6fa8
> r9          0x1
> r10         0xffffa000260b6fa8
> r11         0xffffffff804f5560  xenconscn_putc
> r12         0x100
> r13         0xffffffff80867414  copyright+0x1a254
> r14         0x8
> r15         0x1
> rip         0xffffffff804b936d  breakpoint+0x5
> cs          0xe030
> rflags      0x246
> rsp         0xffffa000260b7088
> ss          0xe02b
> netbsd:breakpoint+0x5:  leave
> db> ps /l
>  PID         LID S     FLAGS       STRUCT LWP *               NAME WAIT
> >457           3 3   1000084   ffffa00025d067c0            qemu-dm aiowork
>                2 3   1000084   ffffa00026024000            qemu-dm netio
>            >   1 3     40084   ffffa000260247c0            qemu-dm select
>
> [...]
>
> Christoph

This is a non-issue for those using -current from CVS. I have some local
changes that add basic machine check support in order to test
the new machine check infrastructure in Xen 3.3.

On one particular machine, I see the strange effect that the
Dom0 machine check handler gets invoked without the hardware
actually reporting an error. The LOCKDEBUG panic happens because
I dump the error telemetry via printf() instead of
printf_nolog(). Using printf_nolog() fixes this
LOCKDEBUG panic.
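
For readers who want to see why the choice of print function matters, here is a minimal user-space sketch of the pattern the backtrace suggests. It is not NetBSD code. It assumes that the interrupted code holds a non-recursive spin lock which the log wakeup path also needs, and that the handler runs on the same CPU before that lock is released. All names prefixed with toy_, the sleepq_lock variable, and run_handler_while_locked() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a non-recursive spin mutex with owner tracking. */
struct toy_mutex {
    bool held;
    int  owner;   /* CPU id of the holder, meaningful only while held */
};

static struct toy_mutex sleepq_lock;   /* assumed: lock needed by the wakeup path */
static int self_lock_errors;           /* "locking against myself" detections */

/* Takes the lock, or reports a self-lock instead of panicking. */
static bool toy_mutex_enter(struct toy_mutex *m, int cpu)
{
    if (m->held && m->owner == cpu) {
        self_lock_errors++;
        return false;
    }
    m->held = true;
    m->owner = cpu;
    return true;
}

static void toy_mutex_exit(struct toy_mutex *m)
{
    m->held = false;
}

/* Console-only output: touches no locks. */
static void toy_printf_nolog(const char *msg)
{
    fputs(msg, stdout);
}

/* Console plus log: waking a log reader is modelled as taking sleepq_lock. */
static void toy_printf(const char *msg, int cpu)
{
    fputs(msg, stdout);
    if (toy_mutex_enter(&sleepq_lock, cpu))
        toy_mutex_exit(&sleepq_lock);
}

/* Models a handler that dumps telemetry from interrupt context. */
static void toy_mc_handler(int cpu, bool use_nolog)
{
    if (use_nolog)
        toy_printf_nolog("machine check: telemetry\n");
    else
        toy_printf("machine check: telemetry\n", cpu);
}

/* Runs the handler while the same CPU holds the lock; returns error count. */
int run_handler_while_locked(bool use_nolog)
{
    const int cpu = 0;

    self_lock_errors = 0;
    sleepq_lock.held = false;

    /* The interrupted code takes the lock first. */
    bool got = toy_mutex_enter(&sleepq_lock, cpu);
    assert(got);
    (void)got;

    /* The handler fires on the same CPU before the lock is released. */
    toy_mc_handler(cpu, use_nolog);

    toy_mutex_exit(&sleepq_lock);
    return self_lock_errors;
}
```

In this model, run_handler_while_locked(false) reports one self-lock and run_handler_while_locked(true) reports none. That matches the observation above, where switching the handler to printf_nolog() made the panic go away.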

Christoph

