evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)

To: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Subject: evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Tue, 03 Dec 2013 08:55:06 +0700

    Date:        Mon, 2 Dec 2013 12:29:13 +0100
    From:        Manuel Bouyer <bouyer%antioche.eu.org@localhost>
    Message-ID:  <20131202112913.GA892%asim.lip6.fr@localhost>

  | >     evtchn_do_event: handler 0xffffffff801f9c7f didn't lower ipl 8 7

  | Yes, is has been there for a long time, but we failed to find where is
  | problem is exactly.

I didn't get quite as close as I'd hoped, and I probably haven't
discovered anything that wasn't already known, but ...

For me at least, that message is triggered from statclock() called from
hardclock() - when statclock() starts the ipl is 7, when it exits, it is 8.

But it's not statclock's fault ... the ipl is raised as a side effect or
mutex_spin_enter - on the other hand, mutex_spin_exit only restores the
ipl when the mutex being released is the last one held by the CPU.

When statclock() - and hardclock() before it - is (or are) called, the
cpu (apparently) already holds a (spin) mutex (the ci_mtx_count field of
the cpu_info struct is -1).    Given that, and the way spin mutexes work,
statclock() (and then hardclock()) must return with the ipl higher.

What I haven't found yet is where that extra spin mutex is grabbed - but
it looks as if either there must be one, or the spin_mutex code has bugs
(and the latter doesn't seem likely - it would need to be some kind of
race against an interrupt handler, as this is a XEN Dom0 I am testing using,
and consequently, acts as if there is just one cpu).   In any case, things
are far too repeatable, for it to be some bizarre random timing event with
an interrupt just at the wrong nanosecond...

Unless someone else suddenly realises what's up - even if this is really
just SOP, that is, if there's some mutex that's meant to be held, and the
problem is the diagnostic objecting to normal expected conditions, then
the next time I get some time to look into this, I'll probably add some
instrumentation to the spin mutex functions, to record from where, and
which, mutexes are being locked, so when the (seemingly) spurious one is
detected, we can find out where it came from.

At the minute I also can't explain how there is apparently a spin mutex
still held, yet the IPL is not IPL_HIGH ... that also suggests some kind
of problem I believe - when the first spin mutex is grabbed, the ipl
should be being raised to IPL, and it should stay there until no more spin
locks are held.   Yet I (repeatedly, and reliably) observed statclock()
(and hardclock(), and other) functions being called with ipl lower than
IPL_HIGH (7 rather than 8) while ci_mtx_count is indicating a mutex is
held.    Weird.

kre

Follow-Ups:
- Re: evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)
  - From: Robert Elz
- Re: evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)
  - From: David Laight
- Re: evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)
  - From: Robert Elz
- Re: evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)
  - From: Manuel Bouyer

References:
- Re: xl or xm for xen ? (and more questions, maybe raidframe related)
  - From: Manuel Bouyer
- xl or xm for xen ? (and more questions, maybe raidframe related)
  - From: Robert Elz

Prev by Date: Your E-mail Has Won 1.5M USD
Next by Date: daily CVS update output
Previous by Thread: Re: xl or xm for xen ? (and more questions, maybe raidframe related)
Next by Thread: Re: evtchn_do_event: handler...didn't lower ipl (Was: Re: xl or xm for xen)
Indexes:

Home | Main Index | Thread Index | Old Index