Re: CVS commit: src/sys/arch/xen

To: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Subject: Re: CVS commit: src/sys/arch/xen
From: Cherry G. Mathew <cherry%zyx.in@localhost>
Date: Mon, 29 Aug 2011 15:03:37 +0200

Hi Manuel,

>>>>> "Manuel" == Manuel Bouyer <bouyer%antioche.eu.org@localhost> writes:

[...]

    >> 
    >> Xen's TLB_FLUSH operation is synchronous, and doesn't require an
    >> IPI (within the domain), which makes the queue ordering even more
    >> important (to make sure that stale ptes are not reloaded before
    >> the per-cpu queue has made progress). Yes, we can implement a
    >> roundabout ipi driven queueflush + tlbflush scheme(described
    >> below), but that would be performance sensitive, and the basic
    >> issue won't go away, imho.
    >> 
    >> Let's stick to the xpq ops for a second, ignoring "out-of-band"
    >> reads (for which I agree that your assertion, that locking needs
    >> to be done at a higher level, holds true).
    >> 
    >> The question here, really is, what are the global ordering
    >> requirements of per-cpu memory op queues, given the following
    >> basic "ops":
    >> 
    >> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
    >> ii) read memory via: MMUEXT_PIN_L1_TABLE MMUEXT_PIN_L2_TABLE
    >> MMUEXT_PIN_L3_TABLE MMUEXT_PIN_L4_TABLE MMUEXT_UNPIN_TABLE

    Manuel> This is when adding/removing a page table from a pmap. When
    Manuel> this occurs, the pmap is locked, isn't it ?

    >> MMUEXT_NEW_BASEPTR MMUEXT_NEW_USER_BASEPTR

    Manuel> This is a context switch

    >> MMUEXT_TLB_FLUSH_LOCAL MMUEXT_INVLPG_LOCAL MMUEXT_TLB_FLUSH_MULTI
    >> MMUEXT_INVLPG_MULTI MMUEXT_TLB_FLUSH_ALL MMUEXT_INVLPG_ALL
    >> MMUEXT_FLUSH_CACHE

    Manuel> This may, or may not, cause a read. This usually happens
    Manuel> after updating the pmap, and I guess this also happens with
    Manuel> the pmap locked (I have not carefully checked).

    Manuel> So couldn't we just use the pmap lock for this ?  I suspect
    Manuel> the same lock will also be needed for out of band reads at
    Manuel> some point (right now it's protected by splvm()).

I'm a bit confused now - are we assuming that the pmap lock protects the
(pte update op queue-push(es) + pmap_pte_flush()) as a single atomic
operation (ie; no invlpg/tlbflush or out-of-band-read can occur between
the update(s) and the pmap_pte_flush()) ?

If so, I think I've slightly misunderstood the scope of the mmu queue
design - I assumed that the queue is longer-lived, and the sync points
(for the queue flush) can span across pmap locking - a sort of lazy pte
update, with the queue being flushed at out-of-band or in-band read
time ( I guess that won't work though - how does one know when the
hardware walks the page table ?) . It seems that the queue is meant for
pte updates in loops, for eg:, quickly followed by a flush. Is this
correct ? 

If so, there's just one hazard afaict - the synchronous TLB_FLUSH_MULTI
could beat the race between the queue update and the queue flush via
pmap_tlb_shootnow() (see pmap_tlb.c on the cherry-xenmp branch, and *if*
other CPUs reload their TLBs before the flush, they'll have stale info.

So the important question (rmind@ ?) is, is pmap_tlb_shootnow()
guaranteed to be always called with the pmap lock held ?

In real life, I removed the global xpq_queue_lock() and the pmap was
falling apart. So a bit of debugging ahead.

    >> [...]
    >> 
    >> >>> I'm thinking that it might be easier and more justifiable to
    >> >>> nuke the current queue scheme and implement shadow page
    >> tables, >>> which would fit more naturally and efficiently with
    >> CAS pte >>> updates, etc.
    >> >> 
    >> >> I'm not sure this would completely fis the issue: with shadow
    >> >> page tables you can't use a CAS to assure atomic operation
    >> with >> the hardware TLB, as this is, precisely, a shadow PT and
    >> not the >> one used by hardware.
    >> 
    >> Definitely worth looking into, I imho. I'm not very comfortable
    >> with the queue based scheme for MP.
    >> 
    >> the CAS doesn't provide any guarantees with the TLB on native
    >> h/w, afaict.

    Manuel> What makes you think so ? I think the hw TLB also does CAS
    Manuel> to update referenced and dirty bits in the PTE, otherwise we
    Manuel> couldn't rely on these bits; this would be bad especially
    Manuel> for the dirty bit.

Yes, I missed that one (which is much of the point of the CAS in the
first place!), you're right.

    >> If you do a CAS pte update, and the update succeeded, it's a good
    >> idea to invalidate + shootdown anyway (even on baremetal).

    Manuel> Yes, of course inval is needed after updating the PTE. But
    Manuel> using a true CAS is important to get the refereced and dirty
    Manuel> bits right.

I haven't looked into this in detail, but I thought that xen disconnects
the shadow (presumably with correct update/dirty bits)  and flushes the
TLB when a write to the shadow occurs ? In which case these bits will be
in sync until the next h/w access ? I'm speculating, I haven't looked at
the xen code for this yet.

-- 
Cherry

Follow-Ups:
- Re: CVS commit: src/sys/arch/xen
  - From: Jean-Yves Migeon
- Re: CVS commit: src/sys/arch/xen
  - From: Manuel Bouyer

References:
- Re: CVS commit: src/sys/arch/xen
  - From: Cherry G . Mathew
- Re: CVS commit: src/sys/arch/xen
  - From: Manuel Bouyer

Prev by Date: Re: Addition to kauth(9) framework
Next by Date: re: Addition to kauth(9) framework
Previous by Thread: Re: CVS commit: src/sys/arch/xen
Next by Thread: Re: CVS commit: src/sys/arch/xen
Indexes:

Home | Main Index | Thread Index | Old Index