tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: CVS commit: src/sys/arch/xen

On Mon, Aug 29, 2011 at 03:03:37PM +0200, Cherry G. Mathew wrote:
> Hi Manuel,
> >>>>> "Manuel" == Manuel Bouyer <> writes:
> [...]
>     >> 
>     >> Xen's TLB_FLUSH operation is synchronous, and doesn't require an
>     >> IPI (within the domain), which makes the queue ordering even more
>     >> important (to make sure that stale ptes are not reloaded before
>     >> the per-cpu queue has made progress). Yes, we can implement a
>     >> roundabout ipi driven queueflush + tlbflush scheme(described
>     >> below), but that would be performance sensitive, and the basic
>     >> issue won't go away, imho.
>     >> 
>     >> Let's stick to the xpq ops for a second, ignoring "out-of-band"
>     >> reads (for which I agree that your assertion, that locking needs
>     >> to be done at a higher level, holds true).
>     >> 
>     >> The question here, really is, what are the global ordering
>     >> requirements of per-cpu memory op queues, given the following
>     >> basic "ops":
>     >> 
>     >> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
>     >> ii) read memory via: MMUEXT_PIN_L1_TABLE MMUEXT_PIN_L2_TABLE
>     Manuel> This is when adding/removing a page table from a pmap. When
>     Manuel> this occurs, the pmap is locked, isn't it ?
>     Manuel> This is a context switch
>     Manuel> This may, or may not, cause a read. This usually happens
>     Manuel> after updating the pmap, and I guess this also happens with
>     Manuel> the pmap locked (I have not carefully checked).
>     Manuel> So couldn't we just use the pmap lock for this ?  I suspect
>     Manuel> the same lock will also be needed for out of band reads at
>     Manuel> some point (right now it's protected by splvm()).
> I'm a bit confused now - are we assuming that the pmap lock protects the
> (pte update op queue-push(es) + pmap_pte_flush()) as a single atomic
> operation (ie; no invlpg/tlbflush or out-of-band-read can occur between
> the update(s) and the pmap_pte_flush()) ?

out of band reads can always occurs, there's no lock which can protect against

> If so, I think I've slightly misunderstood the scope of the mmu queue
> design - I assumed that the queue is longer-lived, and the sync points
> (for the queue flush) can span across pmap locking - a sort of lazy pte
> update, with the queue being flushed at out-of-band or in-band read
> time ( I guess that won't work though - how does one know when the
> hardware walks the page table ?) . It seems that the queue is meant for
> pte updates in loops, for eg:, quickly followed by a flush. Is this
> correct ? 

it was not explicitely designed this way, but I think that's how things are
in practice, yes. Usage would need to be checked, though.
There may be some special case in the kernel pmap area ...

> If so, there's just one hazard afaict - the synchronous TLB_FLUSH_MULTI
> could beat the race between the queue update and the queue flush via
> pmap_tlb_shootnow() (see pmap_tlb.c on the cherry-xenmp branch, and *if*
> other CPUs reload their TLBs before the flush, they'll have stale info.
> So the important question (rmind@ ?) is, is pmap_tlb_shootnow()
> guaranteed to be always called with the pmap lock held ?

I don't think so; but I also don't think that's the problem.
There shouldn't be more race with this than on native hardware.

> In real life, I removed the global xpq_queue_lock() and the pmap was
> falling apart. So a bit of debugging ahead.

Hum, on seocnd though, something more may be needed to protect the queue.
The pmap lock won't probably work for pmap_kernel ...

>     >> [...]
>     >> 
>     >> >>> I'm thinking that it might be easier and more justifiable to
>     >> >>> nuke the current queue scheme and implement shadow page
>     >> tables, >>> which would fit more naturally and efficiently with
>     >> CAS pte >>> updates, etc.
>     >> >> 
>     >> >> I'm not sure this would completely fis the issue: with shadow
>     >> >> page tables you can't use a CAS to assure atomic operation
>     >> with >> the hardware TLB, as this is, precisely, a shadow PT and
>     >> not the >> one used by hardware.
>     >> 
>     >> Definitely worth looking into, I imho. I'm not very comfortable
>     >> with the queue based scheme for MP.
>     >> 
>     >> the CAS doesn't provide any guarantees with the TLB on native
>     >> h/w, afaict.
>     Manuel> What makes you think so ? I think the hw TLB also does CAS
>     Manuel> to update referenced and dirty bits in the PTE, otherwise we
>     Manuel> couldn't rely on these bits; this would be bad especially
>     Manuel> for the dirty bit.
> Yes, I missed that one (which is much of the point of the CAS in the
> first place!), you're right.
>     >> If you do a CAS pte update, and the update succeeded, it's a good
>     >> idea to invalidate + shootdown anyway (even on baremetal).
>     Manuel> Yes, of course inval is needed after updating the PTE. But
>     Manuel> using a true CAS is important to get the refereced and dirty
>     Manuel> bits right.
> I haven't looked into this in detail, but I thought that xen disconnects
> the shadow (presumably with correct update/dirty bits)  and flushes the
> TLB when a write to the shadow occurs ? In which case these bits will be
> in sync until the next h/w access ? I'm speculating, I haven't looked at
> the xen code for this yet.

It's xen is emulating atomic operations on the shadow then that's probably
the way to go.

Manuel Bouyer <>
     NetBSD: 26 ans d'experience feront toujours la difference

Home | Main Index | Thread Index | Old Index