tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: CVS commit: src/sys/arch/xen



On Mon, Aug 29, 2011 at 12:07:05PM +0200, Cherry G. Mathew wrote:
>     JM> On Mon, 22 Aug 2011 12:47:40 +0200, Manuel Bouyer wrote:
>     >>> This is slightly more complicated than it appears. Some of the
>     >>> "ops" in a per-cpu queue may have ordering dependencies with
>     >>> other cpu queues, and I think this would be hard to express
>     >>> trivially. (an example would be a pte update on one queue, and
>     >>> reading the same pte read on another queue - these cases are
>     >>> quite analogous (although completely unrelated)
>     >> 
> 
> Hi,
> 
> So I had a better look at this - implemented per-cpu queues and messed
> with locking a bit:
> 
> 
>     >> read don't go through the xpq queue, don't they ?
> 
>     JM> Nope, PTE are directly obtained from the recursive mappings
>     JM> (vtopte/kvtopte).
> 
> Let's call this "out of band" reads. But see below for "in-band" reads.
> 
>     JM> Content is "obviously" only writable by hypervisor (so it can
>     JM> keep control of his mapping alone).
>     >> I think this is similar to a tlb flush but the other way round, I
>     >> guess we could use a IPI for this too.
> 
>     JM> IIRC that's what the current native x86 code does: it uses an
>     JM> IPI to signal other processors that a shootdown is necessary.
> 
> Xen's TLB_FLUSH operation is synchronous, and doesn't require an IPI
> (within the domain), which makes the queue ordering even more important
> (to make sure that stale ptes are not reloaded before the per-cpu queue
> has made progress). Yes, we can implement a roundabout ipi driven
> queueflush + tlbflush scheme(described below), but that would be
> performance sensitive, and the basic issue won't go away, imho.
> 
> Let's stick to the xpq ops for a second, ignoring "out-of-band" reads
> (for which I agree that your assertion, that locking needs to be done at
> a higher level, holds true).
> 
> The question here, really is, what are the global ordering requirements
> of per-cpu memory op queues, given the following basic "ops":
> 
> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
> ii) read memory
>     via:
>     MMUEXT_PIN_L1_TABLE
>     MMUEXT_PIN_L2_TABLE
>     MMUEXT_PIN_L3_TABLE
>     MMUEXT_PIN_L4_TABLE
>     MMUEXT_UNPIN_TABLE

This is when adding/removing a page table from a pmap. When this occurs,
the pmap is locked, isn't it ?

>     MMUEXT_NEW_BASEPTR
>     MMUEXT_NEW_USER_BASEPTR

This is a context switch

>     MMUEXT_TLB_FLUSH_LOCAL
>     MMUEXT_INVLPG_LOCAL
>     MMUEXT_TLB_FLUSH_MULTI
>     MMUEXT_INVLPG_MULTI
>     MMUEXT_TLB_FLUSH_ALL
>     MMUEXT_INVLPG_ALL
>     MMUEXT_FLUSH_CACHE

This may, or may not, cause a read. This usually happens after updating
the pmap, and I guess this also happens with the pmap locked (I have not
carefully checked).

So couldn't we just use the pmap lock for this ?
I suspect the same lock will also be needed for out of band reads at some
point (right now it's protected by splvm()).

> [...]
> 
>     >>> I'm thinking that it might be easier and more justifiable to
>     >>> nuke the current queue scheme and implement shadow page tables,
>     >>> which would fit more naturally and efficiently with CAS pte
>     >>> updates, etc.
>     >> 
>     >> I'm not sure this would completely fis the issue: with shadow
>     >> page tables you can't use a CAS to assure atomic operation with
>     >> the hardware TLB, as this is, precisely, a shadow PT and not the
>     >> one used by hardware.
> 
> Definitely worth looking into, I imho. I'm not very comfortable with
> the queue based scheme for MP.
> 
> the CAS doesn't provide any guarantees with the TLB on native h/w,
> afaict.

What makes you think so ? I think the hw TLB also does CAS to update
referenced and dirty bits in the PTE, otherwise we couldn't rely on these
bits; this would be bad especially for the dirty bit.

> If you do a CAS pte update, and the update succeeded, it's a
> good idea to invalidate + shootdown anyway (even on baremetal).

Yes, of course inval is needed after updating the PTE. But using a true CAS
is important to get the refereced and dirty bits right.

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 ans d'experience feront toujours la difference
--


Home | Main Index | Thread Index | Old Index