Re: XIP

To: Chuck Silvers <chuq%chuq.com@localhost>
Subject: Re: XIP
From: Masao Uebayashi <uebayasi%tombi.co.jp@localhost>
Date: Wed, 27 Oct 2010 18:38:11 +0900
On Tue, Oct 26, 2010 at 02:06:38PM -0700, Chuck Silvers wrote:
> On Mon, Oct 25, 2010 at 02:09:43AM +0900, Masao Uebayashi wrote:
> > I think the uebayasi-xip branch is ready to be merged.
> 
> hi,
> 
> here's what I found looking at the current code in the branch:
> 
> 
>  - the biggest issue I had with the version that I reviewed earlier
>    was that it muddled the notion of a "managed" page.  you wanted
>    to create a new kind of partially-managed page for XIP devices
>    which would not be managed by the UVM in the sense that it could
>    contain different things depending on what was needed but only that
>    the page's mappings would be tracked so that pmap_page_protect()
>    could be called on it.  this is what led to all the pmap changes
>    the pmaps needed to be able to handle being called with a vm_page
>    pointer that didn't actually point to a struct vm_page.
> 
>    it looks like you've gotten rid of that, which I like, but you've
>    replaced it with allocating a full vm_page structure for every page in
>    an XIP device, which seems like a waste of memory.  as we discussed
>    earlier, I think it would be better to treat XIP pages as unmanaged
>    and change a few other parts of UVM to avoid needing to track the
>    mappings of XIP page mappings.  I have thoughts on how to do all that,
>    which I'll list at the end of this mail.  however, if XIP devices
>    are small enough that the memory overhead of treating device pages
>    as managed is reasonable, then I'll go along with it.
>    so how big do these XIP devices get?

It's waste of memory, yes.  With 64M ROM on arm (4K page, 80byte vm_page),
the array is 1.25M.  If vm_page's made a single pointer (sizeof(void
*) == 4), the array size becomes 64K.  Not small difference.
Typical XIP'ed products would be mobile devices with FlashROM, or small
servers with memory disk (md or xmd).  About 16M~1G RAM/ROM?

I made it back to have vm_page to simplify code.  We can make it
to vm_page_md or whatever minimal, once after we figure out the
new design of "MI vm_page_md".

> 
>    either way, the changes to the various pmaps to handle the fake vm_page
>    pointers aren't necessary anymore, so all those changes should be reverted.

Those mechanical vm_page -> vm_page_md changes done in pmaps have
a valid point by itself.  Passing around vm_page pointer across
pmap functions is unnecessary.  I'd rather say wrong.  All pmap
needs is vm_page_md.

I'd propose to do this vm_page -> vm_page_md change in pmaps first
in HEAD and sync the branch with it, rather than revert it.

> 
>    it doesn't look like the PMAP_UNMANAGED flag that you added before
>    is necessary anymore either, is that right?  if so, it should also
>    be reverted.  if not, what's it for now?

pmap_enter() passes paddr_t directly to pmap.  pmap has no clue if
the given paddr_t is to be cached/uncached.  So a separate flag,
PMAP_UNMANAGED.  Without it, a FlashROM which is registered as
(potentially) managed (using bus_space_physload_device) is always
mapped to user address as cached, even if it's not XIP.  This made
userland flash writer program not work.

The point is whether to be cached or not is decided by virtual
address.  Thus such an information should be stored in vm_map_entry
explicitly, in theory.

(This is related to framebuffers too.)

> 
>    if we do keep the vm_page structures for device pages, I think
>    it would be better to put them in the same vm_physseg array as
>    the normal memory rather than having a parallel set of functions
>    and data structures for device pages.  device pages would just be
>    loaded with avail_start and avail_end indicating that none of the pages
>    should be added to the freelist.  is there any other way that
>    device pages need to be treated differently from normal pages
>    as far as their physseg information is concerned?

Two arrays can be merged, and it's a more correct design, considering
physical segments are unique by physical addresses, either memory
or device.  They're separate just as a result of incremental
development.  I'll merge them.

> 
>  - I mentioned before that I think the driver for testing this using
>    normal memory to simulate an XIP device should just be the existing
>    "md" driver, rather than a whole new driver whose only purpose
>    would be testing the XIP code.  however, you wrote a separate
>    "xmd" driver anyway.  so I'll say again: I think the "xmd" should be
>    merged into the "md" driver.

It has turned out that the new xmd(4) is not really md(4); it's not
        md(4) modified to support XIP
but rather
        rom(4) emulated using RAM

I plan to merge xmd(4) into rom(4) or flash(4) when it's ready.
If you don't like xmd(4) to reserve dev namespace, I'd rather back
out xmd(4) totally.

If you don't like the name, suggest a better one. ;)

> 
>    you also have an "xmd_machdep.c" for various platforms, but nothing
>    in that file is really machine-specific.  rather than use the
>    machine-specific macros for converting between bytes and pages,
>    it would be better to either use the MI macros (ptoa / atop)
>    or just shift by PAGE_SHIFT, so that there doesn't need to be
>    extra code written for each platform.

Well, xmd_machdep.c is not really xmd(4)-specific.  It *is*
machine-specific.  It does vtophys(), not good put in MI part.

... and now I realize xmd_machdep_* should be renamed as pmap_*.  I'll fix this.

> 
>  - any function that might be called from a kernel module should always exist,
>    regardless of which kernel options are enabled.  the particular one
>    that I noticed is uvm_page_physload_device().  in general, it seems to me
>    that if this were better intergrated then it wouldn't be worthwhile having
>    an XIP kernel option at all, since there would be very little code to omit.

OK.

> 
>  - trivial amounts of code in kernel modules shouldn't be compile-time
>    conditional either, so that there don't need to be separate binaries
>    for the module with and without the option.  the various XIP bits in xmd.c
>    fall into this category.

OK.

> 
>  - as we discussed before, the addition of bus_space_physload() and related
>    interfaces seems very strange to me, especially since the two 
> implementations
>    you've added both do exactly the same thing (call bus_space_mmap() on the
>    start and end of the address range, and then call uvm_page_physload()
>    with the resulting paddr_t values).  is there any reason this can't be
>    a MI function?   also, the non-device versions of these are unused.

I'll move these code to sys/common.

bus_space_physload(9) will be used when a driver wants to register
its memory as a general purpose memory (linked into free list).
Think PCI RAM over bridges.  I've not written a driver to do this,
but it's trivial.

We also have to fix uvm_page_physload() so that it can be called
at run-time, not only boot-time.  But this is beyond this branch.

> 
>  - in many files the only change you made was to replace the include of
>    <uvm/uvm_extern.h> with <uvm/uvm.h>, what's the reason for that?

To reduce rebuild time when I worked on uvm_page.[ch].  Basically
uvm_page.h is internal to UVM.  Touching it outside of UVM is simply
wrong.

I'll put it back before the merge, to avoid potential build failures.

> 
>  - why is the "majors" entry for the "flash" driver in the MD majors files?
>    isn't this an MI driver?

Because I didn't know MI sys/conf/major. ;)  I'll fix this soon.

> 
>  - in flash.c, most of the body of flash_init() is ifdef'd out,
>    what's up with that?

Someone is working on a better flash(4).  Mine will be replaced
very soon.

> 
>  - we talked before about removing the "xip" mount option and enabling
>    this functionality automatically when it's available, which you did
>    but then recently changed back.  so I'll ask again,
>    why do we need a mount option?

I put it back because otherwise I can't know if a mount is XIP or
not explicitly.  Mount option is the only way to know mount condition.
This is more a VFS/mount design problem.

> 
>  - as we've discussed before, I think that the XIP genfs_getpages()
>    should be merged with the existing one before you merge this into
>    -current.  merging it as-is would make a mess, and it's better to
>    avoid creating messes than than to promise to clean them up later.
>    we've probably already spent more time arguing about it than
>    it would have taken you to do it.
> 
>  - similarly, I don't think that we should have a separate XIP putpages.
>    why doesn't the existing genfs_do_putpages() work for XIP pages as-is?

This part I'm not convinced.  Your argument can be hardly an excuse
for you to not work on merging genfs_getpages() and genfs_do_io(). ;)

If you *really* want me to work on this before the merge (which I
do hope not), I'd create a separate branch to refactor genfs totally
into smaller pieces as I did for uvm_fault.c.  If you trust my
refactoring-fu.

> 
>  - in getpages, rather than allocating a page of zeros for each file,
>    it would be better to use the same page of zeroes that mem.c uses.
>    if multiple subsystems need a page of zeroes, then I think that
>    UVM should provide this functionality rather than duplicating
>    that all over the place.  if you don't want to update all the copies
>    of mem.c, that's fine, but please put this in UVM instead of genfs.

I made it per-vnode, because If you share a single physical zero'ed
page, when a vnode is putpage'ed, it invalidates all mappings in
other vnodes pointing to the zero page in a pmap.  Doesn't this
sound strange?

Another reason is that by having a vm_page with a uvm_object as
its parent, fault handler doesn't need to know anything about the
special zero'ed page.

> 
>  - I don't remember if I asked about this before: is there some reason
>    for there to be XIP code in the fs-specific code?  couldn't this be
>    done in the VFS and vnode code instead so that it works for all
>    file system types instead of just FFS?

This is due to the current VFS design.  I agree that fs-neutral part
should be moved to common part.  It's just beyond this branch...

> 
>  - as you guessed in your mail, I don't like the new PQ_FIXED flag either.
>    it looks like it's only used for preventing the XIP pages from
>    being put on the paging queues, but that would be better done by
>    just initializing pg->wire_count to 1 so that UVM treats them
>    as permanently non-pageable (which they actually are).

Are you OK to prevent XIP pages from being loaned? ;)

The actual problems here are

 a) the generic fault handler doesn't know how to handle pages that
    never be paged out (device pages), and

 b) uvm loan uses vm_page.

You wanted device pages to be able to be loaned.  I agree the idea,
but I'm feeling like that the way the loan "state" is kept
(vm_page::loan_count) is wrong.  I've not figured out how to address
this yet.

> 
> 
> 
> that's all I noticed in the current version of the code.
> I haven't yet gone back and looked at my previous comments to see if
> I overlooked anything this time around, I'll do that tomorrow.
> 
> needless to say, I don't think this work is ready to be merged yet.
> 
> 
> 
> now here's the explanation I promised for how to treat XIP pages
> as unmanaged instead of managed.
> 
> first, some background for other people who don't know all this:
> the only reason that treating XIP pages as managed pages is
> relevant at all is because of the AMAP_SHARED flag in UVM,
> which allows anonymous memory to be shared between processes
> such that the changes made by one process are seen by the other.
> this impacts XIP pages (which are not anonymous) because a
> PROT_WRITE, MAP_PRIVATE mapping of an XIP vnode should point to
> the XIP pages as long as all access to the mapping is for reads,
> but when the mapping is written to then the XIP page should be
> copied to an anonymous page (the normal COW operation) but that
> new anonymous page should still be shared between all processes
> that are sharing the AMAP_SHARED mapping.  to force those other
> processes to take another page fault the next time they access
> their copy of the mapping (which we need to do so that they will
> start accessing the new anonymous page instead of the XIP page),
> we must invalidate all the other pmap entries for the XIP page,
> which we do by calling pmap_page_protect() on it.  the pmap layer
> tracks all the mappings of the page and thus it can find them all.
> 
> there are several ways that the AMAP_SHARED functionality is used,
> and unfortunately they would need to changed in different ways to
> make this work for XIP pages without needing to track mappings:
> 
>  - uvm_io(), which copies data between the kernel or current process
>    and an arbitrary other process address space.
>    currently this works by sharing the other address space with
>    the kernel via uvm_map_extract() and then just using uiomove()
>    to transfer the data.  this could be done instead by using part
>    of the uvm_fault() code to find the physical page in the other
>    address space that we want to access and lock it (ie. set PG_BUSY),
>    map the page into the kernel (perhaps using uvm_pager_mapin()),
>    transfer the data, then unmap the page and unlock it.
> 
>  - uvm_mremap(), which resizes an existing mapping.
>    this uses uvm_map_extract() internally, which uses AMAP_SHARED,
>    but the mremap operation doesn't actually need the semantics of
>    AMAP_SHARED since as mremap doesn't create any additional mappings
>    as far as applications are concerned.  the usage of AMAP_SHARED
>    is just a side-effect of the current implementation, which bends
>    over backward to call a bunch of existing higher-level functions
>    rather than doing something more direct (which would be simpler
>    and more efficient as well).
> 
>  - MAP_INHERIT_SHARE, used to implement minherit().
>    this is the one that is the most trouble, since it's what AMAP_SHARED
>    was invented for.  however, it's also of least importance since
>    some searching with google finds absolutely no evidence of
>    any application actually using it, just lots of copies of
>    the implementations and manpages for various operating systems.
> 
>    with that in mind, there are several ways this could be handled.
>       (1) just drop support for minherit() entirely.
>       (2) reject attempts to set MAP_INHERIT_SHARE on XIP mappings.
>       (3) copy XIP pages into normal anon pages when setting
>               MAP_INHERIT_SHARE.
>       (4) copy XIP pages into normal vnode pages when setting
>               MAP_INHERIT_SHARE.  this would mean that the getpages
>               code would need to look in the normal page cache
>               before using XIP pages.  I think this option would also
>               need getpages to know about the inherit flag to
>               correctly handle later faults on XIP mappings,
>               and there are probably other sublte complications.
> 
>    of these choices, (2) sounds like the best compromise to me.
> 
> 
> this approach would also bring back some issues where our previous
> discussion went around in circles, such as callers of VOP_GETPAGES()
> wanting vm_page pointers but XIP pages not having them, but
> I'll leave that additional discussion for future email if necessary.
> 
> I'm just going over all this now so that it's clear to everyone
> that this kind of approach is possible if the memory overhead of
> the full vm_page structures for XIP pages is deemed too high.

I'll reply this part in another mail.

> 
> 
> -Chuck

In summary, except a few mistakes (MI major), I'm pretty aware of
all the problems you've written *except* the loaning part.

Masao
Follow-Ups:
- Re: XIP
  - From: Chuck Silvers
- xmd(4) (Re: XIP)
  - From: Masao Uebayashi
References:
- XIP
  - From: Masao Uebayashi
- Re: XIP
  - From: Chuck Silvers
Prev by Date: Re: XIP
Next by Date: xmd(4) (Re: XIP)
Previous by Thread: Re: XIP
Next by Thread: xmd(4) (Re: XIP)
Indexes:
Home | Main Index | Thread Index | Old Index