Re: XIP

To: Masao Uebayashi <uebayasi%tombi.co.jp@localhost>
Subject: Re: XIP
From: Chuck Silvers <chuq%chuq.com@localhost>
Date: Tue, 26 Oct 2010 14:06:38 -0700
On Mon, Oct 25, 2010 at 02:09:43AM +0900, Masao Uebayashi wrote:
> I think the uebayasi-xip branch is ready to be merged.

hi,

here's what I found looking at the current code in the branch:


 - the biggest issue I had with the version that I reviewed earlier
   was that it muddled the notion of a "managed" page.  you wanted
   to create a new kind of partially-managed page for XIP devices
   which would not be managed by the UVM in the sense that it could
   contain different things depending on what was needed but only that
   the page's mappings would be tracked so that pmap_page_protect()
   could be called on it.  this is what led to all the pmap changes
   the pmaps needed to be able to handle being called with a vm_page
   pointer that didn't actually point to a struct vm_page.

   it looks like you've gotten rid of that, which I like, but you've
   replaced it with allocating a full vm_page structure for every page in
   an XIP device, which seems like a waste of memory.  as we discussed
   earlier, I think it would be better to treat XIP pages as unmanaged
   and change a few other parts of UVM to avoid needing to track the
   mappings of XIP page mappings.  I have thoughts on how to do all that,
   which I'll list at the end of this mail.  however, if XIP devices
   are small enough that the memory overhead of treating device pages
   as managed is reasonable, then I'll go along with it.
   so how big do these XIP devices get?

   either way, the changes to the various pmaps to handle the fake vm_page
   pointers aren't necessary anymore, so all those changes should be reverted.

   it doesn't look like the PMAP_UNMANAGED flag that you added before
   is necessary anymore either, is that right?  if so, it should also
   be reverted.  if not, what's it for now?

   if we do keep the vm_page structures for device pages, I think
   it would be better to put them in the same vm_physseg array as
   the normal memory rather than having a parallel set of functions
   and data structures for device pages.  device pages would just be
   loaded with avail_start and avail_end indicating that none of the pages
   should be added to the freelist.  is there any other way that
   device pages need to be treated differently from normal pages
   as far as their physseg information is concerned?

 - I mentioned before that I think the driver for testing this using
   normal memory to simulate an XIP device should just be the existing
   "md" driver, rather than a whole new driver whose only purpose
   would be testing the XIP code.  however, you wrote a separate
   "xmd" driver anyway.  so I'll say again: I think the "xmd" should be
   merged into the "md" driver.

   you also have an "xmd_machdep.c" for various platforms, but nothing
   in that file is really machine-specific.  rather than use the
   machine-specific macros for converting between bytes and pages,
   it would be better to either use the MI macros (ptoa / atop)
   or just shift by PAGE_SHIFT, so that there doesn't need to be
   extra code written for each platform.

 - any function that might be called from a kernel module should always exist,
   regardless of which kernel options are enabled.  the particular one
   that I noticed is uvm_page_physload_device().  in general, it seems to me
   that if this were better intergrated then it wouldn't be worthwhile having
   an XIP kernel option at all, since there would be very little code to omit.

 - trivial amounts of code in kernel modules shouldn't be compile-time
   conditional either, so that there don't need to be separate binaries
   for the module with and without the option.  the various XIP bits in xmd.c
   fall into this category.

 - as we discussed before, the addition of bus_space_physload() and related
   interfaces seems very strange to me, especially since the two implementations
   you've added both do exactly the same thing (call bus_space_mmap() on the
   start and end of the address range, and then call uvm_page_physload()
   with the resulting paddr_t values).  is there any reason this can't be
   a MI function?   also, the non-device versions of these are unused.

 - in many files the only change you made was to replace the include of
   <uvm/uvm_extern.h> with <uvm/uvm.h>, what's the reason for that?

 - why is the "majors" entry for the "flash" driver in the MD majors files?
   isn't this an MI driver?

 - in flash.c, most of the body of flash_init() is ifdef'd out,
   what's up with that?

 - we talked before about removing the "xip" mount option and enabling
   this functionality automatically when it's available, which you did
   but then recently changed back.  so I'll ask again,
   why do we need a mount option?

 - as we've discussed before, I think that the XIP genfs_getpages()
   should be merged with the existing one before you merge this into
   -current.  merging it as-is would make a mess, and it's better to
   avoid creating messes than than to promise to clean them up later.
   we've probably already spent more time arguing about it than
   it would have taken you to do it.

 - similarly, I don't think that we should have a separate XIP putpages.
   why doesn't the existing genfs_do_putpages() work for XIP pages as-is?

 - in getpages, rather than allocating a page of zeros for each file,
   it would be better to use the same page of zeroes that mem.c uses.
   if multiple subsystems need a page of zeroes, then I think that
   UVM should provide this functionality rather than duplicating
   that all over the place.  if you don't want to update all the copies
   of mem.c, that's fine, but please put this in UVM instead of genfs.

 - I don't remember if I asked about this before: is there some reason
   for there to be XIP code in the fs-specific code?  couldn't this be
   done in the VFS and vnode code instead so that it works for all
   file system types instead of just FFS?

 - as you guessed in your mail, I don't like the new PQ_FIXED flag either.
   it looks like it's only used for preventing the XIP pages from
   being put on the paging queues, but that would be better done by
   just initializing pg->wire_count to 1 so that UVM treats them
   as permanently non-pageable (which they actually are).



that's all I noticed in the current version of the code.
I haven't yet gone back and looked at my previous comments to see if
I overlooked anything this time around, I'll do that tomorrow.

needless to say, I don't think this work is ready to be merged yet.



now here's the explanation I promised for how to treat XIP pages
as unmanaged instead of managed.

first, some background for other people who don't know all this:
the only reason that treating XIP pages as managed pages is
relevant at all is because of the AMAP_SHARED flag in UVM,
which allows anonymous memory to be shared between processes
such that the changes made by one process are seen by the other.
this impacts XIP pages (which are not anonymous) because a
PROT_WRITE, MAP_PRIVATE mapping of an XIP vnode should point to
the XIP pages as long as all access to the mapping is for reads,
but when the mapping is written to then the XIP page should be
copied to an anonymous page (the normal COW operation) but that
new anonymous page should still be shared between all processes
that are sharing the AMAP_SHARED mapping.  to force those other
processes to take another page fault the next time they access
their copy of the mapping (which we need to do so that they will
start accessing the new anonymous page instead of the XIP page),
we must invalidate all the other pmap entries for the XIP page,
which we do by calling pmap_page_protect() on it.  the pmap layer
tracks all the mappings of the page and thus it can find them all.

there are several ways that the AMAP_SHARED functionality is used,
and unfortunately they would need to changed in different ways to
make this work for XIP pages without needing to track mappings:

 - uvm_io(), which copies data between the kernel or current process
   and an arbitrary other process address space.
   currently this works by sharing the other address space with
   the kernel via uvm_map_extract() and then just using uiomove()
   to transfer the data.  this could be done instead by using part
   of the uvm_fault() code to find the physical page in the other
   address space that we want to access and lock it (ie. set PG_BUSY),
   map the page into the kernel (perhaps using uvm_pager_mapin()),
   transfer the data, then unmap the page and unlock it.

 - uvm_mremap(), which resizes an existing mapping.
   this uses uvm_map_extract() internally, which uses AMAP_SHARED,
   but the mremap operation doesn't actually need the semantics of
   AMAP_SHARED since as mremap doesn't create any additional mappings
   as far as applications are concerned.  the usage of AMAP_SHARED
   is just a side-effect of the current implementation, which bends
   over backward to call a bunch of existing higher-level functions
   rather than doing something more direct (which would be simpler
   and more efficient as well).

 - MAP_INHERIT_SHARE, used to implement minherit().
   this is the one that is the most trouble, since it's what AMAP_SHARED
   was invented for.  however, it's also of least importance since
   some searching with google finds absolutely no evidence of
   any application actually using it, just lots of copies of
   the implementations and manpages for various operating systems.

   with that in mind, there are several ways this could be handled.
        (1) just drop support for minherit() entirely.
        (2) reject attempts to set MAP_INHERIT_SHARE on XIP mappings.
        (3) copy XIP pages into normal anon pages when setting
                MAP_INHERIT_SHARE.
        (4) copy XIP pages into normal vnode pages when setting
                MAP_INHERIT_SHARE.  this would mean that the getpages
                code would need to look in the normal page cache
                before using XIP pages.  I think this option would also
                need getpages to know about the inherit flag to
                correctly handle later faults on XIP mappings,
                and there are probably other sublte complications.

   of these choices, (2) sounds like the best compromise to me.


this approach would also bring back some issues where our previous
discussion went around in circles, such as callers of VOP_GETPAGES()
wanting vm_page pointers but XIP pages not having them, but
I'll leave that additional discussion for future email if necessary.

I'm just going over all this now so that it's clear to everyone
that this kind of approach is possible if the memory overhead of
the full vm_page structures for XIP pages is deemed too high.


-Chuck
Follow-Ups:
- AMAP_SHARED (was Re: XIP)
  - From: Masao Uebayashi
- Re: XIP
  - From: Masao Uebayashi
References:
- XIP
  - From: Masao Uebayashi
Prev by Date: Re: Capsicum: practical capabilities for UNIX
Next by Date: Re: XIP
Previous by Thread: Re: XIP
Next by Thread: Re: XIP
Indexes:
Home | Main Index | Thread Index | Old Index