Re: XIP

To: Chuck Silvers <chuq%chuq.com@localhost>
Subject: Re: XIP
From: Masao Uebayashi <uebayasi%tombi.co.jp@localhost>
Date: Sun, 31 Oct 2010 14:27:33 +0900
Hi,

On Sat, Oct 30, 2010 at 06:55:42PM -0700, Chuck Silvers wrote:
> On Wed, Oct 27, 2010 at 06:38:11PM +0900, Masao Uebayashi wrote:
> > On Tue, Oct 26, 2010 at 02:06:38PM -0700, Chuck Silvers wrote:
> > > On Mon, Oct 25, 2010 at 02:09:43AM +0900, Masao Uebayashi wrote:
> > > > I think the uebayasi-xip branch is ready to be merged.
> > > 
> > > hi,
> > > 
> > > here's what I found looking at the current code in the branch:
> > > 
> > > 
> > >  - the biggest issue I had with the version that I reviewed earlier
> > >    was that it muddled the notion of a "managed" page.  you wanted
> > >    to create a new kind of partially-managed page for XIP devices
> > >    which would not be managed by the UVM in the sense that it could
> > >    contain different things depending on what was needed but only that
> > >    the page's mappings would be tracked so that pmap_page_protect()
> > >    could be called on it.  this is what led to all the pmap changes
> > >    the pmaps needed to be able to handle being called with a vm_page
> > >    pointer that didn't actually point to a struct vm_page.
> > > 
> > >    it looks like you've gotten rid of that, which I like, but you've
> > >    replaced it with allocating a full vm_page structure for every page in
> > >    an XIP device, which seems like a waste of memory.  as we discussed
> > >    earlier, I think it would be better to treat XIP pages as unmanaged
> > >    and change a few other parts of UVM to avoid needing to track the
> > >    mappings of XIP page mappings.  I have thoughts on how to do all that,
> > >    which I'll list at the end of this mail.  however, if XIP devices
> > >    are small enough that the memory overhead of treating device pages
> > >    as managed is reasonable, then I'll go along with it.
> > >    so how big do these XIP devices get?
> > 
> > It's waste of memory, yes.  With 64M ROM on arm (4K page, 80byte vm_page),
> > the array is 1.25M.  If vm_page's made a single pointer (sizeof(void
> > *) == 4), the array size becomes 64K.  Not small difference.
> > Typical XIP'ed products would be mobile devices with FlashROM, or small
> > servers with memory disk (md or xmd).  About 16M~1G RAM/ROM?
> > 
> > I made it back to have vm_page to simplify code.  We can make it
> > to vm_page_md or whatever minimal, once after we figure out the
> > new design of "MI vm_page_md".
> > 
> > >    either way, the changes to the various pmaps to handle the fake vm_page
> > >    pointers aren't necessary anymore, so all those changes should be 
> > > reverted.
> > 
> > Those mechanical vm_page -> vm_page_md changes done in pmaps have
> > a valid point by itself.  Passing around vm_page pointer across
> > pmap functions is unnecessary.  I'd rather say wrong.  All pmap
> > needs is vm_page_md.
> > 
> > I'd propose to do this vm_page -> vm_page_md change in pmaps first
> > in HEAD and sync the branch with it, rather than revert it.
> 
> that seems a bit premature, don't you think?
> since you're already talking about redesigning it?
> 
> ... apparently not, you already checked it in.

It's not premature.  It clarifies that passing struct vm_page * to
pmap is unnecessary at all.  We'll need to move those MD PV data
structures to MI anyway.

This is what I'm having in my head:

        struct vm_physseg {
                :
                paddr_t start, end;
                :
                struct vm_page *pgs;
                :
                struct vm_pv *pvs;
                :
        };

        struct vm_pv {
                /* == struct vm_page_md, or what's called "PV head" now */
        };

        struct vm_pv_entry {
                /* == what's called "PV entry" now */
        };

All pmaps are converted to use struct vm_pv array, instead of struct
vm_page::struct vm_page_md.

Here, the page identity is (struct vm_physseg *, off_t).  You can
retrieve the physical address of a given struct vm_page * or struct
vm_pv *, by looking up the matching struct vm_physseg from the
global list (as talked about killing vm_page::phys_addr).

Of course XIP-capable, read-only physical segments have only vm_pv[].
I think tracking PV there for shared amap is worth doing in the
"generic" XIP design.

pmaps are passed struct vm_pv *, not struct vm_page *.  Physical
address is looked up by calling a function.

> 
> 
> > >    it doesn't look like the PMAP_UNMANAGED flag that you added before
> > >    is necessary anymore either, is that right?  if so, it should also
> > >    be reverted.  if not, what's it for now?
> > 
> > pmap_enter() passes paddr_t directly to pmap.  pmap has no clue if
> > the given paddr_t is to be cached/uncached.  So a separate flag,
> > PMAP_UNMANAGED.  Without it, a FlashROM which is registered as
> > (potentially) managed (using bus_space_physload_device) is always
> > mapped to user address as cached, even if it's not XIP.  This made
> > userland flash writer program not work.
> > 
> > The point is whether to be cached or not is decided by virtual
> > address.  Thus such an information should be stored in vm_map_entry
> > explicitly, in theory.
> > 
> > (This is related to framebuffers too.)
> 
> there is already an MI mechanism for specifying cachability of mappings,
> see PMAP_NOCACHE and related flags.

This is fixed, thanks.

> 
> 
> > >  - I mentioned before that I think the driver for testing this using
> > >    normal memory to simulate an XIP device should just be the existing
> > >    "md" driver, rather than a whole new driver whose only purpose
> > >    would be testing the XIP code.  however, you wrote a separate
> > >    "xmd" driver anyway.  so I'll say again: I think the "xmd" should be
> > >    merged into the "md" driver.
> > 
> > It has turned out that the new xmd(4) is not really md(4); it's not
> >     md(4) modified to support XIP
> > but rather
> >     rom(4) emulated using RAM
> > 
> > I plan to merge xmd(4) into rom(4) or flash(4) when it's ready.
> > If you don't like xmd(4) to reserve dev namespace, I'd rather back
> > out xmd(4) totally.
> > 
> > If you don't like the name, suggest a better one. ;)
> 
> I'll respond to this in the other thread you started about it.
> but creating a temporary device driver does seem pretty silly.
> 
> 
> > >    you also have an "xmd_machdep.c" for various platforms, but nothing
> > >    in that file is really machine-specific.  rather than use the
> > >    machine-specific macros for converting between bytes and pages,
> > >    it would be better to either use the MI macros (ptoa / atop)
> > >    or just shift by PAGE_SHIFT, so that there doesn't need to be
> > >    extra code written for each platform.
> > 
> > Well, xmd_machdep.c is not really xmd(4)-specific.  It *is*
> > machine-specific.  It does vtophys(), not good put in MI part.
> > 
> > ... and now I realize xmd_machdep_* should be renamed as pmap_*.
> > I'll fix this.
> 
> and this in the other thread too.

I've made some changes this week.  I've realized that xmd(4) is as
strange as mem(4). :)  I believe its design is clarified now (I'll
fix mdsetimage(8) soon).  The only remaining problem is its name.

> 
> 
> > >  - as we discussed before, the addition of bus_space_physload() and 
> > > related
> > >    interfaces seems very strange to me, especially since the two 
> > > implementations
> > >    you've added both do exactly the same thing (call bus_space_mmap() on 
> > > the
> > >    start and end of the address range, and then call uvm_page_physload()
> > >    with the resulting paddr_t values).  is there any reason this can't be
> > >    a MI function?   also, the non-device versions of these are unused.
> > 
> > I'll move these code to sys/common.
> 
> the location in the source tree isn't the weird part.
> but a common implementation is an improvement.
> 
> 
> > bus_space_physload(9) will be used when a driver wants to register
> > its memory as a general purpose memory (linked into free list).
> > Think PCI RAM over bridges.  I've not written a driver to do this,
> > but it's trivial.
> > 
> > We also have to fix uvm_page_physload() so that it can be called
> > at run-time, not only boot-time.  But this is beyond this branch.
> > 
> > >  - in many files the only change you made was to replace the include of
> > >    <uvm/uvm_extern.h> with <uvm/uvm.h>, what's the reason for that?
> > 
> > To reduce rebuild time when I worked on uvm_page.[ch].  Basically
> > uvm_page.h is internal to UVM.  Touching it outside of UVM is simply
> > wrong.
> > 
> > I'll put it back before the merge, to avoid potential build failures.
> 
> thanks.

Do you agree that exposing uvm_page.h to external world is wrong?
Touch uvm_page.h, rebuild a kernel, then think.  Why do those
rebuilt files care about uvm_page.h definitions?  This is a design
problem, putting it back is just a work-around.

> 
> 
> > >  - we talked before about removing the "xip" mount option and enabling
> > >    this functionality automatically when it's available, which you did
> > >    but then recently changed back.  so I'll ask again,
> > >    why do we need a mount option?
> > 
> > I put it back because otherwise I can't know if a mount is XIP or
> > not explicitly.  Mount option is the only way to know mount condition.
> > This is more a VFS/mount design problem.
> 
> is your desire to control whether the mount accesses the device via mappings,
> or just to be able to see whether or not the device is being accessed via
> mappings?

Both.

My understanding is that mount options change only its internal behavior.
I don't see how MNT_LOG and MNT_XIP differ.  Mount is done only by
admins who know internals.  There may be cases where an XIP device
is much slower than RAM, and page cache is wanted.

I also think that mount option is the only way to configure per-mount
behavior.

> 
> 
> > >  - as we've discussed before, I think that the XIP genfs_getpages()
> > >    should be merged with the existing one before you merge this into
> > >    -current.  merging it as-is would make a mess, and it's better to
> > >    avoid creating messes than than to promise to clean them up later.
> > >    we've probably already spent more time arguing about it than
> > >    it would have taken you to do it.
> > > 
> > >  - similarly, I don't think that we should have a separate XIP putpages.
> > >    why doesn't the existing genfs_do_putpages() work for XIP pages as-is?
> > 
> > This part I'm not convinced.  Your argument can be hardly an excuse
> > for you to not work on merging genfs_getpages() and genfs_do_io(). ;)
> > 
> > If you *really* want me to work on this before the merge (which I
> > do hope not), I'd create a separate branch to refactor genfs totally
> > into smaller pieces as I did for uvm_fault.c.  If you trust my
> > refactoring-fu.
> 
> well, it's been about a decade since I wrote this code so my memory is
> a bit hazy as to how it ended up that way, but from looking at the cvs
> history a bit just now, I think that the getpages and putpages paths
> started off completely separate, and when I added the directio code,
> I split genfs_do_io() out of the putpages code and called it from both.
> yes, it would have been good to merge the getpages version at that point
> too, but there wasn't a point where I took a unified code path and made
> a copy of it.
> 
> and I'm not giving any argument for not merging these functions,
> I'm saying that their existance is not a good excuse to make another copy.

IMO genfs abstraction is wrong in the first place. ;)  All of its code
should belong to either sys/uvm/uvm_vnode.c or sys/kern/vfs_*.c.

> 
> to try to answer my own question about why you have a separate putpages,
> I guess that's to handle the page of zeroes that you have attached to
> each vnode.  without the per-vnode page of zeroes, the existing
> putpages would work fine.

I'll rethink this zero page thing.

> 
> 
> > >  - in getpages, rather than allocating a page of zeros for each file,
> > >    it would be better to use the same page of zeroes that mem.c uses.
> > >    if multiple subsystems need a page of zeroes, then I think that
> > >    UVM should provide this functionality rather than duplicating
> > >    that all over the place.  if you don't want to update all the copies
> > >    of mem.c, that's fine, but please put this in UVM instead of genfs.
> > 
> > I made it per-vnode, because If you share a single physical zero'ed
> > page, when a vnode is putpage'ed, it invalidates all mappings in
> > other vnodes pointing to the zero page in a pmap.  Doesn't this
> > sound strange?
> 
> I think you're misunderstanding something.  for operations like
> msync(), putpage is called on the object in the UVM map entry,
> which would never be the object owning the global page of zeroes.
> from the pagedaemon, putpage is called on a page's object,
> but if the page replacement policy decides to reclaim the page
> of zeroes then invalidating all the mappings is necessary.
> if what you describe would actually happen, then yes,
> I would find it strange.

Give me some time to rethink this zero page thing...

> 
> 
> > Another reason is that by having a vm_page with a uvm_object as
> > its parent, fault handler doesn't need to know anything about the
> > special zero'ed page.
> 
> the fault handler wouldn't need to know about the page of zeroes,
> why do you think it would?  it does need to be aware that the pages
> returned by getpages may not be owned by the object that getpages
> is called with, but it needs to handle that regardless, for layered
> file systems.

Ditto (but my gut feeling is, layered file systems are totally
broken by design.)

> 
> 
> > >  - I don't remember if I asked about this before: is there some reason
> > >    for there to be XIP code in the fs-specific code?  couldn't this be
> > >    done in the VFS and vnode code instead so that it works for all
> > >    file system types instead of just FFS?
> > 
> > This is due to the current VFS design.  I agree that fs-neutral part
> > should be moved to common part.  It's just beyond this branch...
> 
> could you explain what about the VFS design makes this hard?
> 
> I guess I mentioned this mainly because in your earlier mail
> it sounded like this feature would be available for all fs types.

All mount options are processed in VFS_MOUNT()...

And yes, this XIP can be used for all FSes.

> 
> 
> > >  - as you guessed in your mail, I don't like the new PQ_FIXED flag either.
> > >    it looks like it's only used for preventing the XIP pages from
> > >    being put on the paging queues, but that would be better done by
> > >    just initializing pg->wire_count to 1 so that UVM treats them
> > >    as permanently non-pageable (which they actually are).
> > 
> > Are you OK to prevent XIP pages from being loaned? ;)
> > 
> > The actual problems here are
> > 
> >  a) the generic fault handler doesn't know how to handle pages that
> >     never be paged out (device pages), and
> > 
> >  b) uvm loan uses vm_page.
> > 
> > You wanted device pages to be able to be loaned.  I agree the idea,
> > but I'm feeling like that the way the loan "state" is kept
> > (vm_page::loan_count) is wrong.  I've not figured out how to address
> > this yet.
> 
> I've thought about loaning and XIP again quite a bit, but in the interest
> of actually sending this mail today, I'll leave this part for later too.

I spent a little time for this...

Now uvm_loan() returns to a loaner an array of pointers to struct
vm_page.  Which means that those struct vm_page's are shared among
loaner and loanee.  The owner of those pages are recorded in
vm_page::uanon and vm_page::uobject.

I'm thinking this design is broken.  At least A->A loan is impossible
for no reason.  I think loanee should allocate (by kmem_alloc(9))
a new *alias* vm_page, to keep loaning state.  Alias vm_page has
a pointer to its new owner, and with a pointer to the *real* vm_page
too.

uanon/uobject is also bad in that it obscures the 1:1 relationship
of pages and owners.  These two members should be merged into a single
void *pg_owner.  vm_page::loan_count is only for loaner.  Loaners
of read-only, wired pages don't need to remember anything.

Masao

> 
> -Chuck

-- 
Masao Uebayashi / Tombi Inc. / Tel: +81-90-9141-4635
Follow-Ups:
- Re: XIP
  - From: Mindaugas Rasiukevicius
- PQ_FIXED (was Re: XIP)
  - From: Masao Uebayashi
- Re: XIP
  - From: Masao Uebayashi
References:
- XIP
  - From: Masao Uebayashi
- Re: XIP
  - From: Chuck Silvers
- Re: XIP
  - From: Masao Uebayashi
- Re: XIP
  - From: Chuck Silvers
Prev by Date: Re: XIP
Next by Date: AMAP_SHARED (was Re: XIP)
Previous by Thread: Re: XIP
Next by Thread: Re: XIP
Indexes:
Home | Main Index | Thread Index | Old Index