Subject: POSIX shm_open() vs. mmap(MAP_ANON|MAP_SHARED)....
To: Christoph Hellwig <hch@infradead.org>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 07/09/2003 14:17:32
[ On Wednesday, July 9, 2003 at 17:43:30 (+0100), Christoph Hellwig wrote: ]
> Subject: Re: fsync performance hit on 1.6.1
>
> Not trying to defend IEEE here, but there is some sense at least behind
> shm_open.  Given that for shm you really want an object that's not
> backed by permanent storage (= a normal filesystem) you need to know
> where to look for a tmpfs-lookalike or, in the case you mentioned above
> something outside the normal filesystem namespace (yuck!).

You don't need such a concept for mmap(MAP_ANON|MAP_SHARED) -- the
filename is simply a key to the anonymous memory so that multiple
processes can map the same anonymous memory and thus share it.
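
For reference, a minimal sketch of the simplest shared-anonymous case,
assuming a system that accepts mmap(MAP_ANON|MAP_SHARED) with fd=-1.
Here the mapping is shared by inheritance across fork() rather than
located through a filename key, so it illustrates only the "no
tmpfs-lookalike needed" half of the point.  anon_shared_roundtrip() is
a hypothetical helper name, not part of any system API.

```c
#define _DEFAULT_SOURCE 1   /* asks glibc to expose MAP_ANON; harmless elsewhere */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS  /* some systems only provide the longer spelling */
#endif

/*
 * Hypothetical helper: the child stores `value` in a shared anonymous
 * page and the parent returns what it then reads from that page.
 * Returns -1 on any failure, so `value` must be non-negative.
 */
int anon_shared_roundtrip(int value)
{
    assert(value >= 0);

    /* fd = -1: no file, no pathname, no filesystem involved at all. */
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_SHARED, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {
        *cell = value;          /* child writes into the shared page */
        _exit(0);
    }

    int status = 0;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        seen = *cell;           /* parent observes the child's store */
    munmap(cell, sizeof *cell);
    return seen;
}
```

With MAP_PRIVATE in place of MAP_SHARED the parent would read back 0,
since the child's store would land in its own copy-on-write page.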

>  As IEEE
> isn't into the filesystem namespace business shm_open is an okay wrapper
> for leaving this to the implementation.

Oh I agree there's some sense behind shm_open() -- just so long as you
ignore the MAP_ANON jumping up and down and waving its hands and
shouting at you from over in _front_ of the curtain over there....  :-)

> Why the heck they specified shm_unlink is completely unclear to me,
> though.

That one's easy!  ;-)

shm_unlink(), like unlink(), takes a pathname parameter, so given the
fact shm_open() names are strictly outside the normal visible filesystem
space then you need a matching unlink() interface to work in this
private, invisible, namespace.  (or at least you do so long as you don't
also have something like a funlink() call that takes an open file
descriptor as its parameter :-)
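
The name lifecycle described above can be sketched as follows, assuming
a POSIX system with shm_open() available (older glibc needs -lrt at
link time).  The helper shm_name_lifecycle() and the "/shm_demo_<pid>"
name are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEMO_LEN 4096

/*
 * Hypothetical helper: create a named object, find it again by name,
 * remove the name with shm_unlink(), and check that the name is gone
 * while the existing mapping stays usable.  Returns 0 if every step
 * behaved that way, -1 otherwise.
 */
int shm_name_lifecycle(void)
{
    char name[64];
    snprintf(name, sizeof name, "/shm_demo_%ld", (long)getpid());

    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, DEMO_LEN) < 0) {
        close(fd);
        shm_unlink(name);
        return -1;
    }
    char *a = mmap(NULL, DEMO_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (a == MAP_FAILED) {
        shm_unlink(name);
        return -1;
    }
    strcpy(a, "hello");

    /* A second lookup by name, as an unrelated process would do it. */
    int ok = 0;
    int fd2 = shm_open(name, O_RDWR, 0);
    if (fd2 >= 0) {
        char *b = mmap(NULL, DEMO_LEN, PROT_READ, MAP_SHARED, fd2, 0);
        close(fd2);
        if (b != MAP_FAILED) {
            ok = (strcmp(b, "hello") == 0);
            munmap(b, DEMO_LEN);
        }
    }

    /* unlink() cannot reach this namespace, hence shm_unlink(). */
    if (shm_unlink(name) < 0)
        ok = 0;

    /* The name must now be gone ... */
    int fd3 = shm_open(name, O_RDWR, 0);
    if (fd3 >= 0) {
        close(fd3);
        ok = 0;
    } else if (errno != ENOENT) {
        ok = 0;
    }

    /* ... while the mapping made earlier is still intact. */
    if (strcmp(a, "hello") != 0)
        ok = 0;
    munmap(a, DEMO_LEN);
    return ok ? 0 : -1;
}
```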

> Just because it was known that doesn't mean it should be standardized.

Having to invent a dozen new API signatures just to work around the
lack of MAP_ANON, only to end up with the same functionality, is in
fact a very good reason to standardize a far simpler API.  That's why I
say there must have been some very strong politics influencing the
committee members.  Normally these committees are loath to invent new
APIs and the mere fact
that they started down that road when they thought they could do without
MAP_ANON should have suggested to them that they were going in the wrong
direction.  "Oops!  We're inventing something!  Let's go back to that
last fork in the road we took to get here!"  (Of course POSIX.4 seems to
be mostly cut from whole cloth so maybe they didn't share that same
desire to avoid invention in the standardization process.)

> And MAP_ANON really doesn't fit into the SunOS4/SVR4 VM that wants a backing
> vnode for each memory object unlike the Mach VM. 

I don't buy that argument at all.  SysVr4 VM has the concept of
anonymous memory and the swap layer provides the backing store for
anonymous pages.  I suspect forcing anonymous pages to always have the
MAP_PRIVATE attribute was their downfall.  Anonymous pages could have
been made sharable simply by associating a vnode from an ordinary file
descriptor with them -- i.e. there's a vnode but it's not what's mapped,
anonymous memory is mapped and thus the swap layer continues to provide
the backing store.  That's essentially how mmap(MAP_ANON|MAP_SHARED)
works, IIUC -- the filename, if given via an open file descriptor,
simply allows two independent processes to locate and attach the same
anonymous memory object and thus share it (i.e. the kernel does the
equivalent of an ftok() mapping to the object resource ID internally).

In fact SysV SHM is implemented in SysVr4, IIUC, using anonymous pages
that are tagged MAP_SHARED, and which have a reference to the anonymous
object (/dev/zero in brain-dead implementations), but the anonymous
object does not provide their backing store, the swap layer does
instead, as with all anonymous pages.  A trivial implementation of
mmap(MAP_ANON) could have been done with internal calls to shmget() and
shmat() along with a simple additional reference-count table to keep
track of the next available unique ID to use where the mmap() call
did not supply an open file descriptor (i.e. mmap() calls with fd=-1)
and such that shmctl(IPC_RMID) could be called when the process exited.
Such a trivial implementation may end up only allowing 255 truly
anonymous (fd=-1) mappings in the whole system if it were to guarantee
to stay out of the ftok() namespace for any possible filename mapping,
but I think this certainly shows the possibility is/was there to
implement mmap(MAP_ANON) in SysVr4.
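
A rough user-space approximation of that idea, not the in-kernel
implementation described above: it assumes a system where a segment
marked with shmctl(IPC_RMID) survives until its last detach (the usual
behaviour on Linux and the BSDs), and it substitutes IPC_PRIVATE for
the unique-ID table, which sidesteps the ftok() namespace question
entirely.  sysv_anon_shared() and sysv_anon_roundtrip() are
hypothetical helper names.

```c
#define _DEFAULT_SOURCE 1   /* keeps glibc's <sys/ipc.h> quiet in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for mmap(MAP_ANON|MAP_SHARED, fd=-1) built from
 * SysV SHM calls.  Returns NULL on failure.
 */
void *sysv_anon_shared(size_t len)
{
    /* IPC_PRIVATE: a fresh segment that no ftok() key can collide with. */
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;

    void *p = shmat(id, NULL, 0);

    /*
     * Mark the segment for removal right away.  Assumption: it is only
     * destroyed once the last attached process detaches or exits, so no
     * exit-time cleanup hook is needed.
     */
    shmctl(id, IPC_RMID, NULL);

    return (p == (void *)-1) ? NULL : p;
}

/* Child stores `value`; parent returns what it reads back, -1 on failure. */
int sysv_anon_roundtrip(int value)
{
    assert(value >= 0);

    int *cell = sysv_anon_shared(sizeof *cell);
    if (cell == NULL)
        return -1;
    *cell = 0;

    pid_t pid = fork();         /* attachments are inherited across fork() */
    if (pid < 0) {
        shmdt(cell);
        return -1;
    }
    if (pid == 0) {
        *cell = value;
        _exit(0);
    }

    int status = 0;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        seen = *cell;
    shmdt(cell);                /* last detach: the segment goes away */
    return seen;
}
```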

> Thus the horrible
> mmap() of /dev/zero hack, btw..

Hmmm.... yes.  What a stupid idea that was.  :-)  (A NULL vnode pointer
was apparently supposed to suffice such that a /dev/zero vnode was
unnecessary.)

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>