Subject: Re: fsync performance hit on 1.6.1
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 07/09/2003 04:36:06
[ On Wednesday, July 9, 2003 at 03:11:19 (-0400), der Mouse wrote: ]
> Subject: Re: fsync performance hit on 1.6.1
>
> >> A new _flat_ namespace for each resource.
> >> A new flat namespace _with human-meaningless names_ for each resource.
> > Have you never heard of inode numbers?   :-)
> 
> Certainly.  How many APIs use them?

Unix filesystems use inodes.  :)

But my point was that a namespace consisting of integers isn't such a
horrible idea.

> > Perhaps you're not aware of ftok(3) and its use to map normal
> > pathnames into IPC identifiers?
> 
> Certainly.  How does it avoid collisions?  (Hint: it can't.  There can
> be, and on large systems not too infrequently are, more files than
> there are possible IPC IDs.)  How is there any excuse for tying this
> new namespace to the existing filesystem namespace?

Note I had not actually ever looked at NetBSD's implementation before,
but I see now that it is far less than ideal, and strictly speaking is
not correct since as far as I can tell it cannot possibly conform to the
full requirements of POSIX 1003.1-2001.

ftok() _should_ always create a unique key to match any unique file and
'id' parameter (and of course _should_ always return the same key for
the same file & 'id').  Most implementations are done in userland (and
POSIX requires that a userland implementation be possible) and like the
half-baked one in NetBSD they use the combined st_ino and st_dev numbers
found from stat()ing the file (and then shift in the low 8 bits of the
'id' parameter so that each file can also represent 2^8 unique keys).
Once upon a time the bits used from each value would probably have
guaranteed a unique key, but since then the values have widened
significantly.
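
To make the collision problem concrete, here is a minimal sketch of that
usual userland approach.  The names pack_key() and sketch_ftok() are
made up for illustration, and the exact bit layout (8 bits of 'id', 8
bits of st_dev, 16 bits of st_ino) is an assumption based on common
BSD-style implementations, not a quotation of any particular libc:

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ipc.h>

/*
 * Hypothetical helper: pack truncated device and inode numbers plus the
 * low 8 bits of 'id' into a 32-bit key.  The layout is an assumption.
 */
key_t
pack_key(unsigned long long dev, unsigned long long ino, int id)
{
	unsigned int k;

	k = (((unsigned int)id & 0xffu) << 24) |
	    (((unsigned int)dev & 0xffu) << 16) |
	    ((unsigned int)ino & 0xffffu);
	return (key_t)k;
}

/* Sketch of an ftok()-like function built on stat(); not a real libc's code. */
key_t
sketch_ftok(const char *path, int id)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return (key_t)-1;
	return pack_key((unsigned long long)st.st_dev,
	    (unsigned long long)st.st_ino, id);
}
```

With this layout any two files on the same device whose inode numbers
differ by a multiple of 65536 get the same key, so for example
pack_key(1, 5, 'a') == pack_key(1, 65541, 'a').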

> It's fine to have that option.  It's broken to have that as the only
> option.  (Whether it should be the default is debatable.)

You can ask for the world, but that doesn't mean you'll get it!  ;-)

There is no "option" to have a file automatically deleted on last close.
You have to unlink() it explicitly and then its allocated storage will
be released on last close.  The unlink() means the file is immediately
invisible to all other processes which do not have it open().  Does this
mean the open()/close()/unlink() API is also fundamentally broken?  I
don't think so!
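
For anyone who hasn't watched this happen, a rough sketch of the
unlink()-while-open behaviour follows.  The function name
readable_after_unlink() and the /tmp template are purely illustrative
assumptions:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if, after unlink(), the name is gone but the data is still
 * readable through the open descriptor; 0 if not; -1 on a setup error.
 */
int
readable_after_unlink(void)
{
	static const char msg[] = "still here";
	char path[] = "/tmp/unlink_demo.XXXXXX";	/* hypothetical name */
	char buf[sizeof msg];
	struct stat st;
	int fd, ok;

	fd = mkstemp(path);		/* create and open a unique file */
	if (fd < 0)
		return -1;
	if (unlink(path) < 0) {		/* remove the name, keep the fd */
		close(fd);
		return -1;
	}
	if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg) {
		close(fd);
		return -1;
	}

	/* No other process can find the file by name any more... */
	ok = (stat(path, &st) < 0 && errno == ENOENT);

	/* ...but the holder of the descriptor can still use it. */
	ok = ok && lseek(fd, 0, SEEK_SET) == 0 &&
	    read(fd, buf, sizeof buf) == (ssize_t)sizeof buf &&
	    memcmp(buf, msg, sizeof msg) == 0;

	close(fd);			/* last close: storage is released */
	return ok;
}
```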

I would say that any API for any complex functionality would be broken
by being too complex to use if it had every imaginable option and
feature.

Just like the files on a memory filesystem, SysV IPC entities persist
until reboot so that they can be used by transient processes and there
exist tools to manage them as necessary.  I.e. their whole API is
sufficient to do everything that's necessary, and it is not bloated with
options and features that would only rarely be used.
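
As a rough illustration of how small that API is, here is a sketch of
the whole life cycle of a SysV message queue: create, send, receive,
and explicit removal.  The names demo_msg and msgq_roundtrip() are
assumptions for the example only, and IPC_PRIVATE is used just to keep
it self-contained (cooperating processes would normally agree on a key):

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Hypothetical message layout: the mandatory type field plus a payload. */
struct demo_msg {
	long mtype;
	char mtext[32];
};

/*
 * Returns 1 if a message made the round trip and the queue was gone
 * after IPC_RMID; 0 if not; -1 if the queue could not be created or
 * removed.
 */
int
msgq_roundtrip(void)
{
	struct demo_msg out = { 1, "hello" }, in;
	struct msqid_ds ds;
	int id, ok;

	id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
	if (id < 0)
		return -1;

	ok = msgsnd(id, &out, sizeof out.mtext, IPC_NOWAIT) == 0 &&
	    msgrcv(id, &in, sizeof in.mtext, 1, IPC_NOWAIT) ==
	    (ssize_t)sizeof in.mtext &&
	    strcmp(in.mtext, "hello") == 0;

	/*
	 * Nothing is released when a process exits; the queue stays until
	 * someone removes it (here, or later with ipcrm(1)).
	 */
	if (msgctl(id, IPC_RMID, NULL) < 0)
		return -1;

	/* After removal the identifier is no longer valid. */
	ok = ok && msgctl(id, IPC_STAT, &ds) < 0;
	return ok;
}
```

If the msgctl(IPC_RMID) call were skipped the queue would show up in
ipcs(1) until removed or until reboot, which is the persistence
described above.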

> > poll() came along to the systems in question quite a bit later than
> > message queues
> 
> ...so?  Taking an OS whose unifying concept is "everything is a file
> descriptor" and creating three new object types that can't have file
> descriptors associated with them is a good way to create problems.

Everything is a file descriptor in unix unless it's an object located at
a memory address.  :-)

Message queues are the only one of the three types that I think could
have logically been identified as file descriptors.  However with
semaphores and shared memory requiring some other kind of "handle" to
allow them to be shared between processes it no doubt made an enormous
amount of sense, especially at the time, to implement message queues in
the same way.

Personally I've always thought that tacking anonymous memory mapping
onto mmap() was a poor hack and apparently the POSIX committees did as
well since even the latest POSIX mmap() spec. doesn't specify MAP_ANON.
Even support of MAP_FIXED is implementation defined.  It's also
theoretically impossible for some implementations to mmap() "large"
files (for some definition of "large") since it is entirely possible
that they would not fit in the address space remaining in even the smallest
process and thus the success of mmap() even for all real files is not
guaranteed.  Is mmap() really that much of a gain over read() and
write() from a conceptual API design point of view given real-world
implementation constraints?  I.e. shared memory is an address, not a
file.

(semaphores as implemented in SysV IPC aren't exactly addresses I
suppose, but they're closer to being addresses at a conceptual level
than they are to being files)

Besides, hindsight is 20/20 and there have now been nearly two decades in
which to look back on and critique the SysV IPC APIs.

> So they not only repeated the mistakes of signals, they repeated the
> mistake of making them unreliable.

No, they did not.  The notifier _can_ be reliably re-installed.

> And this is the API you are holding up as a paragon of goodness??  I
> shudder to imagine your idea of a bad API.

I have no love of the POSIX message queue API.

Actually I'd much rather use SysV message queues than these new POSIX
message queues, even though the former have no means of asynchronous
notification (and no way to peek at their content).  And that's not just
because I've lots of experience using SysV message queues either....

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>