Subject: Re: improving kqueue
To: None <tech-kern@netbsd.org>
From: Matthew Mondor <mm_lists@pulsar-zone.net>
List: tech-kern
Date: 09/21/2006 08:54:03
If I understand (from on and off-list discussion), you want to receive
notifications for a whole file hierarchy, while kqueue only allows you
to listen for events for one directory/file per open file descriptor,
and closing the descriptor of course discards any filter for it.
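
For reference, the per-descriptor model looks roughly like this in C.
This is only a sketch: watch_path() is a name made up for illustration,
the set of NOTE_* flags is an arbitrary choice, and the kqueue calls are
compiled only on systems that ship <sys/event.h> (elsewhere the function
simply fails with ENOSYS).

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#if defined(__NetBSD__) || defined(__FreeBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#define HAVE_KQUEUE 1
#endif

/*
 * Hypothetical helper: register one EVFILT_VNODE filter for `path` on the
 * kqueue `kq`.  Returns the descriptor that must stay open for as long as
 * the watch is wanted, or -1 on failure.  One call costs one descriptor.
 */
int
watch_path(int kq, const char *path)
{
#ifdef HAVE_KQUEUE
    struct kevent kev;
    int fd, saved;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* The filter is tied to fd; closing fd silently drops the filter. */
    EV_SET(&kev, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
        NOTE_WRITE | NOTE_DELETE | NOTE_RENAME | NOTE_ATTRIB, 0, 0);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
#else
    /* No kqueue on this platform. */
    (void)kq;
    (void)path;
    errno = ENOSYS;
    return -1;
#endif
}
```

Watching a tree of N files this way means N calls and N descriptors
held open for the lifetime of the watcher.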

This means that implementing something like fam(8) would require
recursively scanning a tree and opening a very large number of
descriptors, which becomes more problematic than occasionally
rescanning the whole tree with stat(2) on every file, at the expense
of less real-time notification...

With the current kqueue API, it wouldn't be possible to attach a list
of inodes to a single filter.  However, it would be possible to add a
flag, ORed into the filter flags, meaning that you also want to be
notified for all objects under a specific directory.

The tricky part with this, however, would be the kernel implementation.
All calls to knote(9) in the FS code would then need to be replaced by
some function which also takes into account all children vnodes, and I'm
not sure how efficient or easy to implement this would be.  We surely
cannot hold the whole tree in RAM at all times, so the cache would
probably be used.  I can't say how that would be done without reading
more of the source.

A potentially simpler solution would be to add a new filter type that
delivers notifications of all vnode events for a file system.  This
would probably have to be restricted to the superuser for various
reasons.  If it existed, it would probably be simple to have the kernel
pass the inode on which the event occurs, as well as the type of event,
and a single file descriptor could be used per file system.  With a
system like this a fam(8) replacement (or optimization) could be
written, perhaps.
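
To make the idea concrete, here is what a consumer of such events might
look like.  Everything in this sketch is hypothetical: no such filter
exists, and struct fs_event, the FSEV_* bits and fs_event_describe()
are invented names that only mirror the "inode plus event type" record
described above.

```c
#include <stdint.h>
#include <stdio.h>

/* HYPOTHETICAL event bits; these are not kqueue NOTE_* values. */
enum {
    FSEV_WRITE  = 1 << 0,
    FSEV_DELETE = 1 << 1,
    FSEV_RENAME = 1 << 2,
    FSEV_ATTRIB = 1 << 3
};

/* HYPOTHETICAL record the kernel would hand back for each event. */
struct fs_event {
    uint64_t ino;       /* inode on which the event occurred */
    uint32_t flags;     /* FSEV_* bits describing what happened */
};

/*
 * Format one event as text, e.g. "ino 42: write rename".
 * Returns the number of characters written, not counting the NUL.
 */
int
fs_event_describe(const struct fs_event *ev, char *buf, size_t len)
{
    static const struct {
        uint32_t bit;
        const char *name;
    } names[] = {
        { FSEV_WRITE, "write" },
        { FSEV_DELETE, "delete" },
        { FSEV_RENAME, "rename" },
        { FSEV_ATTRIB, "attrib" }
    };
    size_t i;
    int n;

    n = snprintf(buf, len, "ino %llu:", (unsigned long long)ev->ino);
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        /* Append each set flag while there is still room in buf. */
        if ((ev->flags & names[i].bit) && n >= 0 && (size_t)n < len)
            n += snprintf(buf + n, len - (size_t)n, " %s",
                names[i].name);
    }
    return n;
}
```

Note that the record carries only an inode number, so the consumer
still has to turn that number into a path name by some other means.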

If this were implemented and worked, there would still remain another
problem to solve, which is inode to full path name resolution.
A possible solution would be to have your application maintain in
userspace a btree or hash table of inode->name entries, which it would
need to populate beforehand and keep up to date as events are
received.  This, however, could waste a considerable amount of memory.
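
A minimal sketch of such a userspace table, assuming a fixed-size
chained hash keyed by inode number.  The imap_* names and the bucket
count are made up for illustration, and the sketch keeps one name per
inode, so it ignores hard links.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define IMAP_BUCKETS 4096   /* fixed for brevity; a real table would grow */

struct imap_entry {
    uint64_t ino;
    char *path;                 /* heap copy of the full path name */
    struct imap_entry *next;    /* next entry in the same bucket */
};

/* Must be zero-filled before first use. */
struct imap {
    struct imap_entry *bucket[IMAP_BUCKETS];
};

static size_t
imap_slot(uint64_t ino)
{
    return (size_t)(ino % IMAP_BUCKETS);
}

/* Insert or replace the path for ino; returns 0, or -1 if out of memory. */
int
imap_put(struct imap *m, uint64_t ino, const char *path)
{
    size_t len = strlen(path) + 1;
    struct imap_entry *e;
    char *copy = malloc(len);

    if (copy == NULL)
        return -1;
    memcpy(copy, path, len);
    for (e = m->bucket[imap_slot(ino)]; e != NULL; e = e->next) {
        if (e->ino == ino) {    /* e.g. after a rename event */
            free(e->path);
            e->path = copy;
            return 0;
        }
    }
    e = malloc(sizeof(*e));
    if (e == NULL) {
        free(copy);
        return -1;
    }
    e->ino = ino;
    e->path = copy;
    e->next = m->bucket[imap_slot(ino)];
    m->bucket[imap_slot(ino)] = e;
    return 0;
}

/* Return the stored path for ino, or NULL if it is unknown. */
const char *
imap_get(const struct imap *m, uint64_t ino)
{
    const struct imap_entry *e;

    for (e = m->bucket[imap_slot(ino)]; e != NULL; e = e->next)
        if (e->ino == ino)
            return e->path;
    return NULL;
}

/* Forget ino, e.g. after a delete event. */
void
imap_remove(struct imap *m, uint64_t ino)
{
    struct imap_entry **pp = &m->bucket[imap_slot(ino)];
    struct imap_entry *e;

    while ((e = *pp) != NULL) {
        if (e->ino == ino) {
            *pp = e->next;
            free(e->path);
            free(e);
            return;
        }
        pp = &e->next;
    }
}
```

Each entry costs the full path string plus the node itself, which is
where the memory concern comes from: the same leading directories are
stored again for every file beneath them.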

The other solution would require a system call which resolves inode
numbers to full path names.  There appears to be a minimal
implementation of a reverse cache (which, if I understand correctly,
is similar to what would be needed), seen in options(4) as
NAMECACHE_ENTER_REVERSE.  According to the manual it isn't very useful
yet except for a specific use.  I have no idea whether this support
could be enhanced (which would also allow fstat(1) to optionally
resolve the inodes it reports to names).

Does anyone who is more familiar with the name cache know if
improving this reverse mapping support would be viable?  Would the
memory requirements be as heavy as if it were done in userland?  Would
it require major changes in VFS?

Also, I would be interested to read other ideas or pros/cons about
the proposed new FS-wide kqueue filter.

Thanks,
Matt

-- 
Note: Please only reply on the list, other mail is blocked by default.
Private messages from your address can be allowed by first asking.