Subject: Re: improving kqueue
To: Matthew Mondor <>
From: Bill Studenmund <>
List: tech-kern
Date: 09/21/2006 10:35:45
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Sep 21, 2006 at 08:54:03AM -0400, Matthew Mondor wrote:
> If I understand (from on and off-list discussion), you want to receive
> notifications for a whole file hierarchy, while kqueue only allows you
> to listen for events for one directory/file per open file descriptor,
> and closing the descriptor of course discards any filter for it.
> This means that to implement something like fam(8) would mean
> recursively scanning a tree and opening a very high number of
> descriptors, which then becomes more problematic than occasionally
> recursively rescanning the whole tree using fstat(2) on every file at
> the expense of less real-time notification...

And it would waste a lot of kernel resources.

> With the current kqueue API we have, it wouldn't be possible to include
> a list of inodes for which you would like notification for a single
> filter.  However, it would be possible to allow a flag to be ORed
> meaning that you also want to be notified for all objects under a
> specific directory.

I think the thing to do here is take a page from the SVR4 book.

We add a field to struct vnode that indicates what "watcher" is watching
it. Then when we look up a vnode in a directory, we copy the watcher info
down to it.
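
A rough userspace model of that idea, as a sketch only: the names
toy_vnode, toy_watcher, toy_watch, toy_lookup and toy_is_watched are
made up for illustration and are not the real NetBSD structures or
functions. It shows watcher info being copied from a directory to a
child at lookup time, and an "is this vnode watched?" check that needs
nothing beyond the vnode itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real kernel structures. */
struct toy_watcher {
	int id;				/* who asked for notifications */
};

struct toy_vnode {
	const char *name;
	struct toy_watcher *watcher;	/* NULL means "not being watched" */
};

/* Start watching a vnode, typically the top directory of a tree. */
static void
toy_watch(struct toy_vnode *vp, struct toy_watcher *w)
{
	assert(vp != NULL);
	vp->watcher = w;
}

/*
 * Model of a directory lookup: when child is found in dir, copy the
 * directory's watcher info down to the child (assumption: a child that
 * already has a watcher keeps it).
 */
static struct toy_vnode *
toy_lookup(struct toy_vnode *dir, struct toy_vnode *child)
{
	assert(dir != NULL && child != NULL);
	if (dir->watcher != NULL && child->watcher == NULL)
		child->watcher = dir->watcher;
	return child;
}

/* An event on vp needs reporting only if vp carries a watcher. */
static int
toy_is_watched(const struct toy_vnode *vp)
{
	return vp->watcher != NULL;
}
```

In this model, watching a directory and then looking up a file in it
leaves the file marked with the same watcher, while a vnode that was
never reached through a watched directory stays unmarked.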

> The tricky part with this however would be the kernel implementation.
> All calls to knote(9) in the FS code would then need to be replaced by
> some function which also takes into account all children vnodes, and I'm
> not sure how efficient or easily to implement this would be.  We surely
> cannot hold the whole tree in RAM at all times, so the cache would
> probably be used, I'm not too familiar how it would be done without
> reading more of the source.

No, let's do something like what I mention. The difference is that all
vnodes that are being watched have notes indicating that they are being
watched. If a vnode has no "being watched" indication, it's not being
watched. :-)

> A potentially simpler solution would be to add a new filter type
> allowing to receive notification of all vnode events for a file system.
> This however would probably have to be restricted to the superuser for
> various reasons.  If this existed, however, it would probably be simple
> to have the kernel pass the inode on which the event occurs, as well as
> which type of event, and a single file descriptor could be used per file
> system.  With a system like this a fam(8) replacement (or optimization)
> could be written, perhaps.

Filehandle instead of inode, but yes. We did something like this for

> If this was implemented and worked, there yet would remain another
> problem to solve, however, which is inode to full path name resolution.
> A possible solution would be to have your application maintain in
> userspace a btree or hash table of inode->name entries which it would
> need to populate firsthand and maintain through the course of the
> application as events are received.  This however can probably waste a
> considerable amount of memory.

It's the only thing you really can do.
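
For illustration, a minimal sketch of such a userspace table in C. The
names (ino_map, map_put, map_get, map_del) and the bucket count are
assumptions made for this example. It is keyed by inode number only to
keep it short; a real tool would more likely key on a file handle or a
(device, inode) pair.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define MAP_BUCKETS 1024	/* arbitrary size for the sketch */

struct map_entry {
	ino_t ino;
	char *path;
	struct map_entry *next;
};

struct ino_map {		/* zero-initialize before first use */
	struct map_entry *bucket[MAP_BUCKETS];
};

static size_t
map_hash(ino_t ino)
{
	return (size_t)(ino % MAP_BUCKETS);
}

/* Record or replace the path known for ino. Returns 0, or -1 on ENOMEM. */
static int
map_put(struct ino_map *m, ino_t ino, const char *path)
{
	size_t h = map_hash(ino), len;
	struct map_entry *e;
	char *copy;

	assert(m != NULL && path != NULL);
	len = strlen(path) + 1;
	if ((copy = malloc(len)) == NULL)
		return -1;
	memcpy(copy, path, len);

	for (e = m->bucket[h]; e != NULL; e = e->next) {
		if (e->ino == ino) {	/* already known: replace the name */
			free(e->path);
			e->path = copy;
			return 0;
		}
	}
	if ((e = malloc(sizeof(*e))) == NULL) {
		free(copy);
		return -1;
	}
	e->ino = ino;
	e->path = copy;
	e->next = m->bucket[h];
	m->bucket[h] = e;
	return 0;
}

/* Return the recorded path for ino, or NULL if unknown. */
static const char *
map_get(const struct ino_map *m, ino_t ino)
{
	const struct map_entry *e;

	for (e = m->bucket[map_hash(ino)]; e != NULL; e = e->next)
		if (e->ino == ino)
			return e->path;
	return NULL;
}

/* Forget ino, for example after a delete event. */
static void
map_del(struct ino_map *m, ino_t ino)
{
	struct map_entry **pp = &m->bucket[map_hash(ino)];

	for (; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->ino == ino) {
			struct map_entry *dead = *pp;

			*pp = dead->next;
			free(dead->path);
			free(dead);
			return;
		}
	}
}
```

The application would call map_put() for every file found during the
initial scan, then map_put() or map_del() as events arrive. Each entry
costs one path string plus a small node, which is where the memory
goes on a large tree.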

> The other solution would require a system call which allows to easily
> resolve inode numbers to full path names.  There appears to be a minimal
> implementation of a reverse cache (which if I understand is similar to
> what would be needed), seen in options(4) as NAMECACHE_ENTER_REVERSE.
> According to the manual it isn't very useful yet except for a specific
> use.  I have no idea if this support could be enhanced (which would also
> allow fstat(1) to optionally be able to resolve to names the inodes it
> reports).
> Does anyone, who is more familiar with the iname cache, know if
> improving this reverse mapping support would be viable?  Would the
> memory requirements be as heavy as if it was done in userland?  Would it
> require major changes in VFS?

The biggest problem is that you can't perform a unique file handle (or
inode) to path mapping due to hard links.
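
A small sketch of why, using only stat(2) and open(2). The helper names
(touch_file, same_file, name_count) are invented for this example. Two
names created with link(2) report the same device and inode, so the
inode alone cannot tell you which name was used.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an empty file at path. Returns 0 on success, -1 on error. */
static int
touch_file(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd == -1)
		return -1;
	return close(fd);
}

/*
 * Returns 1 if both paths name the same file (same device and inode),
 * 0 if they do not, and -1 if either stat(2) call fails.
 */
static int
same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Number of names (hard links) referring to path's inode, or -1. */
static long
name_count(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) == -1)
		return -1;
	return (long)sb.st_nlink;
}
```

After touch_file("a") and link("a", "b"), same_file("a", "b") returns 1
and name_count("a") returns 2: one inode, two equally valid paths.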

If we have a directory, we can do a reverse lookup to figure out a path to
root. That's how the getcwd system call works. For files, though, you may
be able to get _a_ path, but you can miss some.

> Also, I would be interested to read some other ideas or pro/cons about
> the proposed new FS-wide kqueue filter.

Let's copy what SVR4 did, with some modernizations.

Take care,

