tech-kern: Re: Generic Properties

Subject: Re: Generic Properties
To: None <thorpej@wasabisystems.com>
From: None <eeh@netbsd.org>
List: tech-kern
Date: 10/06/2001 01:32:22
| [ I'm re-adding tech-kern to the CC, because I really want to keep this
|   discussion there... ]

Hm.  Seems like your mailer is doing something funny w/the Cc: field.  
/usr/bin/mail can't find it.

| On Fri, Oct 05, 2001 at 09:47:47PM -0000, eeh@netbsd.org wrote:
|
|  > | 	- If "store" is NULL, the backing store for the object is
|  > | 	  allocated when the prop is first set with prop_set().
|  > | 	  Otherwise, the provided backing store is used for the
|  > | 	  object.  This can be used for things like variables that
|  > | 	  only need to be noticed when they're used, like some of
|  > | 	  the TCP tunables (e.g. "tcp_cwm").
|  > 
|  > May as well set the value with the same semantics as prop_set() and save
|  > an extra prop_set() call and all the overhead that can entail.
|
| If you want a prop_set() implied with each prop_create(), that creates
| an interesting problem.  Note in my definition of prop_create(), I said
| that if backing store was provided by prop_create(), then this backing
| store will always be used for the property value.  Note that this may not
| be what you want for an "implied prop_set()".  For the vast majority (all?)
| of sysctl-like properties, the prop_create() call will be providing backing
| store, and thus will be implicitly set.  For others, I don't think the two
| calls is that big of a deal.

If you don't provide the storage then you need to provide a value.  May as
well handle it in prop_create, since you're providing the `type' information
and the `size'.

|  > I presume the semantics of PROP_CONST would have to change then?  Otherwise
|  > a prop_set() with PROP_CONST set will replace the storage location.
|
| Aha... my brain is still thinking "CONST" means "cannot be changed", i.e.
| prop_set() is not allowed on this property -- it is read-only.  I would
| really prefer this be the case.  Yes, we would need an additional flag,
| such as "PROP_EXTSTORE", for internal use only, that would indicate that
| the property backing store is maintained outside the database (such as
| would be the case with virtually all sysctl-like properties).

The original purpose for `PROP_CONST' was as a space optimization for things
like constant strings and such.  That can easily be changed.

|  > | I think this is all doable, and I think it will pretty much address
|  > | everyone's concerns, and can be done without too much hackery to
|  > | the current subr_prop.c.
|  > 
|  > We still don't have a specification for:
|
| Ok, ignoring the pieces that are still missing, do you find my suggestions
| reasonable?

Fine so far, although I'm undecided whether this functionality should be
part of the property infrastructure or layered on top of it.  But that can
be decided later.

|
|  > 	* A syscall interface
|
| I'm hand-waving over this right now... we can deploy the infrastructure
| in the kernel and provide backward-compatible sysctl interface (which
| we have to do ANYWAY), and doing this wouldn't even be that hard.
|
|  > 	* Access protection for individual properties
|
| See my comment about PROP_CONST.
|
|  > 	* A name/MIB space
|
| Err, I'm not sure what you mean, here.  As for a "name space", the
| name space would not be all that different than what we currently
| have for sysctl ... and you can "privatize" parts of it using the
| extension to propdb_create() that I suggested.

sysctl(2) currently uses a namespace consisting of a variable-length
array of integers, hence a MIB namespace.  sysctl(8) translates those
values to/from strings.

Do we maintain a MIB namespace or go to a set of strings?  ISTR some
people were extolling the performance benefits of integers; however,
a string namespace is extensible in ways that MIBs are not.
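
To make the trade-off concrete, here is a rough sketch of the kind of
name-to-MIB translation a userland tool has to do when the kernel
namespace is integers.  It is not taken from sysctl(8); the table
layout, the helper name name_to_mib() and the numeric ids are all
made up for illustration.  The point is only that every new node needs
both an allocated integer and a table entry before it can be named.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one name and its integer id at one level. */
struct mib_name {
	const char *name;
	int id;
	const struct mib_name *children;	/* NULL for a leaf */
};

/* Illustrative values only; not copied from any real header. */
static const struct mib_name kern_names[] = {
	{ "ostype", 1, NULL },
	{ "osrelease", 2, NULL },
	{ NULL, 0, NULL }
};
static const struct mib_name top_names[] = {
	{ "kern", 1, kern_names },
	{ "hw", 6, NULL },
	{ NULL, 0, NULL }
};

/*
 * Translate a dotted name such as "kern.ostype" into an integer vector.
 * Returns the number of ids stored in mib[], or -1 if a component is
 * unknown or mib[] is too small.
 */
static int
name_to_mib(const char *name, int *mib, size_t maxlen)
{
	const struct mib_name *level = top_names;
	size_t n = 0;

	assert(name != NULL && mib != NULL);
	while (*name != '\0') {
		size_t len = strcspn(name, ".");
		const struct mib_name *e;

		if (level == NULL || n == maxlen)
			return -1;
		for (e = level; e->name != NULL; e++)
			if (strlen(e->name) == len &&
			    strncmp(e->name, name, len) == 0)
				break;
		if (e->name == NULL)
			return -1;	/* not in the compiled-in table */
		mib[n++] = e->id;
		level = e->children;
		name += len + (name[len] == '.');
	}
	return (int)n;
}
```

With a string namespace the kernel does this matching itself, so a
node added at run time is reachable without touching any such table.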


The following group are rather closely interrelated:

|  > 	* Methods to traverse the sysctl tree
|
| Yah, I have thought about this .. it would be done a lot like how fts(3)
| is done ... "get a list of the nodes here, then descend into each node".
| If a node disappears before you get to it, so what... just return an error
| or skip it or whatever.  This can be implemented almost entirely in userland
| (need a way to query "which nodes are here?", and that's it).
|
|  > 	* Atomic query/update (so the object is not replaced while
|  > 	  you are trying to update it).
|
| Give me an example of what you want to do, here.  There are obvious
| locking issues here... although, we could add a "property" file descriptor
| on which one can do flock-style locking.

While you are busy walking the tree one node at a time from userland,
some kernel function or some other process could remove the node you're
looking at (just got a reference to/just got the name of) and replace 
it with a completely different node.  That is the locking issue, and
can be handled in a number of ways.

I think the best way would be to have the kernel sysctl() routine
do a complete lookup for each access, matching the name of every node
along the path.  But that is more difficult if you use integer values
for the node identifiers.

|
|  > 	* Handling databases with completely dynamic object space,
|  > 	  such as "kern.proc.*"
|
| See the "get" method that's passed in to prop_create().

If you are walking the tree from userland, rather than doing complete
namespace traversal for each call, you would first look up the database
for "kern", then look up the database for "proc", then look up the
database for "pid", and finally look up the `512' property.  sysctl_proc()
really wants
to install itself to intercept the query for the `512' property, but that
would require creating a `512' node, and a node for every other active
process.  A much better place to do the interception would be at the "pid"
node, but that would require interposing a different class of object at 
the `512' level that implements all the database operations.

|
|  > The last issue is probably the nastiest.  If you want to query "kern.proc.pid.512",
|  > you either need to intercept the request at "pid" and finish it off, or you need to
|  > create (and destroy) a property for every single process on the machine on fork()
|  > (and exit()).  Same for querying vnodes, mbufs, etc.
|
| There's no reason you couldn't do this (intercept the request at...),
| actually.  The node lookup could be implemented like pathname lookup,
| consuming as much of the path as it wants each time.

Once you get to this level of complexity I think you have to start to
re-think the idea a bit more.  Each property in the database now has two
additional pointer fields, which are probably unused in the majority
of properties.  (They will probably only be used in leaf nodes.)

Walking the tree is becoming a complicated operation.  You need to 
recursively look up the `propdb' property for each element in the name, 
call the `get' method on it, and use the results to look at the next level.
The `get' method is now allowed to consume more than just one level of the
name, information that needs to be passed back to the walking routine.
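
A rough sketch of what such a walk might look like, just to show the
extra bookkeeping.  None of this is the subr_prop.c interface: struct
pnode, the shape of the `get' method and prop_walk() are hypothetical.
Each method returns how many bytes of the name it consumed, and sets
*next to NULL if it completed the query itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct pnode;

/*
 * Hypothetical `get' method.  `rest' is the part of the name not yet
 * consumed.  Returns the number of bytes consumed (including any '.'
 * separators), or -1 on error.  On success *next is the node to
 * continue from, or NULL if the method stored the final result in *val.
 */
typedef int (*get_fn)(struct pnode *self, const char *rest,
    struct pnode **next, long *val);

struct pnode {
	const char *name;
	get_fn get;
	struct pnode *children;	/* array terminated by name == NULL */
	long value;
};

/* Ordinary method: match exactly one component against the children. */
static int
default_get(struct pnode *self, const char *rest, struct pnode **next,
    long *val)
{
	size_t len = strcspn(rest, ".");
	struct pnode *c;

	(void)val;
	for (c = self->children; c != NULL && c->name != NULL; c++)
		if (strlen(c->name) == len && strncmp(c->name, rest, len) == 0) {
			*next = c;
			return (int)len + (rest[len] == '.');
		}
	return -1;
}

/* Method on "proc": consumes two levels at once, "pid" and the number. */
static int
proc_get(struct pnode *self, const char *rest, struct pnode **next,
    long *val)
{
	char *end;
	long pid;

	(void)self;
	if (strncmp(rest, "pid.", 4) != 0)
		return -1;
	pid = strtol(rest + 4, &end, 10);
	if (end == rest + 4 || *end != '\0' || pid < 0)
		return -1;
	*val = pid;		/* stand-in for looking up the process */
	*next = NULL;
	return (int)(end - rest);
}

static struct pnode kern_children[] = {
	{ "maxproc", default_get, NULL, 1044 },
	{ "proc", proc_get, NULL, 0 },
	{ NULL, NULL, NULL, 0 }
};
static struct pnode root_children[] = {
	{ "kern", default_get, kern_children, 0 },
	{ NULL, NULL, NULL, 0 }
};
static struct pnode prop_root = { "", default_get, root_children, 0 };

/* Walk `path' from `cur', letting each method decide how much it eats. */
static int
prop_walk(struct pnode *cur, const char *path, long *val)
{
	while (*path != '\0') {
		struct pnode *next = NULL;
		int used;

		assert(cur->get != NULL);
		used = cur->get(cur, path, &next, val);
		if (used < 1)
			return -1;
		path += used;
		if (next == NULL)	/* the method completed the query */
			return (*path == '\0') ? 0 : -1;
		cur = next;
	}
	*val = cur->value;
	return 0;
}
```

Here prop_walk(&prop_root, "kern.proc.pid.512", &v) never sees a `pid'
or `512' node; the method on "proc" swallows both levels, and the
walker only learns that from the return value.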

It would make more sense to layer the sysctl() access routines on top
of the property routines, or even use a completely different data structure
for the sysctl tree and use the property routines at the lowest levels
to access the actual data.  I'm not even sure they're a good match for
that.


I suppose if I were designing a sysctl replacement, I would replace the
current MIBs with a single string delimited by some character.  Then add
some routines to register each sysctl level, something like:

sysctl_add(char *parent, char *name, int mib /* compat */, void *data, 
	size_t size, int type, int (*handler)(...));
sysctl_get(char *path, void *buffer, size_t size, int *type);
sysctl_set(char *path, void *buffer, size_t size, int type);

Each call to sysctl_add() adds a node to the tree.  When processing
a query, each node is traversed.  If a handler is registered, it is
called, and can complete the query.  If it gets to the end of the
input string, it copies `size' bytes to/from `data' as appropriate,
if allowed by the protections encoded in the `type' field.  If no
handler is registered and no buffer is registered in the node it
finds when it gets to the end of the input string, sysctl_get()
could return the names of all the child nodes at that level.

(N.B. you also need a sysctl_remove().)
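
For what it's worth, a minimal userland sketch of the registration and
query side might look like the following.  It is only an illustration
under several assumptions: the handler signature is invented (the
prototypes above leave it open), the compat `mib' argument is dropped,
the `type' values are hypothetical, and there is no locking, duplicate
checking, sysctl_set() or sysctl_remove().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical values for the `type' field. */
enum { SYSCTL_T_NODE, SYSCTL_T_INT, SYSCTL_T_STRING };

/* Assumed handler shape: it receives the unconsumed tail of the path. */
typedef int (*sysctl_handler_t)(const char *rest, void *buf, size_t size,
    int *type);

struct sysctl_node {
	char name[32];
	void *data;
	size_t size;
	int type;
	sysctl_handler_t handler;
	struct sysctl_node *child, *sibling;
};

static struct sysctl_node sysctl_root;

/* Walk `path' from the root, stopping early at a node with a handler. */
static struct sysctl_node *
sysctl_lookup(const char *path, const char **rest)
{
	struct sysctl_node *n = &sysctl_root, *c;

	while (*path != '\0' && n->handler == NULL) {
		size_t len = strcspn(path, ".");

		for (c = n->child; c != NULL; c = c->sibling)
			if (strlen(c->name) == len &&
			    strncmp(c->name, path, len) == 0)
				break;
		if (c == NULL)
			return NULL;
		n = c;
		path += len + (path[len] == '.');
	}
	*rest = path;
	return n;
}

int
sysctl_add(const char *parent, const char *name, void *data, size_t size,
    int type, sysctl_handler_t handler)
{
	const char *rest;
	struct sysctl_node *p = sysctl_lookup(parent, &rest), *n;

	assert(name != NULL);
	if (p == NULL || *rest != '\0' || name[0] == '\0' ||
	    strchr(name, '.') != NULL || strlen(name) >= sizeof(n->name))
		return -1;
	if ((n = malloc(sizeof(*n))) == NULL)
		return -1;
	*n = (struct sysctl_node){ .data = data, .size = size, .type = type,
	    .handler = handler, .sibling = p->child };
	strcpy(n->name, name);
	p->child = n;			/* newest child is listed first */
	return 0;
}

int
sysctl_get(const char *path, void *buf, size_t size, int *type)
{
	const char *rest;
	struct sysctl_node *n = sysctl_lookup(path, &rest), *c;
	char *out = buf;
	size_t used = 0;

	if (n == NULL)
		return -1;
	if (n->handler != NULL)		/* the handler completes the query */
		return n->handler(rest, buf, size, type);
	if (n->data != NULL) {		/* leaf: copy the registered bytes */
		if (size < n->size)
			return -1;
		memcpy(buf, n->data, n->size);
		*type = n->type;
		return 0;
	}
	if (size == 0)
		return -1;
	out[0] = '\0';			/* neither: list the child names */
	for (c = n->child; c != NULL; c = c->sibling) {
		int w = snprintf(out + used, size - used, "%s%s",
		    used ? " " : "", c->name);
		if (w < 0 || (size_t)w >= size - used)
			return -1;
		used += (size_t)w;
	}
	*type = SYSCTL_T_NODE;
	return 0;
}
```

The root is named by the empty string, so sysctl_add("", "kern", ...)
creates a top-level node.  A handler registered at "kern.proc.pid"
would be called with "512" as `rest' for a query on
"kern.proc.pid.512", so no per-process node is ever created.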

But in any case, sysctl would need to be aware of and enforce
significantly more policy, whether in the base code or through
handlers, than would the property framework.

Eduardo