Subject: Re: more on disk stats
To: Charles Hannum <Charles-Hannum@deshaw.com>
From: Dennis Ferguson <dennis@Ipsilon.COM>
List: tech-kern
Date: 11/16/1995 15:54:38
>    (2) the total lack of atomicity when reading a table,
>
> We can't guarantee atomicity for large transfers anyway.  Note:
>
> 1) This would imply that the table is locked for the duration of a
> read.  This is impractical for things like network and other I/O
> transaction tables, for performance reasons.  It's not just

To which I would respond

(1) I'm certainly not arguing for atomic reads of all kernel tables,
    just for retaining the ability to accommodate this for those few
    where a process-level consumer of the information needs it.  In
    fact it is you who seems to be taking the extreme view that it is
    *never* necessary to do this (since SNMP can't); I'm only interested
    in those cases where I think an atomic table read is necessary for
    proper operation.

(2) You cite "performance reasons" as making this "impractical" (in all
    situations?), and give network I/O as an example.  Yet I'd point
    out that doing a full read of the routing table, and interface
    list, already delays all network I/O through the kernel (see
    net/rtsock.c, in sysctl_rtable()), and I've rewritten code for real
    life routers which used that system call when the size of that table
    was about 80% of the entire kernel.  In this particular case I know
    of *no* adverse performance impact of doing this which matters, even in
    worst-case practice; in fact, just the opposite.  If the routing
    implementation needs to do something it is a whole lot better to
    let it get it done as quickly as possible, even at the expense of
    dropping a few packets, rather than holding it off indefinitely
    while forwarding a perhaps never-ending stream of packets.  The
    latter is a ticket to spectacular failures (I've seen this too).

    One might make a different argument about this in the case of hosts,
    rather than routers, but in the host case the table is normally very
    small anyway so who cares how you do it?

(3) Again, allowing atomic reads of some tables doesn't prohibit row-by-row
    reading of other tables, or even the same table.  What I'm suggesting
    is that sometimes doing atomic reads of tables is necessary and
    appropriate, and that a mechanism which precludes ever doing this is
    a bad idea.

> 2) Neither the current tools nor any competing proposal guarantees
> atomicity.

See above.  Reading the routing table and the interface list is already
atomic.

>   (3) the need to support getnext operations for all tables (sometimes this
>       is easy, but other times it is unnecessarily hard),
>
> Almost all tables in the kernel are either flat, linked lists, or
> trees.  For all of these, implementing the `getnext' operation is
> fairly trivial.

I've written an efficient getnext for radix trees (see rt_radix_getnext(),
in rt_radix.c in the gated source) and would dispute your definition of "fairly
trivial" (it would be "fairly trivial" to do this by locking the tree
against changes and then walking it, but if you're willing to do this then
it is inconsistent to be complaining about atomic reads of the table based
on performance concerns).  What bugs me most is not so much the amount of
code, but rather the fact that it is very hard to do well without adding extra
data structure only needed for this operation, even though the size of that
data structure is already a problem.

Nevertheless I still wouldn't mind this too much as long as you don't
remove all possibility of doing atomic reads of the same table when that
is an appropriate alternative required by an application.
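
To make the trade-off concrete, here is a minimal sketch of the simple
kind of getnext, on a sorted linked list.  All names (struct row,
list_getnext(), list_walk()) are hypothetical illustrations, not code
from any kernel or from gated.  Each call re-walks the list from the
head, so reading n rows costs on the order of n*n steps, and nothing
here prevents the list from changing between calls unless the caller
holds it locked for the whole walk.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical table row; rows are kept sorted by ascending key. */
struct row {
	int key;
	struct row *next;
};

/*
 * Return the first row whose key is strictly greater than `key`,
 * or NULL if there is none.  Every call starts again at the head.
 */
static const struct row *
list_getnext(const struct row *head, int key)
{
	const struct row *r;

	for (r = head; r != NULL; r = r->next)
		if (r->key > key)
			return r;
	return NULL;
}

/*
 * Read the whole table one row at a time and return the row count.
 * Assumes every key is greater than INT_MIN.
 */
static int
list_walk(const struct row *head)
{
	const struct row *r;
	int key = INT_MIN, n = 0;

	while ((r = list_getnext(head, key)) != NULL) {
		assert(r->key > key);	/* each step must make progress */
		key = r->key;
		n++;
	}
	return n;
}
```

Avoiding the repeated walk from the head means remembering a position
across calls, and that position has to stay valid when rows are added
or deleted in between.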

>   (4) the fact that you are going to have difficulty doing things which the
>       corresponding SNMP MIB didn't consider you might want to do.
>
> No; you just have to extend the MIB.

In fact you can't extend the MIB to support atomic reads of full tables,
particularly when performance is very important.  This is a transport issue.

>   I can think of some examples of all of these.  For netstat(1), or some
>   other interested piece of software, to read the kernel routing table
>   now requires about 3 system calls: a sysctl(2) to find out the size of
>   the thing, a call to sbrk()/mmap() to acquire the (possibly very large
>   chunk of) memory, and another call to sysctl(2) to fetch an atomic snapshot
>   of the table.
>
> First of all, this is an oversimplification.  There are actually a few
> system calls done per route, as you can see just by looking at the
> function p_rtentry() or ktrace output.

Hardly.  In fact I don't care if netstat(1) makes mistakes, I'll just
run it again.  I care a lot if my routing protocol implementation (i.e.
"other interested piece of software") makes mistakes, however, and for
that the process described is no over-simplification at all.
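
The size-then-fetch sequence can be sketched roughly as follows.  This
is a portable simulation, not the real interface: fake_sysctl_table()
and fake_table are hypothetical stand-ins for the sysctl(2) call and
the kernel table, and read_table_snapshot() is an illustrative name.
The assumption modelled here is that a NULL buffer yields the required
size, and that a too-small buffer fails with ENOMEM, so the caller
retries if the table grew between the two calls.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel table: three NUL-separated rows. */
static const char fake_table[] = "route-a\0route-b\0route-c";

/*
 * Hypothetical stand-in for sysctl(2).  With buf == NULL it reports
 * the size needed; otherwise it copies the whole table in one call,
 * or fails with ENOMEM if the buffer is too small.
 */
static int
fake_sysctl_table(void *buf, size_t *lenp)
{
	size_t need = sizeof(fake_table);

	if (buf == NULL) {
		*lenp = need;
		return 0;
	}
	if (*lenp < need) {
		errno = ENOMEM;
		return -1;
	}
	memcpy(buf, fake_table, need);
	*lenp = need;
	return 0;
}

/*
 * Ask for the size, allocate, fetch the snapshot; retry if the table
 * grew in between.  The caller frees the returned buffer.
 */
static void *
read_table_snapshot(size_t *lenp)
{
	assert(lenp != NULL);
	for (;;) {
		size_t len;
		void *buf;

		if (fake_sysctl_table(NULL, &len) == -1)
			return NULL;
		if ((buf = malloc(len)) == NULL)
			return NULL;
		if (fake_sysctl_table(buf, &len) == 0) {
			*lenp = len;
			return buf;
		}
		free(buf);
		if (errno != ENOMEM)
			return NULL;
	}
}
```

The point of the pattern is that the consumer gets the whole table from
a single fetch call, rather than assembling it from many per-row calls.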

> Secondly, even if the above weren't the case, the snapshot is not
> atomic.  There is no lock on the tables, and no lock on the memory
> they're being copied to, and network interrupts are not blocked while
> the table is copied.  As I said above, this would be undesirable for
> performance reasons.

Wrong (are we looking at the same kernel?).  The snapshot produced by
sysctl(2) is certainly atomic.  Network hardware interrupts aren't blocked,
of course, but the whole operation is run at splsoftnet() which effectively
blocks changes (net/rtsock.c, in sysctl_rtable()).  And I directly disagree
with your opinion of what is "undesirable for performance reasons" based
on experience with using this stuff in hard situations.

>   In any case, while making operations where it makes sense more SNMP-like
>   would be fine, particularly where SNMP is a frequent consumer, I think
>   there are a lot of cases where SNMP just gets in the way.  I'd rather
>   have the flexibility to do what is right, given a knowledge of how the
>   most frequent non-SNMP consumers of the data use it if they are important,
>   rather than to be limited to SNMP's one-size-fits-all constraints, standard
>   or not.
>
> I don't see how this is anything but an adverse reaction to my use of
> the term `SNMP'.  I'm using SNMP as a transport mechanism for
> information, and as such, it does not restrict the information we can
> make available.  Implementing dozens of ad-hoc solutions (which is
> what we currently do) is clearly a much worse scenario.

I'm reading my paragraph, and the first sentence in your paragraph, and
as far as I can tell your comment is nearly a non sequitur.  I generally
*like* SNMP; what I'm objecting to is its *exclusive* use as a transport
mechanism, based not so much on what information can be presented as on
the constraints on how the information can be supplied.  In particular
there is still a need for a very fast method for doing atomic reads of
full kernel tables, for those (few) tables where the applications which use
them need this, and the worst-case scenario seems to me to be having no
ability to do this at all based on some over-constrained view of how the
data should be transported.  And I am very sure that you've got the
performance argument for those two network tables in particular exactly
backwards; this is exactly a situation where the best system performance
is obtained by letting the application have a full, accurate snapshot
as quickly as possible with a minimum of fooling around, even at the
expense of delaying a packet or two.

Doing atomic reads of some tables is still necessary.

Dennis Ferguson