proplib and the jet age

To: tech-userlevel%netbsd.org@localhost
Subject: proplib and the jet age
From: David Holland <dholland-tech%netbsd.org@localhost>
Date: Mon, 31 Dec 2012 20:16:31 +0000

Almost ever since proplib was first imported there's been a steady low
level of grumbling about it, which occasionally erupts into open
arguments. This has gone on for quite a few years now, with the result
that the real problems, of which there are several, have by and large
been sifted out from the casual complaints. There hasn't been any
concerted attempt to address these problems, though, just occasional
random flailing (including on my part) that hasn't really gone
anywhere.

It seems to me that the basic complaints, reflecting real problems,
are the following:

(1) It was never clear whether proplib is supposed to be a data
*transfer* API (that is, data lives in application data structures and
gets loaded into and out of proplib only for shipment) or a data
*storage* API (that is, data lives in proplib and applications use
proplib to access it). As a result, proplib has aspects of both and is
a good solution for neither.

It has become clear in the past few years that NetBSD needs a data
transfer library. (Writing out to disk and reading back again later is
a form of this.) Essentially all the extant and seriously proposed uses
of proplib fall into this category. There is no evidence that we need
another data storage library; we already have db(3), and for cases
where that's not enough nowadays we have sqlite. Ergo, what we want
from proplib is data transfer, and any plans for the future should
take this into account.
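
To make the transfer/storage distinction concrete, here is a minimal
sketch of the transfer style, assuming nothing about proplib itself. The
`Disk` record, the `encode`/`decode` helpers, and the use of JSON as the
wire format are all hypothetical stand-ins chosen for illustration. The
data lives in an ordinary application structure; the generic tree exists
only for the moment of shipment.

```python
import json
from dataclasses import dataclass


@dataclass
class Disk:
    # Hypothetical application record; not a proplib type.
    name: str
    sectors: int


def encode(disk: Disk) -> bytes:
    """Build the generic tree only at the moment of shipment."""
    tree = {"name": disk.name, "sectors": disk.sectors}
    return json.dumps(tree, sort_keys=True).encode("utf-8")


def decode(wire: bytes) -> Disk:
    """Unpack the tree straight back into application data."""
    tree = json.loads(wire.decode("utf-8"))
    return Disk(name=str(tree["name"]), sectors=int(tree["sectors"]))
```

In the storage style, by contrast, the application would keep the tree
around and query it on every access instead of owning a `Disk`.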

(There is also a demand for tools to handle configuration files;
this is a different problem for reasons I'll elaborate on at length if
provoked.)

(2) The data model is poorly considered. It provides a random
selection of atom types without any visible coherent plan, and in
particular it has internal problems with signed vs. unsigned integers.
Meanwhile the composite types are only very basic (arrays and maps
that are limited to string -> T) and there is no model for what kinds
of more complicated structures these can and cannot be assembled into.
(For example, you can use dictionaries to assemble graphs; but it is
far from clear what happens if you try to dump out the results.)
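
The dumping problem can be illustrated with a small sketch. This is not
proplib code; `dump` is a hypothetical naive serializer over plain Python
dicts and lists, with an arbitrary depth limit added so that the cyclic
case fails cleanly.

```python
def dump(node, depth=0, limit=50):
    """Naive tree dumper: recurses into dicts and lists, tracks no identity."""
    if depth > limit:
        raise ValueError("cycle or excessive nesting")
    if isinstance(node, dict):
        items = sorted(node.items())
        body = ",".join(f"{k}:{dump(v, depth + 1, limit)}" for k, v in items)
        return "{" + body + "}"
    if isinstance(node, list):
        return "[" + ",".join(dump(v, depth + 1, limit) for v in node) + "]"
    return repr(node)


shared = {"id": 7}
dag = {"a": shared, "b": shared}   # one dictionary reachable twice
print(dump(dag))                   # prints {a:{id:7},b:{id:7}}

cyclic = {}
cyclic["self"] = cyclic            # a dictionary that contains itself
# dump(cyclic) raises ValueError once the depth limit is exceeded
```

The shared dictionary is written out twice, so a reader of the output
cannot tell that the two copies were one object, and the cyclic case
cannot be written at all without some notion of node identity.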

There are several possible more coherent data models that we could
choose. I'm going to tackle this in more detail below, as it's the
chief question going forward.

(3) The API is a mess at the detail level. It is wordy and generally
cumbersome (hence all the "proppropliblib" jokes). It is not only not
type-safe but actively type-unsafe in a particularly hazardous way. It
has weird reference count semantics. And its error semantics are
unclear, with far too many error cases that leave no clear recovery
method.

All of these things can be done better; it is just ("just") API
design.
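
As one illustration of what clearer type and error semantics could look
like, here is a sketch of a checked accessor. Everything in it is
hypothetical (`TransferError`, `get_typed`); it is not an existing or
proposed proplib interface, just an example of a lookup that either
returns a value of the requested type or fails in exactly one way.

```python
class TransferError(Exception):
    """Single error type for every lookup failure (hypothetical)."""


def get_typed(record: dict, key: str, expected: type):
    """Return record[key] only if it has exactly the expected type."""
    if key not in record:
        raise TransferError(f"missing key {key!r}")
    value = record[key]
    # bool is a subclass of int in Python; treat it as a distinct atom type.
    if type(value) is not expected:
        raise TransferError(
            f"{key!r}: expected {expected.__name__}, got {type(value).__name__}"
        )
    return value
```

A caller then has one recovery path: catch `TransferError` and reject
the whole record.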

(4) There is no support whatsoever for schemas or any other method of
specifying or validating what data is supposed to be present in what
structure. Relatedly, there is no support for handling format version
information.
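
For a sense of what minimal schema and version support might involve,
here is a sketch. The `SCHEMAS` table, the `version` field, and the
`validate` function are assumptions made up for this example; nothing
like them exists in proplib.

```python
# Hypothetical schemas, keyed by format version: field name -> atom type.
SCHEMAS = {
    1: {"name": str, "sectors": int},
    2: {"name": str, "sectors": int, "label": str},
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    version = record.get("version")
    schema = SCHEMAS.get(version) if type(version) is int else None
    if schema is None:
        return [f"unsupported version: {version!r}"]
    problems = []
    for key, expected in schema.items():
        if key not in record:
            problems.append(f"missing: {key}")
        elif type(record[key]) is not expected:
            problems.append(f"wrong type: {key}")
    for key in record:
        if key != "version" and key not in schema:
            problems.append(f"unexpected: {key}")
    return problems
```

A receiver would call `validate` once on arrival and refuse anything
that returns a non-empty list, rather than discovering missing or
mistyped fields one access at a time.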

(5) The code is bloaty. The implementation is at least 2x the size it
needs to be for the functionality it provides.

(6) The output transfer format is something everyone dislikes (XML)
which is itself bloaty and space-wasting. Furthermore, there's only
the one output format.

(These last two problems can readily be fixed by writing new code.)

Now.

It seems to me that the current proplib API is a large part of the
problem, so any significant changes for the future should probably
include deprecating it and replacing it with a new API. Note that as a
result of points 1-3, at least one developer has already written a
library whose sole purpose is to interface to proplib.

The chief question, therefore, is what data model the new stuff should
support. There are at least seven obvious candidates I can think of:

(a) What we have in proplib (arrays and string-keyed dictionaries)
only, with the explicit understanding that only tree structures are
supported and not graphs; that is, no dictionary or array can appear
more than once.

(b) Same as (a) but extend dictionaries to be keyable with arbitrary
atom types.

(c) A more general semistructured model, like (b) but that explicitly
allows graph structure without being fully graph-oriented.

(d) RDF, or more likely a tasteful subset of RDF with data types
instead of using URIs for everything. (http://www.w3.org/RDF/)

(e) Property graphs.
(https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model)

(f) Property graphs where property values can be tuples rather than
only atoms.

(g) Relations (tables of rows with named fields).
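
The tree-only restriction in (a) is mechanically checkable. The sketch
below models dictionaries and arrays as Python dicts and lists (an
assumption for illustration only) and reports whether any container is
reachable more than once; `is_tree` is a hypothetical name.

```python
def is_tree(root) -> bool:
    """True if no dict or list object is reachable more than once."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, (dict, list)):
            if id(node) in seen:
                return False          # shared or cyclic container
            seen.add(id(node))
            children = node.values() if isinstance(node, dict) else node
            stack.extend(children)
    return True
```

The check is iterative, so a cyclic structure is rejected rather than
recursing without bound. Atoms may repeat freely; only containers are
tracked.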

These all have their advantages and disadvantages. (a) and (b) are
simpler, but are also fairly limited. (c) and (d) are nearly the same
modulo how much W3C koolaid is involved. (f) rectifies a weakness of
(e) that I've run into; (c)/(d) and (f) are both supersets of (g).
The chief difference between (d) and (e) is that (e) is more
structured; RDF allows assembling contraptions that are not graphs,
like edges that point to edges.
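
A rough sketch of model (f), and of one way a relation from (g) can be
embedded in it, is given here. The class and function names are
hypothetical, and the embedding (one vertex per row, one property per
field) is just one possible choice, not something the text above
prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    # vertex id -> {property name: atom or tuple of atoms}
    vertices: dict = field(default_factory=dict)
    # list of (source id, label, target id, {property name: value})
    edges: list = field(default_factory=list)

    def add_vertex(self, vid, props):
        self.vertices[vid] = dict(props)

    def add_edge(self, src, label, dst, props=None):
        # Edges may only join existing vertices, so an edge can never
        # point at another edge.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("edge endpoints must be existing vertices")
        self.edges.append((src, label, dst, dict(props or {})))


def from_relation(name, rows):
    """Embed a relation: one vertex per row, one property per field."""
    graph = PropertyGraph()
    for index, row in enumerate(rows):
        graph.add_vertex((name, index), row)
    return graph
```

A tuple-valued property, as in (f), is then just something like
`graph.add_vertex("wd0", {"geometry": (16, 63, 1024)})`.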

I don't think the relational data model is a good choice here. The
basic question is whether we want to handle graph structure or not.
There are arguments both ways. The chief argument against is that it
isn't clear how much there's a real need to ship graph-structured data
around. One of the big arguments for, however, is that there are
several preexisting inoffensive transfer formats for graph data (e.g.,
Turtle for RDF and the graphviz dot format for property graphs) and
there is no such thing for purely hierarchical data.
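
As an illustration of how little is needed to emit one of these
formats, here is a sketch that writes a labeled directed graph in the
graphviz dot language. The `to_dot` and `quote` helpers and the graph
name `transfer` are made up for this example, and only the `label`
attribute is used; carrying arbitrary properties would need a
convention that this sketch does not attempt.

```python
def quote(text) -> str:
    """Render a value as a double-quoted dot ID, escaping embedded quotes."""
    return '"' + str(text).replace('"', '\\"') + '"'


def to_dot(vertices, edges) -> str:
    """vertices: iterable of ids; edges: iterable of (src, label, dst)."""
    lines = ["digraph transfer {"]
    for vid in vertices:
        lines.append(f"  {quote(vid)};")
    for src, label, dst in edges:
        lines.append(f"  {quote(src)} -> {quote(dst)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)
```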

I would say (based on having been dealing with graph-oriented
semistructured data in my day job for some seven years now) that the
API-level cost of dealing with graph data is negligible until you get
into iteration, which is nasty regardless, and the implementation-
level cost is small, and graph models have the nice property of being
supersets of everything else. If it were just me, I think my choice
would be (f).

However, since there is basically zero chance we want to do graph
theory on graphs we store in this thing, and we don't necessarily care
if edges point to edges and so on, (d) might be a better choice if
someone's willing to filter the W3C RDF koolaid and come up with a
coherent proposal for a model that doesn't have W3C glop coming out of
its ears. I can do this if there's demand, but I think property graphs
are a better choice.

The primary downside is that to the best of my knowledge schemas for
graph data are a research topic, although not necessarily a
particularly difficult one. (If anyone knows otherwise, please let me
know!)

On the other hand, at one point several years back during one of the
proplib arguments I spent a few hours implementing about half of a
replacement. If someone wanted to finish it, it would probably take
only a few more hours, and it would be a decent start at a replacement
proplib using data model (a).

Opinions please.

oh, and in case anyone was wondering: ultramarine, with violet and
cream accents.

-- 
David A. Holland
dholland%netbsd.org@localhost
