Subject: Re: sup hell
To: None <current-users@NetBSD.ORG>
From: Daniel Carosone <email@example.com>
Date: 11/06/1995 11:14:17
Here's something I circulated around a few people a little while
back. I'm offering this up as a straw-man proposal; I'd like to get as
much detailed feedback as possible.
Sup seems to be causing some problems lately, for varying reasons.
While sup seems to do the job well enough for most of us most of the
time, it has some problems. It also seems unnecessarily complicated.
Here are some ideas for a really simple sup replacement. It works with
lists of files. Each step described here creates a new list; this is
to make the explanation more precise. In practice, you'd probably
modify at least some of the lists in-place.
List [a]: The server generates a list of all files currently in the
source tree, and their respective checksums, each time the
source is updated. The server is free to use whatever
optimisations it likes based on file times or whatever to
cache entries in the list and otherwise reduce the cost of
generating it. Regenerating the list from scratch when
things become confused should always be possible. This list
is saved and gzipped.
List [b]: The client generates a similar list of what's on the local
disk, with checksums. Again, similar optimisations are
possible, but should always be able to be bypassed.
List [c]: The client keeps a list of all files ever offered by the
server. This is used for delete tracking. It starts out
empty for the first connection.
List [d]: The user provides the client with a list of files to ignore,
much like sup's refuse file. Ideally, this would be a list
of regexes rather than filenames, but the intent is the
same.
List [e]: The server can optionally keep a list of files changed in
the last server update. This is for an optimisation
described further down.
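
Lists [a] and [b] could be built along these lines. This is only a rough
sketch, in Python for illustration: the function name build_list, the
choice of SHA-256 as the checksum, and the dict layout are all
assumptions, not part of the proposal. It always regenerates from
scratch, which is the fallback the caching optimisations must preserve.

```python
import hashlib
import os


def build_list(root):
    """Return {relative path: hex checksum} for every regular file under root."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Symlinks are an open question in the proposal; skip them here.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            digest = hashlib.sha256()  # checksum algorithm is an assumption
            with open(full, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
            entries[os.path.relpath(full, root)] = digest.hexdigest()
    return entries
```

The server would run this over the source tree for list [a], the client
over its local copy for list [b].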
With all these lists in hand at the respective ends, the client
connects to the server, and fetches list [a].
The client compares lists [a] and [b], and creates a new list:
List [f]: All files in list [a] and not in list [b] are added to list
[f].
All files in list [a] and in list [b], but with differing
checksums, are added to list [f].
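
As a sketch of that comparison, assuming lists [a] and [b] are held as
dicts mapping path to checksum (the representation and the function
name are my assumptions):

```python
def files_to_update(list_a, list_b):
    """List [f]: paths in [a] that are absent from [b] or differ in checksum."""
    return sorted(
        path
        for path, checksum in list_a.items()
        # .get() returns None for a missing path, which never equals a checksum
        if list_b.get(path) != checksum
    )
```

Files that exist only in list [b] are deliberately left alone here;
deletes are handled separately via list [c].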
The client compares lists [a] and [c], and creates two new lists:
List [g]: All files in list [c] and not in list [a] are added to list
[g].
List [h]: All files in either list [a] or list [c].
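
The delete tracking is plain set arithmetic. A minimal sketch, assuming
both lists can be treated as collections of path names (the function
name is hypothetical):

```python
def track_deletes(list_a, list_c):
    """Return (list_g, list_h) from the server list [a] and the history list [c]."""
    offered_now = set(list_a)
    offered_before = set(list_c)
    # [g]: offered at some point in the past, but gone from the server now.
    list_g = sorted(offered_before - offered_now)
    # [h]: everything ever offered, including this connection.
    list_h = sorted(offered_before | offered_now)
    return list_g, list_h
```

On the first connection list [c] is empty, so list [g] is empty and
list [h] is simply list [a].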
The client then compares (or regex matches) lists [f] and [g] with
list [d], and creates two new lists:
List [i]: All files in list [f] and not in list [d]. This list then
represents the files that must be updated from the server.
List [j]: All files in list [g] and not in list [d]. This list
represents files deleted on the server, files which may want
to be deleted by the client.
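
Filtering against list [d] might look like the following, assuming the
regex variant of list [d] and Python's re syntax (both assumptions; the
function name is made up for illustration). The same function produces
list [i] from list [f] and list [j] from list [g].

```python
import re


def apply_ignore(paths, ignore_patterns):
    """Drop every path that matches any pattern in the ignore list [d]."""
    compiled = [re.compile(pattern) for pattern in ignore_patterns]
    return [
        path
        for path in paths
        # search() matches anywhere in the path; anchor patterns with ^ or $
        if not any(rx.search(path) for rx in compiled)
    ]
```

For example, apply_ignore(list_f, [r"^obj\."]) would keep a client from
ever fetching anything under an obj.* directory.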
The client sends list [i] to the server.
The server should check the received list [i] for naughty filenames,
like .. or absolute paths. Probably the best way to do this, and other
sanity checking, is to check each entry in the client-supplied list
[i] against the server's copy of list [a].
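
A sketch of that check, under the assumption that list [a] is a dict
keyed by path and that a bad request should abort the whole connection
(the function name and the choice of exception are mine):

```python
def check_request(list_i, list_a):
    """Accept list [i] only if every entry is a path the server itself listed."""
    known = set(list_a)
    rejected = [path for path in list_i if path not in known]
    if rejected:
        # Covers "..", absolute paths and anything else not in list [a].
        raise ValueError("client requested unknown files: %r" % rejected)
    return list(list_i)
```

Since list [a] is generated by the server from its own tree, membership
in it rules out the naughty names without any special-case parsing.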
The server pipes this list to tar or cpio as a list of files to
archive, and pipes the result back down to the client (optionally via
gzip), which extracts them.
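
For illustration only, here is the same transfer step using Python's
tarfile module in place of a pipe to tar or cpio. The function names are
hypothetical, and the network transport is left out entirely: the
archive is just a bytes object passed from one function to the other.

```python
import io
import os
import tarfile


def pack_files(root, paths):
    """Server side: archive the checked list [i] as a gzipped tar in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in paths:
            # recursive=False so a directory name never drags in extra files
            tar.add(os.path.join(root, path), arcname=path, recursive=False)
    return buf.getvalue()


def unpack_files(data, dest):
    """Client side: extract the received archive under dest."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(dest)
```

A real client would want to check the archive member names before
extracting, for the same reason the server checks list [i].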
Somewhere in here, perhaps at the end, or just before extracting
files, list [h] is saved to become list [c] on the next invocation.
Adam Glass feels that a very large majority of sup connections are
fetching the same set of files, particularly the set of files changed
in the last update. This has been recorded in list [e], to allow an
optimisation on the server:
When the server gets list [i], and has done its security and sanity
checking against list [a], the server can check list [i] against list
[e]. If the lists are identical, the server sends a pre-generated
tarfile.
List [e] can of course be generalised to as many (list,tarfile) pairs
as desired, and the tradeoff becomes server IO load to create new
tarfiles vs. disk space, list-compare time (which in most cases will
be quite quick if they differ), and maintenance.
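
The lookup on the server could be as small as this. It is a sketch under
my own assumptions: the (list,tarfile) pairs are kept as a Python list
of (file_list, tar_path) tuples, and "identical" is taken to mean the
same set of names regardless of order.

```python
def find_pregenerated(list_i, pregenerated):
    """Return the tarfile path whose file list matches list [i], else None."""
    wanted = sorted(list_i)
    for file_list, tar_path in pregenerated:
        # Comparing lengths first makes the common mismatch case cheap.
        if len(file_list) == len(wanted) and sorted(file_list) == wanted:
            return tar_path
    return None
```

When this returns None the server falls back to building the archive
for that client on the fly.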
Unresolved issues (suggestions welcome):
- symlinks (shouldn't occur in a cvs tree). Tar (with the
right switches) will handle these, but will want a little
thought to make sure the right thing is happening.
- directory deletion. Sup doesn't do this well either. Maybe
we need the lists to include files and directories (marked
somehow) so we can spot directories disappearing. Have to
make sure not to feed directory names to tar/cpio though, or
you'll fetch things twice, and might get other unwanted files
too (like list [d] ignored ones).
- have to transfer all of list [a] each connection. The
alternative is a more complex system, with timestamps that I
just don't trust, that starts to look a lot like sup. For my
source tree, find /usr/src/.. -print was around 700k, this
gzipped to 85k. This includes obj.sparc/* from a full build,
so should in rough figures indicate the size with checksums
in there instead.
The last thing that's needed is a little protocol glue at the start to
specify a "collection" -- which basically is just a selector for what
you get as list [a].
All of this could be done in a few dozen lines of perl. No arguments
about perl for the moment, please. If/when I implement this, it will
be in perl; if someone else wants to convert it into C or sh or
whatever, that's fine.
I'd like some feedback on these suggestions. I wouldn't mind
implementing this, but I won't be able to start that for a week or two
at least. Time for folks to think it over and offer suggestions.