Subject: Re: sup hell
To: None <current-users@NetBSD.ORG>
From: Daniel Carosone <dan@anarres.mame.mu.oz.au>
List: current-users
Date: 11/06/1995 11:14:17
Here's something I circulated around a few people a little while
back. I'm offering it up as a straw-man proposal; I'd like to get as
much detailed feedback as possible.

---

Sup has been causing some problems lately, for varying reasons. While
it does the job well enough for most of us most of the time, it has
its share of trouble, and it also seems unnecessarily complicated.

Here are some ideas for a really simple sup replacement. It works with
lists of files. Each step described here creates a new list; this is
to make the explanation more precise. In practice, you'd probably
modify at least some of the lists in-place.

List [a]: The server generates a list of all files currently in the
          source tree, and their respective checksums, each time the
          source is updated. The server is free to use whatever
          optimisations it likes based on file times or whatever to
          cache entries in the list and otherwise reduce the cost of
          generating it. Regenerating the list from scratch when
          things become confused should always be possible. This list
          is saved and gzipped.
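
To make the shape concrete, here's a minimal perl sketch of the list
generation, usable for [a] or [b]. File::Find and Digest::MD5 are
assumptions, as is the one-"path checksum"-per-line format; sorting
the output would keep the later comparisons cheap:

    use File::Find;
    use Digest::MD5;

    # walk the tree given on the command line and emit one
    # "path checksum" line per plain file.
    my $root = shift or die "usage: mklist tree-root\n";
    find(sub {
        return unless -f $_;              # plain files only, for now
        open(my $fh, '<', $_) or return;
        binmode($fh);
        my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);
        print "$File::Find::name $sum\n";
    }, $root);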

List [b]: The client generates a similar list of what's on the local
          disk, with checksums. Again, similar optimisations apply,
          but it should always be possible to bypass them.

List [c]: The client keeps a list of all files ever offered by the
          server. This is used for delete tracking. It starts out
          empty for the first connection.

List [d]: The user provides the client with a list of files to ignore,
          much like sup's refuse file. Ideally, this would be a list
          of regexes rather than filenames, but the intent is the
          same.

List [e]: The server can optionally keep a list of files changed in
          the last server update. This is for an optimisation
          discussed later.


With all these lists in hand at the respective ends, the client
connects to the server, and fetches list [a].


The client compares lists [a] and [b], and creates a new list:

List [f]: All files in list [a] and not in list [b] are added to list
          [f]. 
          All files in list [a] and in list [b], but with differing
          checksums, are added to list [f]. 


The client compares lists [a] and [c], and creates two new lists:

List [g]: All files in list [c] and not in list [a] are added to list
          [g].

List [h]: All files in either list [a] or list [c].
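
Both comparisons fall out of a few hash operations. A sketch, on the
assumption that each list has already been read into a hash keyed by
filename (%a and %b map filenames to checksums; %c just holds names):

    my (@f, @g);

    for my $file (keys %a) {
        # new on the server, or changed => needs updating
        push @f, $file
            if !exists $b{$file} || $a{$file} ne $b{$file};
    }
    for my $file (keys %c) {
        # offered before, not offered now => deleted on the server
        push @g, $file unless exists $a{$file};
    }
    # list [h]: the union of [a] and [c], saved later as list [c]
    my %h = map { $_ => 1 } (keys %a, keys %c);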


The client then compares (or regex matches) lists [f] and [g] with
list [d], and creates two new lists:

List [i]: All files in list [f] and not in list [d]. This list then
          represents the files that must be updated from the server.

List [j]: All files in list [g] and not in list [d]. This list
          represents files deleted on the server, files which may want
          to be deleted by the client.
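
Assuming list [d] holds perl patterns, one per line, and is already
open on a (hypothetical) REFUSE handle, the filtering might look like:

    # patterns read from list [d], one per line
    chomp(my @refuse = <REFUSE>);

    sub refused {
        my ($file) = @_;
        for my $pat (@refuse) {
            return 1 if $file =~ /$pat/;
        }
        return 0;
    }

    my @i = grep { !refused($_) } @f;  # must be fetched from the server
    my @j = grep { !refused($_) } @g;  # deleted there; may delete locally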



The client sends list [i] to the server.

The server should check the received list [i] for naughty filenames,
like .. or absolute paths. Probably the best way to do this, and other
sanity checking, is to check each entry in the client-supplied list
[i] against the server's copy of list [a].

The server pipes this list to tar or cpio as a list of files to
archive, and pipes the result back down to the client (optionally via
gzip), which extracts them. 
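
The server's half might be as small as this; the tar flags are a
guess, and "-T -" (read the file list from stdin) is a GNU tar
feature:

    # %a is the server's own list [a].  Dropping anything not in it
    # also disposes of ".." games and absolute paths for free.
    my @ok = grep { exists $a{$_} } @i;

    # feed the vetted names to tar; the archive goes back down the
    # connection (our stdout) through gzip.
    open(TAR, '| tar -cf - -T - | gzip') or die "can't run tar: $!";
    print TAR "$_\n" for @ok;
    close(TAR) or die "tar pipeline failed\n";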

Somewhere in here, perhaps at the end, or just before extracting
files, list [h] is saved to become list [c] on the next invocation.



Adam Glass feels that a very large majority of sup connections fetch
the same set of files, particularly the set of files changed in the
last update. That set has been recorded in list [e], to allow an
optimisation on the server:

When the server gets list [i], and has done its security and sanity
checking against list [a], the server can check list [i] against list
[e]. If the lists are identical, the server sends a pre-generated
tarfile instead.

List [e] can of course be generalised to as many (list,tarfile) pairs
as desired; the tradeoff becomes server I/O load to create new
tarfiles vs. disk space, list-compare time (usually quick when the
lists differ, since a mismatch shows up early), and maintenance.
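
The check itself is a one-line comparison, assuming list [e] is in @e
and the pre-built archive sits beside it (the filename here is made
up):

    my $cached = 'last-update.tar.gz';

    if (join("\n", sort @ok) eq join("\n", sort @e)
        && open(PRE, $cached)) {
        print while <PRE>;      # ship the canned archive verbatim
        close(PRE);
    } else {
        # fall back to the tar pipeline above
    }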


Unresolved issues (suggestions welcome):
	- symlinks (shouldn't occur in a cvs tree). Tar (with the
	  right switches) will handle these, but it will want a little
	  thought to make sure the right thing happens.
	- directory deletion. Sup doesn't do this well either. Maybe
	  we need the lists to include files and directories (marked
	  somehow) so we can spot directories disappearing. Have to
	  make sure not to feed directory names to tar/cpio though, or
	  you'll fetch things twice, and might get other unwanted files
	  too (like list [d] ignored ones).
	- have to transfer all of list [a] on each connection. The
	  alternative is a more complex system, with timestamps that I
	  just don't trust, that starts to look a lot like sup. For my
	  source tree, the output of find /usr/src/.. -print was around
	  700k, which gzipped to 85k. That includes obj.sparc/* from a
	  full build, so it should give a rough idea of the size with
	  checksums in there instead.

The last thing that's needed is a little protocol glue at the start
to specify a "collection", which is basically just a selector for
what you get as list [a].
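
That glue could be a single line from the client naming the
collection, which the server turns into a lookup (the directory
layout here is invented):

    # first line of the connection names the collection
    chomp(my $coll = <STDIN>);
    $coll =~ /^[\w.-]+$/ or die "bad collection name\n";
    my $list = "/usr/supserver/$coll/list-a.gz";
    -f $list or die "no such collection: $coll\n";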

All of this could be done in a few dozen lines of perl. No arguments
about perl for the moment, please. If/when I implement this, it will
be in perl, if someone else wants to convert it into C or sh or
whatever, that's fine.

I'd like some feedback on these suggestions. I wouldn't mind
implementing this, but I won't be able to start that for a week or two
at least. Time for folks to think it over and offer suggestions.