tech-userlevel: Re: Inquiry re: rsync replacement

Subject: Re: Inquiry re: rsync replacement
To: None <tech-userlevel@netbsd.org>
From: Michiel Buddingh' <ajuin@stack.nl>
List: tech-userlevel
Date: 04/24/2006 17:04:40
On Mon, Apr 24, 2006 at 03:15:00PM +0200, Hubert Feyrer wrote:
> On Mon, 24 Apr 2006, Jan Schaumann wrote:
> IIRC the idea behind this 'rsync replacement' was to get a
> mixture between SUP(server) and rsync: rsync runs over all the
> disk and looks what's new, and doing that for many concurrent
> clients is thrashing the disk very much. SUP(server) on the
> other hand does some periodic scans of "what's new" (or when
> it got new), and when a client comes in, it already knows
> what's new (for the client).

That's what cvsup/csup/cvsync do; they use the information stored by
cvs to know the differences between the client's version of the
fileset and the server's version.  What rsync does is slightly 
different; it establishes the differences between the remote fileset
and local fileset when they cannot be known beforehand (e.g. the
files are not stored in cvs).

That's not to say what you propose isn't possible.  What normal
rsync does is scan the files on the client side, generate a sorted
list of files and hashes, and send it to the server, which then
scans its own files for differences (by far the most I/O-intensive
part) and sends them back.

An alternative approach is to let the server generate the list of
files and hashes (which might even be an mtree spec file(!)), send it
to the client, then let the client do the scanning while the server
waits for the client to request the sections of files that are
different.
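
A minimal sketch of that server-generated list, in Python.  Everything
in it is an assumption made for illustration: the fixed BLOCK_SIZE, the
use of SHA-256, and the names build_manifest and blocks_to_request.  It
is not rsync's rolling-checksum algorithm, and it ignores deletions,
truncated files and empty files.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size; not taken from rsync


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return the SHA-256 hex digest of each fixed-size block of a file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes


def build_manifest(root, block_size=BLOCK_SIZE):
    """Server side: map each relative path under root to its block hashes."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            manifest[rel] = block_hashes(full, block_size)
    return manifest


def blocks_to_request(manifest, local_root, block_size=BLOCK_SIZE):
    """Client side: list the (path, block_index) pairs that differ locally."""
    wanted = []
    for rel in sorted(manifest):
        remote = manifest[rel]
        local_path = os.path.join(local_root, rel)
        if os.path.isfile(local_path):
            local = block_hashes(local_path, block_size)
        else:
            local = []  # missing file: every block has to be fetched
        for index, digest in enumerate(remote):
            if index >= len(local) or local[index] != digest:
                wanted.append((rel, index))
    return wanted
```

The server only runs build_manifest when the archive changes; each
client then does its own scanning in blocks_to_request and asks for
the listed blocks.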

It is important that the file/hash list remains in sync with the archive
itself, but if that can be taken care of (possibly by using something
like sysutils/fam?) there's no reason why rsync should cause any
more load on the server than an FTP or HTTP server.  Of course,
client-side I/O load would go up considerably...
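
One way to keep the list in sync without change notification is a
periodic scan that rehashes only files whose size or mtime changed.
The following Python sketch assumes a cache of my own invention (a
dict mapping relative path to ((size, mtime_ns), digest)) and a
hypothetical refresh_cache function; it does not use fam.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a whole file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh_cache(root, cache):
    """Rescan root, rehashing only files whose size or mtime changed.

    Returns (new_cache, sorted list of paths that were rehashed).
    Files that no longer exist simply drop out of the new cache.
    """
    fresh = {}
    rehashed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            stamp = (st.st_size, st.st_mtime_ns)
            old = cache.get(rel)
            if old is not None and old[0] == stamp:
                fresh[rel] = old  # unchanged: reuse the stored digest
            else:
                fresh[rel] = (stamp, file_digest(full))
                rehashed.append(rel)
    return fresh, sorted(rehashed)
```

A change that preserves both size and mtime would go unnoticed, so
this is a trade of accuracy for less disk I/O on the server.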

The reason why rsync didn't take this approach in the first place
is probably that it takes one extra round-trip, which slows the transfer
down a bit, and that there's no real gain for occasional, one-off
transfers where it doesn't matter which side has to do more work.

> In other words: the goal is to reduce load on the server, while keeping
> the interface towards the client.

That's difficult.  By its very nature, rsync has to exchange a lot of
information, but can't waste a lot of bytes doing so, because that might
offset any speedup gained.  I don't believe the rsync developers have
ever made any attempt to standardise their protocol, and newer versions
of rsync tend to use slightly different versions of the protocol (and
can switch back to an older version of the protocol if they encounter
an older version on the other side of the connection).
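
The fallback itself is simple to sketch; the hard part is everything
each version implies on the wire.  In this Python sketch the function
name negotiate_version and its parameters are hypothetical, and it
assumes each side announces the highest version it speaks and both
then use the lower of the two.

```python
def negotiate_version(local_max, remote_max, local_min):
    """Pick the protocol version both sides can speak.

    local_max  -- newest version this side implements
    remote_max -- newest version announced by the peer
    local_min  -- oldest version this side still implements
    """
    agreed = min(local_max, remote_max)
    if agreed < local_min:
        raise ValueError(
            "peer speaks version %d, oldest supported is %d"
            % (remote_max, local_min)
        )
    return agreed
```

A compatible replacement would need a working implementation behind
every version between local_min and local_max.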

Compatibility with rsync, apart from being at odds with the proposed
performance improvements, would require implementing not one, but several
undocumented protocols.

-- 
		-- Michiel