tech-net: Odd cpio/network behaviour

Subject: Odd cpio/network behaviour
To: None <tech-net@netbsd.org, tech-userlevel@netbsd.org>
From: Lucio De Re <lucio@proxima.alt.za>
List: tech-net
Date: 11/18/2001 18:16:24
I'm not sure which of these is the right forum.

I've been running a rather hacky backup procedure from a SCO Open
Server host to a NetBSD 1.4.2 host (both i386 thingies) using
largely "cpio" with some help from a home brewed wrapper that
provides for root setuid so we're not inhibited by permissions.

The wrapper also provides a bit of additional symmetry, one end
spawning off "find | cpio -o" more or less and the other just "cpio
-i" with a few frills that oughtn't to matter.

We're now introducing a new server, and the natural approach is to
install NetBSD 1.5.2 (thanks, NetBSD developers) and, for experimental
purposes, to start by transferring the data from the backup box,
which is normally quiescent, to the new server, rather than directly
from the SCO server.  After all, the purpose of the brand new
server is to take over some of the SCO server functions because
the latter is straining somewhat.

OK, what's this got to do with networking?  Well, the backup
procedure, down to brass tacks, looks like this:

	backend {opts} | rcmd remote backend {opts}

A while back I complained that "backend" on the remote behaved
oddly.  It still does, but that's another issue: I have found a
workaround that I don't believe affects this problem.

Now, the above eventually resolves to the following, where everything
bar the rcmd is executed as superuser, courtesy of setuid "backend":

	find {something} | cpio -o {opts} | rcmd remote cpio -i {opts}

There's a shell script that retrieves lots of useful bits from a
config file to make this happen the way we want it to, including
making sure that only "admin" is allowed to do it.  It's a moderately
safe environment, so security is not critical.

What's going wrong?  Well, the target cpio (on the 1.5.2 host)
doesn't seem to understand that the operation is complete, after
we've transferred some 7.7GBytes of data, and proceeds to consume
99.8% of the available CPU time once the other end of the rcmd link
has dropped the link.  Doesn't seem to do anything with the consumed
time, either.

Killing either end of the connection (cpio on the target, rcmd on the
source - a second rcmd has zombied off in the meantime), brings
normality back, but of course the target cpio does not report a
successful completion.

If I test the combination in various ways, I get the impression
that a minimum volume of data is the trigger for cpio's anomalous
behaviour, which seems to indicate that perhaps the network stack
is feeding cpio with something other than an EOF once the connection
is closed (CLOSE_WAIT, I think, is the state on that side, FIN_WAIT_2
on the other side).
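
To make the EOF suspicion concrete, here is a minimal sketch in Python
of what a well-behaved reader should see when the sending side closes.
It is not cpio's code and it does not use rcmd: the local socket pair,
the function names and the sizes are all assumptions chosen for
illustration.  The point is only that, once the peer has closed, every
further read returns zero bytes immediately, so a reader that did not
treat that as "done" would loop without ever blocking, which would look
much like the CPU consumption described above.

```python
import socket
import threading


def send_then_close(sock, total_bytes, chunk=65536):
    """Write total_bytes of filler, then close so the peer sees EOF."""
    payload = b"\0" * chunk
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk, remaining)
        sock.sendall(payload[:n])
        remaining -= n
    sock.close()


def drain_until_eof(sock):
    """Read until recv() returns b"" (EOF); return the byte count."""
    received = 0
    while True:
        data = sock.recv(65536)
        if not data:  # zero-length read: the peer closed its end
            return received
        received += len(data)


def run_transfer(total_bytes, extra_reads=3):
    """Stream total_bytes over a local socket pair (hypothetical bench)."""
    left, right = socket.socketpair()
    writer = threading.Thread(target=send_then_close,
                              args=(left, total_bytes))
    writer.start()
    received = drain_until_eof(right)
    # After EOF, further reads keep returning b"" without blocking.
    after_eof = [right.recv(65536) for _ in range(extra_reads)]
    writer.join()
    right.close()
    return received, after_eof


if __name__ == "__main__":
    for size in (0, 4096, 5000000):
        print(size, run_transfer(size))
```

Running the same idea with much larger volumes over a real rcmd link
would be the interesting test; this sketch only shows the behaviour one
would expect from the reading side.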

That's to say, I get perfectly rational behaviour if I transfer
small amounts of data, whether I use "backend" or "cpio".  Hard to
test properly, of course, because of ownership issues.  I may have
to set up a test bench, but I hope I won't need to.

I'm presently reverting to SCO as the initiator, as I've never
encountered this problem between SCO and NetBSD 1.4.2, so at least
I'll get one more bit of information, but I suspect the silly
behaviour is with cpio, not rcmd (I have also been using rsh in
the past, but I thought rcmd would be simpler; SCO's rsh is a
different animal).

Unfortunately, the transfer from SCO is considerably slower.  It
seems that SCO Open Server does not handle the Intel Pro/100 or
whatever very well: I had dismal results a while back, which were
improved with a bug fix, but I see a collision warning almost
permanently on the 100BaseTX hub between the SCO server and the
NetBSD 1.5.2 backup destination.  It may be a while before I reach
a point where I can tell if 1.5.2 has broken cpio.

I do note that cpio reports a negative block count on termination
and, seemingly, that it no longer recognises -Hnewc by default on
input; it needs the option explicitly.  I think this wasn't the
case with 1.4.2, but again, I haven't the opportunity to check (and
I may be mistaken).
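
One guess at the negative block count, offered purely as an assumption:
if a byte counter somewhere were held in a signed 32-bit integer, a
transfer of this size would wrap it.  The Python sketch below does the
arithmetic for the approximate 7.7 GByte figure, taken as decimal
gigabytes.  The helper names and the 512-byte block size are
hypothetical, and nothing here has been checked against the cpio
source.

```python
def as_signed_32(value):
    """Reduce an integer to what a signed 32-bit counter would hold."""
    value &= 0xFFFFFFFF
    if value >= 0x80000000:
        return value - 0x100000000
    return value


def reported_blocks(total_bytes, block_size=512):
    """Block count derived from a wrapped byte counter (assumption)."""
    wrapped = as_signed_32(total_bytes)
    # Truncate toward zero, as C integer division would.
    return int(wrapped / block_size)


if __name__ == "__main__":
    total = 7700000000  # roughly 7.7 GBytes, decimal
    print(as_signed_32(total))
    print(reported_blocks(total))
```

A transfer that stayed under 2 GBytes would not wrap such a counter,
which would at least fit with small transfers looking sane.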

Any suggestions on what additional steps to take gratefully accepted.
The only recommendation I'll have trouble with is to replace cpio
with tar or pax.  Tar may be OK, but I'm not thrilled about it;
pax I'd have to shoehorn onto the SCO server and I'm not too keen.
It's an old version of SCO Open Server, I don't want to tickle any
more unpleasant behaviour from it.

Interestingly, the new server is a dual-CPU Intel Pentium III host
(Tupelo motherboard); I'm hoping to get NetBSD-MP on it before it
becomes obsolete.  While I'm still testing, I could install an MP
kernel on it, if anyone is interested.

++L