Subject: Re: dd and pipe question
To: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
From: Greg A. Woods <woods@most.weird.com>
List: current-users
Date: 04/12/1996 10:32:07
[ On Thu, April 11, 1996 at 08:14:18 (-0400), der Mouse wrote: ]
> Subject: Re:  dd and pipe question
>
> > gzip -dc foo.tar.gz | dd ibs=10240 obs=10240 conv=sync of=/dev/tape
> >
> > However, (probably because of the pipe) there are _multiple_ partial
> > input blocks.
> 
> Yes.  dd is not the right tool here.  The effect you want can be
> expressed as "dd ibs=1 obs=10240 ....", but that is ruinously
> inefficient.
> 
> Unfortunately, I know of no other tool shipped with NetBSD that is
> capable of blocking out writes like this.

I would have said the effect desired was to read with an input block
size equal to the volume of the pipe.  This is indeed what dd(1) does
[is supposed to do] on normal UNIX implementations if both ibs and obs
are set to a value larger than the volume of a pipe.
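
The partial input blocks in the original question come from read(2) on a
pipe returning whatever happens to be available instead of waiting for a
full ibs worth of data.  A minimal sketch of that behaviour, assuming a
POSIX system; pipe_read_once() is a hypothetical helper written for this
illustration and is not taken from dd or catblock:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Put `len` bytes into a fresh pipe, then issue ONE read() asking for
 * `want` bytes.  Returns what read() returned, or -1 on any failure.
 * `len` must be small enough to fit in the pipe without blocking. */
ssize_t pipe_read_once(size_t len, size_t want)
{
    int p[2];
    ssize_t n = -1;
    char *buf;

    assert(len > 0 && want > 0);
    buf = calloc(len > want ? len : want, 1);
    if (buf == NULL)
        return -1;
    if (pipe(p) == 0) {
        if (write(p[1], buf, len) == (ssize_t)len)
            n = read(p[0], buf, want);  /* returns at most what is queued */
        close(p[0]);
        close(p[1]);
    }
    free(buf);
    return n;
}
```

Asking for 10240 bytes when only 100 are queued gives a 100 byte "input
block", which is exactly the case a reblocking tool has to cope with.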

Of course there are lots of broken dd's out there that either have the
effect of setting ibs=1, or indeed some which pad each output block for
every input block.

So, indeed the use of dd in the original posting should be both optimal
and correct, given that all other dependencies are correct (though one
need only set "bs=10240").

>  I have one, called
> "catblock", designed specifically for reading its input as a
> byte-stream and writing it as a block-stream.

I've not yet tried to explore all of the corner and edge conditions in
your code to find just when the padding operation will be done.  However
it seems to me that it is in general wrong to ever pad out any block
except the very last block.  Catblock may only pad the last block, but
the code is sufficiently obtuse as to mask this feature of the algorithm
(at least in my reading of it).

If indeed catblock can pad blocks other than the last one, then it is
solving the wrong problem.  It may indeed work OK with tar in ideal
circumstances, but I'd fear it could terribly mess up a cpio archive,
and perhaps even a dump file, and it certainly would mess up any
non-block structured file if pipe semantics work oddly, or if you're
using a remote TCP/IP socket to connect the processes.

If we assume the UNIX I/O model, and we assume the output block is
always going to be bigger than the input block, then I would think the
algorithm should be something like the following:

unset input empty flag;
while input not empty {
	set output block empty;
	while output block not full {
		try to read just enough data to fill output block;
		if error reading {
			exit indicating the input error;
		}
		if EOF on input {
			set input empty flag;
			break;
		}
	}
	if output block empty {		/* EOF fell on a block boundary */
		break;
	}
	if EOF on input {		/* purely a safety optimization */
		/* this is not necessary if output block is first zeroed */
		if output block not full {
			pad output block to fill it with zeros;
		}
	}
	write the output block;
	if there is an error or EOF writing {
		exit indicating the output error;
	}
}
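
A minimal C sketch of the loop above, assuming POSIX read(2) and write(2)
on plain file descriptors.  The names fill_block() and reblock() and the
return codes are invented for this illustration; this is not catblock's
code and not dd's.  The sketch starts each pass with an empty block and
writes nothing when EOF arrives exactly on a block boundary, which is one
reading of the intent.  Each block goes out in a single write(), since on
a tape every write() is one record, and a short write is treated as an
output error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Keep reading until `size` bytes are collected or EOF is seen.
 * Returns the number of bytes collected, or -1 on a read error. */
static ssize_t fill_block(int fd, char *buf, size_t size)
{
    size_t have = 0;

    while (have < size) {
        ssize_t n = read(fd, buf + have, size - have);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            return -1;
        }
        if (n == 0)
            break;              /* EOF on input */
        have += (size_t)n;
    }
    return (ssize_t)have;
}

/* Copy `in` to `out` in blocks of exactly `bs` bytes, zero-padding only
 * the final block.  Returns 0 on success, 1 on an input error, 2 on an
 * output error, 3 if the block buffer cannot be allocated. */
int reblock(int in, int out, size_t bs)
{
    char *block;
    int status = 0;

    assert(bs > 0);
    block = malloc(bs);
    if (block == NULL)
        return 3;

    for (;;) {
        ssize_t n, got = fill_block(in, block, bs);

        if (got < 0) { status = 1; break; }
        if (got == 0)
            break;              /* EOF on a block boundary */
        if ((size_t)got < bs)   /* short block can only mean EOF */
            memset(block + got, 0, bs - (size_t)got);

        do {
            n = write(out, block, bs);
        } while (n < 0 && errno == EINTR);
        if (n != (ssize_t)bs) { status = 2; break; }

        if ((size_t)got < bs)
            break;              /* that was the padded last block */
    }
    free(block);
    return status;
}
```

Called as reblock(STDIN_FILENO, tape_fd, 10240), where tape_fd is a
hypothetical descriptor already opened on the tape device, it would take
the place of the dd stage in the pipeline from the original question.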

Catblock seems to do this, though the padding in the padwrite() function
scares the hell out of me, and I'd never write it that way!  ;-)  Perhaps
I'm just being tripped up by your naming conventions....

Note that a traditional UNIX implementation of dd(1) essentially does
(or should by definition) implement the algorithm I describe, if you
just set both ibs and obs to the desired output block size, i.e.
"bs=OUTPUT_BLOCK_SIZE"; and of course if you also set "conv=sync" the
last block will be padded too, avoiding a write error on a true block
device.  I think GNU dd now does this, though earlier versions were, as
I recall, broken in that they padded every output block whenever less
data was read than expected, which often happened on a socket.

I recall finding this (or a related) bug when writing remote backup
scripts in an AIX-3.1 environment, where the default archive tools were
very unforgiving, and the tape device properly enforced the QIC block
structure.  IBMers do know how to deal properly with block I/O and tape
drives, but GNU dd didn't.  I don't remember why I ever even tried GNU
dd though....

(Indeed even a normal stdio implementation of cat(1) should work with
optimum I/O efficiency, provided it sets the i/o buffer size bigger than
the pipe volume and you don't require the padding of the last block.)
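
A rough sketch of that stdio variant, assuming standard C setvbuf(); the
name copy_stream() is made up for this illustration.  Whether the library
then issues writes of exactly the buffer size is up to the stdio
implementation, so this shows the idea and guarantees nothing about
record sizes, and it does no padding of the last block.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out` through fully buffered stdio streams whose buffers
 * are `bufsize` bytes each.  Both streams must be freshly opened, since
 * setvbuf() has to be called before any other operation on a stream.
 * Returns 0 on success, 1 on a read error, 2 on a write error,
 * 3 if the buffers cannot be set up. */
int copy_stream(FILE *in, FILE *out, size_t bufsize)
{
    int c;

    assert(bufsize > 0);
    if (setvbuf(in, NULL, _IOFBF, bufsize) != 0 ||
        setvbuf(out, NULL, _IOFBF, bufsize) != 0)
        return 3;

    /* getc/putc work on the buffers; the library does the real
     * read() and write() calls when a buffer empties or fills. */
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return 2;
    }
    if (ferror(in))
        return 1;
    if (fflush(out) == EOF)
        return 2;
    return 0;
}
```

A cat built on this would call copy_stream(stdin, stdout, 10240) as the
first thing in main(), before touching either stream.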

-- 
							Greg A. Woods

+1 416 443-1734			VE3TCP			robohack!woods
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>