current-users: Re: dd and pipe question

Subject: Re: dd and pipe question
To: None <current-users@NetBSD.ORG>
From: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
List: current-users
Date: 04/13/1996 07:24:30
>>> gzip -dc foo.tar.gz | dd ibs=10240 obs=10240 conv=sync of=/dev/tape

>>> However, (probably because of the pipe) there are _multiple_
>>> partial input blocks.

>> Yes.  dd is not the right tool here.  The effect you want can be
>> expressed as "dd ibs=1 obs=10240 ....", but that is ruinously
>> inefficient.

> I would have said the effect desired was to read with an input block
> size equal to the volume of the pipe.

Well, yes, the real desired effect is to take the input as a
byte-stream and block it for the output writes.
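To make "take the input as a byte-stream and block it" concrete, here
is a minimal sketch in C.  It is not the catblock code; read_full() and
reblock() are names invented for this illustration, and the sketch
assumes it is acceptable to ignore EINTR and to treat a short write as
an error.

```c
#include <assert.h>
#include <unistd.h>

/* Keep calling read() until buf holds `want` bytes or EOF is reached.
 * Returns the number of bytes gathered, or -1 on a read error. */
ssize_t read_full(int fd, char *buf, size_t want)
{
    size_t have = 0;

    while (have < want) {
        ssize_t n = read(fd, buf + have, want - have);
        if (n < 0)
            return -1;
        if (n == 0)             /* EOF: hand back whatever we have */
            break;
        have += (size_t)n;
    }
    return (ssize_t)have;
}

/* Copy `in` to `out` so that every write except possibly the last is
 * exactly `blocksize` bytes, however the input reads were split up.
 * No padding is done here.  Returns 0 on success, -1 on error. */
int reblock(int in, int out, char *buf, size_t blocksize)
{
    assert(blocksize > 0);
    for (;;) {
        ssize_t n = read_full(in, buf, blocksize);
        if (n < 0)
            return -1;
        if (n == 0)             /* input ended on a block boundary */
            return 0;
        if (write(out, buf, (size_t)n) != n)
            return -1;
        if ((size_t)n < blocksize)  /* short block: EOF was hit */
            return 0;
    }
}
```

The point of the inner loop is that a short read() from a pipe only
means "no more data yet", so the block is written only once it is full
or the input has really ended.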

This is not what dd is designed for.  If you're copying between two
traditional tape drives (i.e., half-inch, or 8mm, or something else that
believes in variable-sized records), you want a less-than-full-size
input record to produce a less-than-full-size output record.  I have
never seen a dd spec precise enough to say how the behavior of dd bs=N
should differ from that of dd ibs=N obs=N, except that "if no
conversion is specified, [the former] is particularly efficient since
no copy need be done".

> This is indeed what dd(1) does [is supposed to do] on normal UNIX
> implementations if both ibs and obs are set to a value larger than
> the volume of a pipe.

I can't see how.  I'd expect dd ibs=10240 obs=10240, or dd bs=10240, to
behave this way:

- read(input,ibuffer,10240) -> get N
- copy N from ibuffer to obuffer
- write(output,obuffer,N) (or 10240 instead of N if conv=sync)
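The three steps above could be sketched in C roughly as follows.  This
is only my reading of the expected behavior, not code from any real dd;
copy_records() and its parameters are hypothetical, a single buffer
stands in for ibuffer and obuffer, and a short write is simply treated
as an error.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* One read() produces exactly one write().  With sync_pad set (as with
 * conv=sync), a short record is padded with NULs to the full bs bytes;
 * otherwise it is written at the size it was read.
 * Returns 0 at EOF, -1 on error. */
int copy_records(int in, int out, char *buf, size_t bs, int sync_pad)
{
    assert(bs > 0);
    for (;;) {
        ssize_t n = read(in, buf, bs);  /* a pipe may return fewer than bs */
        size_t len;

        if (n < 0)
            return -1;
        if (n == 0)
            return 0;
        len = (size_t)n;
        if (sync_pad && len < bs) {
            memset(buf + len, 0, bs - len);  /* pad this record, whichever it is */
            len = bs;
        }
        if (write(out, buf, len) != (ssize_t)len)
            return -1;
    }
}
```

Note that nothing in this loop waits for a full input block, so every
short read from the pipe yields its own (possibly padded) output
record, which matches the multiple partial blocks reported above.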

>> I have one, called "catblock", designed specifically for reading its
>> input as a byte-stream and writing it as a block-stream.  [code]

> I've not yet tried to explore all of the corner and edge conditions
> in your code to find just when the padding operation will be done.
> However it seems to me that it is in general wrong to ever pad out
> any block except the very last block.

Agreed.

> Catblock may only pad the last block, but the code is sufficiently
> obtuse as to mask this feature of the algorithm (at least in my
> reading of it).

Look at when padwrite() is called.  There are only three places.  In
two of them, an exit() follows immediately (and thus they apply to only
the last block); the third is controlled by if (ifill == blocksize).
Because blocksize%padsize is verified to be zero, ifill%padsize will be
zero in padwrite() whenever the third case obtains.
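The arithmetic behind that last claim can be shown with a small sketch.
The real padwrite() is not reproduced here; pad_needed() is a
hypothetical helper that assumes the padding rule is "round ifill up to
the next multiple of padsize".

```c
#include <assert.h>
#include <stddef.h>

/* Number of pad bytes needed to bring `ifill` up to a multiple of
 * `padsize`.  Hypothetical helper, not the actual catblock code. */
size_t pad_needed(size_t ifill, size_t padsize)
{
    size_t rem;

    assert(padsize > 0);
    rem = ifill % padsize;
    return rem ? padsize - rem : 0;
}
```

With blocksize 10240 and padsize 512, a full block gives
pad_needed(10240, 512) == 0, so no padding is added in the
ifill == blocksize case; a final partial block of 1000 bytes gives
pad_needed(1000, 512) == 24.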

> [...] I would think the algorithm should be something like the
> following:  [...]

This is in fact what I'm trying to do with catblock.

> Catblock seems to do this, though the padding in the padwrite()
> function scares the hell out of me, and I'd never write it that way!
> ;-)  Perhaps I'm just being tripped up by your naming conventions....

Welllll...the alternative is to duplicate the padding code in the two
last-block cases.  Granted, the current code performs a remainder
computation and check that's unnecessary for all but the last block.

> Note that a traditional UNIX implementation of dd(1) essentially does
> (or should by definition) implement the algorithm I describe, if you
> just set both ibs and obs to the desired output block size, i.e.
> "bs=OUTPUT_BLOCK_SIZE";

I don't think so; see my remarks above.  Short of some real definition
of what "traditional [] dd" does, though, this really comes down to my
opinion versus yours.

> (Indeed even a normal stdio implementation of cat(1) should work with
> optimum I/O efficiency, provided it sets the i/o buffer size bigger
> than the pipe volume and you don't require the padding of the last
> block.)

I would hope that cat(1) would never hold data buffered, that as soon
as it read()s some data, it should write() it promptly, even if it sees
neither EOF nor more data soon.
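
For what it's worth, the behavior I'd hope for looks something like the
sketch below.  It is not taken from any actual cat(1) source;
prompt_cat() and the 8192-byte buffer size are arbitrary choices made
for the example.

```c
#include <assert.h>
#include <unistd.h>

/* Forward whatever each read() returns straight away, without waiting
 * for the buffer to fill.  Returns 0 at EOF, -1 on error. */
int prompt_cat(int in, int out)
{
    char buf[8192];

    assert(in >= 0 && out >= 0);
    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        ssize_t off = 0;

        if (n < 0)
            return -1;
        if (n == 0)
            return 0;
        /* write() may accept only part of the data; finish it off
         * before going back to read() for more. */
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w <= 0)
                return -1;
            off += w;
        }
    }
}
```

The buffer size only caps how much one read() can return; it never
delays output, which is the opposite of what a reblocking tool wants.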

					der Mouse

			    mouse@collatz.mcrcim.mcgill.edu