Subject: Re: Not beer, or why is the pipe so small?
To: Viktor Dukhovni <viktor@dukhovni.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 02/25/2003 12:18:12

On Tue, Feb 25, 2003 at 11:50:09AM -0500, Viktor Dukhovni wrote:
> On Tue, 25 Feb 2003, Thor Lancelot Simon wrote:
> 
> > I still regard what Postfix is trying to do as fundamentally broken.
> > However, it's interesting that the typical portable approach for dealing
> > with this problem on streams doesn't work:
> >
> > 1) Select for write
> > 2) Set descriptor non-blocking
> > 3) Write
> > 4) If EWOULDBLOCK returned, start over
> 
> A non-blocking descriptor that selects ready for write, must not return
> EWOULDBLOCK, it needs to have at least PIPE_BUF bytes free, and return

That's not generally true.  For *pipes*, it is true only due to the
special-casing of PIPE_BUF in the standard, which is IMHO stupid
and harmful.

I think we're looking at an example of _why_ it is harmful, in fact:
because what the programmer really needs is neither the usual stream
behaviour of "can I write at least 1 byte?" nor the standard's behaviour
of "can I write at least PIPE_BUF, which may vary from system to system?"
but rather "Can I write _this much_?" and thus programmers end up
making unwarranted assumptions about PIPE_BUF size.
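
For reference, the select/write/retry loop from the numbered list quoted
above can be sketched in C roughly as follows.  This is only an
illustration, not Postfix's code: write_all_nonblocking() is a
hypothetical name, and the sketch assumes fd is below FD_SETSIZE.  It
makes no assumption about PIPE_BUF; it accepts short writes and retries
on EWOULDBLOCK, which also means it can spin if select() keeps reporting
the descriptor writable while write() keeps refusing the data.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all len bytes to fd, which is switched to
 * non-blocking mode.  Waits in select() before each write attempt.
 * Returns 0 on success, -1 on error with errno set.
 */
static int
write_all_nonblocking(int fd, const char *buf, size_t len)
{
	int flags = fcntl(fd, F_GETFL, 0);

	/* Step 2 of the list: set the descriptor non-blocking. */
	if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -1;

	while (len > 0) {
		fd_set wfds;
		ssize_t n;

		/* Step 1: select for write (assumes fd < FD_SETSIZE). */
		FD_ZERO(&wfds);
		FD_SET(fd, &wfds);
		if (select(fd + 1, NULL, &wfds, NULL, NULL) == -1) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Step 3: write; the kernel may accept only part of it. */
		n = write(fd, buf, len);
		if (n == -1) {
			/* Step 4: if it would block, start over. */
			if (errno == EAGAIN || errno == EWOULDBLOCK ||
			    errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}
```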

> with at least the smaller of PIPE_BUF or the requested size written to the
> pipe.

> > The approach that should work with traditional 4BSD systems (including
> > NetBSD older than 1.6) that implement pipes using socketpairs is to
> > adjust the socket's low-water mark with setsockopt(), which will fail
> > harmlessly on systems where this is not supported.  I quote the
> > setsockopt manual page:
> >
> 
> Yes, I am looking for pipes that have something equivalent to a 4K
> low-water mark. Since the low-water mark for pipes (see above) is at least
> PIPE_BUF, I am by implication looking for PIPE_BUF >= 4K. It would suffice
> for PIPE_BUF to be 512 bytes, but for the (hardcoded in the kernel)
> low-water mark to be 4K.

From my point of view, since the standard permits PIPE_BUF to vary from
system to system, an application *must* tolerate a PIPE_BUF smaller than
4k -- or 256k, or 2k, or whatever.  We already provide a method for
adjusting the low-water mark: just set the appropriate socket option.
This will fail harmlessly on systems on which pipes aren't implemented
with sockets (e.g. FreeBSD, OpenBSD, NetBSD >= 1.6), so no harm done.
The application still has to function correctly regardless of the value
of PIPE_BUF; but if it is written in such a way as to require PIPE_BUF
to be larger in order to function efficiently, we already provide that.
I don't see what the problem is.  The technique is hardly NetBSD-specific,
in any case; it is applicable to any system using the 4BSD implementation
of pipes.
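
A minimal sketch of what I mean, in C.  The helper names
try_set_send_lowat() and pipe_buf_for() are made up for illustration,
and 4096 in the usage note is just the figure from this thread.  The
setsockopt() call is expected to fail (typically with ENOTSOCK) where
pipes are not sockets, and the caller simply ignores that; fpathconf()
reports the PIPE_BUF the application has to live with either way.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: request a send low-water mark of lowat bytes on
 * the write side of a pipe.  Returns 0 if the kernel accepted it, -1
 * otherwise.  Failure is expected and harmless on systems whose pipes
 * are not built on sockets, so callers should not treat it as fatal.
 */
static int
try_set_send_lowat(int fd, int lowat)
{
	return setsockopt(fd, SOL_SOCKET, SO_SNDLOWAT,
	    &lowat, sizeof(lowat));
}

/*
 * Hypothetical helper: ask the system what PIPE_BUF is for this
 * descriptor instead of assuming a value at compile time.
 * Returns -1 if the limit is indeterminate or on error.
 */
static long
pipe_buf_for(int fd)
{
	return fpathconf(fd, _PC_PIPE_BUF);
}
```

An application would call try_set_send_lowat(fd, 4096) once after
creating the pipe, discard the result, and still cope with whatever
pipe_buf_for(fd) reports.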

Things are somewhat different with the pipe code that uses the VM
system to flip pages.  At the very least, however, the setsockopt() will
be harmless.  I'll look into this a bit and see if I can come up with
a way for Postfix to guarantee what it needs in order to operate
efficiently.

Thor