Subject: Re: zbufs for NetBSD
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: David Laight <david@l8s.co.uk>
List: tech-kern
Date: 08/23/2002 09:43:32
On Thu, Aug 22, 2002 at 04:42:12PM -0700, Jason R Thorpe wrote:
> On Fri, Aug 23, 2002 at 12:17:37AM +0100, David Laight wrote:
> 
>  > On Thu, Aug 22, 2002 at 01:04:37PM -0600, kyle.unice@L-3com.com wrote:
>  > > I imagine that someone previously has looked into putting zbuf (zero-copy
>  > > mbufs) support into NetBSD.  I am interested in knowing what the state of
>  > > the project is.   A search of the net reveals little except for VxWorks
>  > > support for zbufs.  
>  > > 
>  > > I would think that the MMU, socket syscalls, and mbuf code would need to be
>  > > modified.  The upside is that zbufs provide a network performance advantage.
> 
> Hm, zero-copy mbufs.  If I understand you correctly, NetBSD has supported
> those for a long time, really.
> 
> If an mbuf has "external storage" associated with it, a "copy" from
> mbuf a to mbuf b merely causes mbuf b to take a reference to that
> external storage.

Yes - but the zbuf interface is exposed to 'userspace'; making that
work on a unix system would be too inefficient.

>  > For a unix system careful use of page loaning can help - but only
>  > if the process side doesn't write into the loaned page (because that
>  > would require a copy-on-write allocation which would end up being more
>  > expensive than the original copy).
> 
> Yes.  And, in -current, NetBSD now uses "zero-copy mbufs"/page loaning
> by default for writes >= 8k to a socket.
> 
> Yes, for this to be a major performance win, you need to either use
> async i/o of some sort or, as you say, transmit an mmap'd file.
> 
> Zero-copy receive is somewhat harder -- the data comes in chunks of less
> than one page (more or less), and so you HAVE to copy the data a little
> to coalesce it into nice page-sized/page-aligned pieces.  However, once
> that is done, you could certainly page-flip if given a nice page-size/page-
> aligned buffer for the receive.  The threshold of where this has a payoff
> is something that needs to be researched (once it's implemented, obviously :-)

I did wonder about doing a 'page exchange' for reads.  IIRC malloc
will give you page aligned memory if you malloc an 8k buffer.
Reading 8k (from an aligned source) could just give you a COW
version of the associated page (assuming it can be mapped to
the correct virtual address).  This could work for files.
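
A rough sketch of the alignment half of this (not the page exchange
itself, which would need kernel support): rather than relying on what
malloc happens to do for an 8k request, a program could ask for page
aligned memory explicitly.  The helper names below are made up for
illustration; posix_memalign() and sysconf(_SC_PAGESIZE) are standard
POSIX calls.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if p lies on a page boundary. */
static int is_page_aligned(const void *p)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    return pagesize > 0 && ((uintptr_t)p % (uintptr_t)pagesize) == 0;
}

/*
 * Hypothetical helper: allocate len bytes, rounded up to a whole
 * number of pages, starting on a page boundary.  Returns NULL on
 * failure.  Release the buffer with free().
 */
static void *alloc_page_aligned(size_t len)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t ps, rounded;
    void *p = NULL;

    if (pagesize <= 0)
        return NULL;
    ps = (size_t)pagesize;
    rounded = (len + ps - 1) / ps * ps;   /* round up to whole pages */
    if (posix_memalign(&p, ps, rounded) != 0)
        return NULL;
    return p;
}
```

A reader would call alloc_page_aligned(8192) once and pass the result
to read(); whether the kernel could then exchange pages instead of
copying is exactly the open question above.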

For writes to files you would need a flag to say that the
program didn't care about the buffer contents after the write
completes (so you can give it a spare page of zeros, or a
page the program had previously written to a file, i.e. stale data
that isn't a security problem).  This could be used for stdio
writes?

For pipes (is there some page loaning there?) it might work best
if the writer triple buffers data.  Then the reader will free
the 1st page (by swapping it for the 2nd) before the write of
the 3rd completes and the writer starts refilling the 1st page.
Again stdio could do this...
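
To make the ordering concrete, here is a small userspace model of the
three-page rotation.  It only tracks who owns each page; no real page
swapping happens, and all the names (triple_buf, tb_write, tb_read)
are invented for this sketch.  It assumes the reader keeps the page it
last read mapped until it swaps it for the next one.

```c
#include <assert.h>

#define NPAGES 3

enum page_state {
    PAGE_FREE,      /* owned by the writer, may be refilled */
    PAGE_QUEUED,    /* written, sitting in the pipe */
    PAGE_HELD       /* currently mapped in the reader */
};

struct triple_buf {
    enum page_state state[NPAGES];
    int next_write;     /* page the writer fills next */
    int next_read;      /* page the reader takes next */
    int held;           /* page the reader holds, or -1 */
};

static void tb_init(struct triple_buf *tb)
{
    int i;

    for (i = 0; i < NPAGES; i++)
        tb->state[i] = PAGE_FREE;
    tb->next_write = 0;
    tb->next_read = 0;
    tb->held = -1;
}

/* Writer hands over its next page; -1 means it would have to wait. */
static int tb_write(struct triple_buf *tb)
{
    int page = tb->next_write;

    if (tb->state[page] != PAGE_FREE)
        return -1;
    tb->state[page] = PAGE_QUEUED;
    tb->next_write = (page + 1) % NPAGES;
    return page;
}

/*
 * Reader takes the oldest queued page and, in exchange, gives back
 * the page it was holding.  Returns -1 if nothing is queued.
 */
static int tb_read(struct triple_buf *tb)
{
    int page = tb->next_read;

    if (tb->state[page] != PAGE_QUEUED)
        return -1;
    if (tb->held >= 0)
        tb->state[tb->held] = PAGE_FREE;
    tb->state[page] = PAGE_HELD;
    tb->held = page;
    tb->next_read = (page + 1) % NPAGES;
    return page;
}
```

In this model the writer can only reuse page 0 after the reader has
swapped it for page 1, which is why two pages would not be enough to
keep both sides busy.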

	David

-- 
David Laight: david@l8s.co.uk