Subject: Re: zbufs for NetBSD
To: David Laight <david@l8s.co.uk>
From: Jason R Thorpe <thorpej@wasabisystems.com>
List: tech-kern
Date: 08/22/2002 16:42:12
On Fri, Aug 23, 2002 at 12:17:37AM +0100, David Laight wrote:

 > On Thu, Aug 22, 2002 at 01:04:37PM -0600, kyle.unice@L-3com.com wrote:
 > > I imagine that someone previously has looked into putting zbuf (zero-copy
 > > mbufs) support into NetBSD.  I am interested in knowing what the state of
 > > the project is.   A search of the net reveals little except for VxWorks
 > > support for zbufs.  
 > > 
 > > I would think that the MMU, socket syscalls, and mbuf code would need to be
 > > modified.  The upside is that zbufs provide a network performance advantage.

Hm, zero-copy mbufs.  If I understand you correctly, NetBSD has supported
those for a long time, really.

If an mbuf has "external storage" associated with it, a "copy" from
mbuf a to mbuf b merely causes mbuf b to take a reference to that
external storage.
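
A rough userland sketch of that sharing, for illustration only.  The names
(ext_storage, toy_mbuf, toy_mbuf_copy) are hypothetical and this is not the
kernel's struct mbuf code; it only shows that a "copy" bumps a reference
count instead of moving bytes:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified types: NOT the NetBSD struct mbuf. */
struct ext_storage {
    char  *data;    /* the shared payload */
    size_t len;
    int    refcnt;  /* number of toy_mbufs pointing at this storage */
};

struct toy_mbuf {
    struct ext_storage *ext;
};

/* Give `m` freshly allocated external storage holding a copy of `src`. */
static int toy_mbuf_init(struct toy_mbuf *m, const char *src, size_t len)
{
    struct ext_storage *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->data = malloc(len);
    if (e->data == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->data, src, len);
    e->len = len;
    e->refcnt = 1;
    m->ext = e;
    return 0;
}

/* "Copy" a to b: no payload bytes move, b just takes a reference. */
static void toy_mbuf_copy(struct toy_mbuf *b, const struct toy_mbuf *a)
{
    b->ext = a->ext;
    b->ext->refcnt++;
}

/* Drop one reference; the storage is freed when the last one goes away. */
static void toy_mbuf_free(struct toy_mbuf *m)
{
    assert(m->ext != NULL && m->ext->refcnt > 0);
    if (--m->ext->refcnt == 0) {
        free(m->ext->data);
        free(m->ext);
    }
    m->ext = NULL;
}
```

After toy_mbuf_copy(&b, &a), both mbufs point at the same storage and the
count is 2; freeing a leaves b's data intact.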

 > For a unix system careful use of page loaning can help - but only
 > if the process side doesn't write into the loaned page (because that
 > would require a copy-on-write allocation which would end up being more
 > expensive than the original copy).

Yes.  And, in -current, NetBSD now uses "zero-copy mbufs"/page loaning
by default for writes >= 8k to a socket.

Yes, for this to be a major performance win, you need to either use
async i/o of some sort or, as you say, transmit an mmap'd file.
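
As a sketch of the mmap'd-file case, assuming only standard POSIX calls: the
helper below (send_mapped_file is a hypothetical name) maps a file read-only
and write()s it to a descriptor.  Whether the kernel actually loans the pages
rather than copying them depends on the kernel version, its configuration,
and the size of each write; nothing in this code requests or guarantees it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the whole file at `path` to descriptor `out`
 * (e.g. a connected socket) straight from a read-only mapping.
 * Returns 0 on success, -1 on error.
 */
static int send_mapped_file(int out, const char *path)
{
    struct stat st;
    size_t len, off = 0;
    char *p;
    int fd, rc = 0;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    len = (size_t)st.st_size;

    /* The process never writes to the mapping, so no COW fault can occur. */
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    while (off < len) {
        ssize_t n = write(out, p + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data moved: retry */
            rc = -1;
            break;
        }
        off += (size_t)n;   /* short write: continue from where we stopped */
    }

    munmap(p, len);
    return rc;
}
```

Because the mapping is PROT_READ, the process side cannot scribble on the
pages while they are in flight, which is the condition David describes.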

Zero-copy receive is somewhat harder -- the data comes in chunks of less
than one page (more or less), and so you HAVE to copy the data a little
to coalesce it into nice page-sized/page-aligned pieces.  However, once
that is done, you could certainly page-flip if given a nice page-sized/
page-aligned buffer for the receive.  The threshold at which this pays off
is something that needs to be researched (once it's implemented, obviously :-).
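
The coalescing step might look roughly like this in userland terms.  The
page_coalescer type and pc_* functions are hypothetical names made up for
illustration, not anything in the NetBSD tree, and the page-flip itself is
not shown:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical staging buffer: one page-sized, page-aligned page. */
struct page_coalescer {
    char  *page;
    size_t pagesz;
    size_t fill;    /* bytes staged so far */
};

static int pc_init(struct page_coalescer *pc)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *p;

    if (ps <= 0)
        return -1;
    /* Page alignment is what would make a later page-flip possible. */
    if (posix_memalign(&p, (size_t)ps, (size_t)ps) != 0)
        return -1;
    pc->page = p;
    pc->pagesz = (size_t)ps;
    pc->fill = 0;
    return 0;
}

/*
 * Copy as much of one sub-page chunk as still fits.  Returns the number of
 * bytes consumed; anything left over belongs in the next page.
 */
static size_t pc_append(struct page_coalescer *pc, const void *chunk,
                        size_t len)
{
    size_t room = pc->pagesz - pc->fill;
    size_t n = len < room ? len : room;

    memcpy(pc->page + pc->fill, chunk, n);
    pc->fill += n;
    return n;
}

/* True once the page is full and could be handed off whole. */
static int pc_full(const struct page_coalescer *pc)
{
    return pc->fill == pc->pagesz;
}
```

This is the unavoidable "copy the data a little" part: each incoming chunk is
copied once into the staging page, and only full pages are candidates for
flipping into the receiver's buffer.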

But, in any case, for sending, we're there today.

 > The big gain from page loaning probably comes with an mmap'd file - since
 > the data can be transmitted without ever getting into the cpu cache (and
 > displacing other useful data).  Hardware checksum calculation will
 > make a much bigger difference here...
 > 
 > 	David
 > 
 > -- 
 > David Laight: david@l8s.co.uk

-- 
        -- Jason R. Thorpe <thorpej@wasabisystems.com>