Subject: Re: Request for comments: sharing memory between user/kernel space
To: Allen Briggs <briggs@netbsd.org>
From: None <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 03/21/2007 08:25:31
In message <20070321150331.GC11760@canolog.ninthwonder.com>,
Allen Briggs writes:

>On Wed, Mar 21, 2007 at 03:57:51PM +0100, Zeljko Vrba wrote:
>> I'm aware of these problems.  But your 'usually' does not apply to my
>> case: I'm willing to give up on the stream abstraction on the
>> application level.  The kernel currently already does everything on
>> the packet level, and copies data to a contiguous user buffer.  Why
>> not just return to the user level an array of iovecs with proper
>> <pointer,length> pairs referring to the TCP data payload?  It's the
>> work that the kernel does anyway, I'd 'just' (quotes since
>
>Sort of.  The main problem I can think of is that for this to work,
>I think you'd have to have essentially all mbufs in memory that is
>accessible from both the kernel and user space.  So your process gets
>to see all network traffic.  Is that OK with you?
>
>Obviously, this is not something that's general-purpose.

A general-purpose implementation is doable.  One just has to allocate
an entire physical page to each mbuf or mbuf cluster, and zero out
"stale" data when freeing any mbuf or cluster.  That way the kernel
never leaks any data *NOT* from the TCP stream in question into the
user application's address space.

Given the zeroing, I'm not sure it'd be worth the cost.
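
For concreteness, here's a minimal sketch of the zero-on-free idea,
assuming a hypothetical per-page cluster pool.  The names below
(shared_cluster, shared_cluster_free) are invented for illustration,
not the actual NetBSD mbuf code:

    /*
     * Hypothetical zero-on-free for page-backed mbuf clusters.
     */
    #include <string.h>

    #define SHARED_PAGE_SIZE 4096

    struct shared_cluster {
        char data[SHARED_PAGE_SIZE];  /* one whole physical page */
    };

    static void
    shared_cluster_free(struct shared_cluster *c)
    {
        /*
         * Scrub the page before it goes back to the user-mappable
         * pool, so a later mapping never exposes stale data from
         * some other connection.
         */
        memset(c->data, 0, sizeof(c->data));
        /* ... return the page to the shared pool ... */
    }

The memset per freed page is exactly the cost I'm skeptical about
above.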

>I think just the address space differences.  Translation, mapping, and
>hand-off.  It would be really easy, I think, for the user application
>to crash or at least starve the kernel.

One needs to extend the API so that the userspace process and the
kernel both agree on when the pages returned as an iov can be unmapped
from the user address space (and thus reclaimed).  That agreement can
be implicit: for example, a well-defined per-socket limit giving
"ring-buffer"-style semantics for the underlying memory.

It's been done, but it's easier to do if one gives up on strict
Ethernet framing or inserts a shim header.  You really want the NIC to
have a separate DMA pool for each "socket"-level connection, to
(amongst other issues) avoid the leakage problem.  That obviously means
the NIC has to be able to demux each frame to a connection-specific
buffer pool.  Doable, but not yet (AFAIK) in off-the-shelf NICs, with
the exception of full-blown TCP offload engines (TOEs), as done by
iSCSI offload "HBAs" or RDMA-capable NICs.
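
For illustration, the demux step such a NIC (or a driver emulating it
in software) would need: map each frame's TCP/IP 4-tuple to that
connection's pool.  The structures and the toy hash are assumptions
for the sketch; real TOE/RDMA hardware does this in silicon:

    #include <stdint.h>
    #include <stddef.h>

    struct flow_key {
        uint32_t saddr, daddr;     /* IPv4 source/destination */
        uint16_t sport, dport;     /* TCP ports */
    };

    struct buf_pool;               /* connection-specific DMA pool */

    #define NFLOWS 256

    static struct {
        struct flow_key key;
        struct buf_pool *pool;
    } flow_table[NFLOWS];

    /* Toy hash, for illustration only. */
    static unsigned
    flow_hash(const struct flow_key *k)
    {
        return (k->saddr ^ k->daddr ^ k->sport ^ k->dport) % NFLOWS;
    }

    /* Return the connection's pool, or NULL to fall back to the
     * shared default pool (where the leakage concern remains).
     * Collision handling is omitted. */
    static struct buf_pool *
    flow_demux(const struct flow_key *k)
    {
        unsigned i = flow_hash(k);

        if (flow_table[i].pool != NULL &&
            flow_table[i].key.saddr == k->saddr &&
            flow_table[i].key.daddr == k->daddr &&
            flow_table[i].key.sport == k->sport &&
            flow_table[i].key.dport == k->dport)
            return flow_table[i].pool;
        return NULL;
    }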