Subject: Re: Request for comments: sharing memory between user/kernel space
To: Zeljko Vrba <zvrba@globalnet.hr>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 03/21/2007 08:02:36

On Wed, Mar 21, 2007 at 03:57:51PM +0100, Zeljko Vrba wrote:
> Allen Briggs <briggs@netbsd.org> writes:
> >
> > 0-copy TCP receive is somewhat problematic.  Usually, you want a stream
> > of data, but that data is broken up on the wire into ethernet frames
> > that have essentially arbitrary headers and for which the payload is
> > rarely, if ever, page-sized and page-aligned (even if you're using
> > jumbo frames, cleverly aligned), and they can come in out of order.
> > And interspersed in the TCP stream are other packets--other TCP packets,
> > UDP packets, non-IP, etc.
> >
> I'm aware of these problems.  But your 'usually' does not apply to my
> case: I'm willing to give up on the stream abstraction at the application
> level.  The kernel currently already does everything at the packet level,
> and copies data to a contiguous user buffer.  Why not just return to the
> user level an array of iovecs with proper <pointer,length> pairs referring
> to the TCP data payload?  It's the work that the kernel does anyway; I'd
> 'just' (quotes since I have no clue about the implementation complexity)
> like to replace the data copying part (which preserves the stream
> abstraction) with returning the iovec array, where the pointers point into
> application-accessible memory (which breaks the stream abstraction, and I
> don't care about that :))
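
If I'm reading you right, the call you have in mind would look roughly
like this (every name here is made up; it's only to pin down the shape
of the interface):

    #include <sys/uio.h>
    #include <stdio.h>

    /*
     * Hypothetical call: instead of copying TCP payload into one flat
     * buffer, the kernel fills a caller-supplied array of
     * <pointer,length> pairs that point at payload in memory the
     * application can already read.
     */
    struct recv_segs {
            struct iovec    *rs_iov;        /* caller-supplied iovec array */
            int              rs_niov;       /* in: capacity, out: entries filled */
    };

    int     recv_segments(int sock, struct recv_segs *rs);  /* illustrative only */

    /* the application would then walk the result in place: */
    void
    consume(int sock)
    {
            struct iovec iov[32];
            struct recv_segs rs = { iov, 32 };
            int i;

            if (recv_segments(sock, &rs) == 0)
                    for (i = 0; i < rs.rs_niov; i++)
                            fwrite(iov[i].iov_base, 1, iov[i].iov_len, stdout);
    }

That's how I understand the proposal, anyway.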

Well, one thing you could do is add an ioctl on a socket that passes in
memory buffers. Then change the socket code so that, on your special
sockets, data are copied to these pre-allocated buffers rather than a
socket buffer. That will get rid of one copy and would be rather clean.

If there are no buffers, discard the TCP packets and don't ack the range,
and TCP will still work.
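
In very rough strokes (none of these names exist; this is only to show
the shape of the registration):

    #include <sys/types.h>
    #include <sys/ioccom.h>

    /* a user-allocated pool of receive buffers, registered once per socket */
    struct sobuf_reg {
            void    *sbr_base;      /* start of the region */
            size_t   sbr_len;       /* total size of the region */
            size_t   sbr_chunk;     /* size of each buffer handed to the socket */
    };

    /* ioctl name and number are made up */
    #define SIOCSRCVBUFS    _IOW('s', 200, struct sobuf_reg)

    /*
     * Userland would do something like:
     *
     *      struct sobuf_reg reg = { base, len, chunk };
     *      ioctl(s, SIOCSRCVBUFS, &reg);
     *
     * and the socket/TCP input path for that socket would copy incoming
     * payload into the next free chunk instead of appending it to so_rcv.
     * With no chunk free, the segment is dropped unacked and the peer
     * retransmits, as above.
     */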

> > mbuf chains--which have the scatter/gather information you want for
> > this.
> >
> What is problematic about "exporting" mbuf chains, together with the
> associated data payload, to the user level?

Well, one main problem is that the mbufs aren't mapped into the app's
address space. And on some architectures, user apps and the kernel are in
different address spaces. So you'd have to map things in one way or
another. Further, you really don't want userland to be able to write to
the chains, so you'd need a r/o mapping for the headers and a r/w mapping
for the data.

One thing I'd thought of in the past (for an iSCSI target) was creating the
concept of an mbuf cookie. Userland would use a special call to gobble a
certain amount of data off of the receive queue for a socket, and those
mbufs would hang around in a process-level list. Userland would get back
an opaque cookie. The app then decides what it wants to do with the data,
then hands the cookie to a modified write call. The write call grabs the
mbufs indicated by the cookie and uses them as the data for a write to,
say, disk. Thus zero copy.
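
As a rough sketch (again, nothing here exists; the names are only
illustrative):

    #include <sys/types.h>
    #include <stdint.h>

    typedef uint64_t mbuf_cookie_t;         /* opaque handle to parked mbufs */

    /*
     * Gobble up to 'len' bytes of mbufs off the socket's receive queue,
     * park them on a per-process list, and return a cookie naming them.
     */
    int     recv_cookie(int sock, size_t len, mbuf_cookie_t *cookiep);

    /*
     * Write the data named by the cookie to another descriptor; the
     * kernel takes the mbufs off the process list and uses them directly
     * as the source, so the payload is never copied.
     */
    ssize_t write_cookie(int fd, mbuf_cookie_t cookie);

    /*
     * e.g. an iSCSI target's data path:
     *
     *      recv_cookie(conn_sock, datalen, &cookie);
     *      ... parse the headers, decide where the data goes ...
     *      write_cookie(disk_fd, cookie);
     */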

Take care.

Bill
