Subject: Re: Request for comments: sharing memory between user/kernel space
To: Zeljko Vrba <zvrba@globalnet.hr>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 03/20/2007 15:48:19

On Tue, Mar 20, 2007 at 09:00:21PM +0100, Zeljko Vrba wrote:
> Hi! I was thinking to employ the following scheme for user-kernel
> communication.  The scheme is targeted towards writing highly-concurrent,
> asynchronous applications.  Roughly, the kernel would expose to
> applications an interface similar to the DMA ring buffers/IRQ scheme used by
> much of PCI hardware.  You might also be reminded of IO-Lite.
>
> I want to write a device that supports only mmap() and ioctl() operations.
> When loaded, the device would allocate and lock (I believe the BSD's correct
> term is "wire"?) a contiguous[1] chunk of physical memory.  The chunk's size
> would be specified at the module load time, and to keep things simple,
> currently I do not have in mind resizing.  A process[2] may mmap() the region
> in its virtual address space as read/write.  This establishes the shared
> memory area between the processes and the kernel.

Actually, what are you really trying to do? You've dived into a number of
implementation details, and I think there are other ways to do what you
really want.

> [1] I intend to use this area for DMA to/from hardware.  I'm not sure whether
>     contiguity is required, but it seems that it might make life simpler.
>     (Otherwise, what is contiguous in virtual memory might not be contiguous
>     in physical, so how to DMA such data?  Does the kernel already handle this?)

The kernel has the ability to deal w/ DMA to non-contiguous physical
memory.
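
For concreteness, a rough sketch of how a NetBSD driver hands a physically
scattered kernel buffer to hardware via bus_dma(9); "struct mydev_softc",
its sc_dmat tag, and program_sg_entry() are placeholders for whatever the
real driver provides, and cleanup/unload is omitted:

/*
 * Sketch only: load a wired kernel buffer for DMA.  bus_dmamap_load()
 * breaks it into physically contiguous segments, so the buffer itself
 * does not need to be physically contiguous.
 */
#include <machine/bus.h>

int
load_for_dma(struct mydev_softc *sc, void *buf, bus_size_t len)
{
        bus_dmamap_t map;
        int i, error;

        error = bus_dmamap_create(sc->sc_dmat, len, 16, len, 0,
            BUS_DMA_WAITOK, &map);
        if (error)
                return error;

        error = bus_dmamap_load(sc->sc_dmat, map, buf, len, NULL,
            BUS_DMA_WAITOK);
        if (error)
                return error;

        /* Each segment is a physically contiguous run; hand them to the
         * device's scatter/gather descriptors (placeholder helper). */
        for (i = 0; i < map->dm_nsegs; i++)
                program_sg_entry(sc, i, map->dm_segs[i].ds_addr,
                    map->dm_segs[i].ds_len);

        bus_dmamap_sync(sc->sc_dmat, map, 0, len, BUS_DMASYNC_PREWRITE);
        return 0;
}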

> [2] It might be wise to restrict this functionality to only eg. root-owned
>     processes.
>
> I want to use this area for all network and raw disk I/O to/from the
> application.  Use cases:

As above, why?

> 1. Application wants to write data to disk[3]/network.  The application uses
> ioctl() to allocate a chunk of memory from the shared memory.  It fills the
> chunk with data, and calls writev() on the appropriate file descriptor to send
> the data to disk/network.  writev() would be non-blocking, and return
> immediately.  When the transfer is complete, the kernel sends a completion
> notification[4] to the application.  The application is now free to reuse the
> buffer or free it by ioctl().
>
> The writev() call would do its usual function, minus data copying/VM
> manipulations.

You're mostly describing aio. Why not just do aio?
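
For comparison, here's a minimal POSIX aio sketch of use case 1; fd, buf,
len and off stand for whatever the application set up, and SIGUSR1 is an
arbitrary choice of completion signal:

#include <sys/types.h>
#include <aio.h>
#include <err.h>
#include <signal.h>
#include <string.h>

static struct aiocb cb;

/* Queue one asynchronous write; the call returns immediately and the
 * kernel raises SIGUSR1 when the transfer has completed. */
void
queue_write(int fd, void *buf, size_t len, off_t off)
{
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = off;
        cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
        cb.aio_sigevent.sigev_signo = SIGUSR1;
        if (aio_write(&cb) == -1)
                err(1, "aio_write");
        /* Once the signal arrives, aio_error(&cb) is 0, aio_return(&cb)
         * gives the byte count, and the buffer may be reused. */
}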

> [3] I would drop the issue of synchronization with FS buffers for now.  This
>     would be used to access only raw disk devices.
>
> [4] I thought of using real-time signals.  They are a) queued, b) can carry
>     additional info.  I'm still thinking about a way to "tag" individual
>     readv()/writev() requests (or even individual elements of the iovec
>     array), since the file descriptor is not enough to identify completion of
>     one of multiple operations.

As above, this is almost aio. Why reinvent the wheel?
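
On the tagging point: each aiocb's sigevent can carry a user-chosen value
that comes back to an SA_SIGINFO handler, so completions of many outstanding
requests on one descriptor are easy to tell apart. A sketch (handle_done()
is a hypothetical application callback; the request is queued as in the
sketch above, with cb.aio_sigevent.sigev_value.sival_ptr = &cb):

#include <sys/types.h>
#include <aio.h>
#include <signal.h>

extern void handle_done(struct aiocb *, ssize_t);   /* hypothetical */

/* Installed with sigaction() and SA_SIGINFO; si_value is whatever the
 * submitter stored in sigev_value, here a pointer to the request itself. */
static void
on_completion(int sig, siginfo_t *info, void *ctx)
{
        struct aiocb *done = info->si_value.sival_ptr;

        if (aio_error(done) == 0)
                handle_done(done, aio_return(done));
}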

>     "Completion" would be protocol-specific.  In case of eg. TCP connection,
>     this would correspond to ACK received from the other end.
>
> 2. Application wants to read data from disk.  Similarly to 1., it allocates
> buffers by ioctl(), fills the iovec with addresses of buffers, and calls
> readv() which would also return immediately.  The application would receive a
> real-time signal on completion.  The intent is to read data from disk directly
> into the application-specified buffer.  [Although mmap() offers similar
> functionality, access to an mmap()'d file may block, and there's no
> completion notification.]
>
> 3. Receiving data from the network.  It can be arranged similarly to the case
> with disk.  If the application isn't expecting to read any data (= no buffers
> allocated, no readv() issued) the kernel would just drop packets destined[7]
> for it.  An alternative is to modify[5] the readv() call as follows:
>   - on input, ignore the contents of the iovec argument, and interpret the
>     count as the number of free places in the iovec array
>   - when some data arrives, fill the iovec array and send a signal to the
>     application
>   - the kernel would allocate network buffers in the shared area, in the usual
>     way (as is done now), and drop packets if it can't allocate memory
>
> [5] It might be best not to overload writev/readv with different semantics,
>     but to just use ioctl() for everything.
>
> [7] How to decide what is destined for the application?  How hard would it be
>     to integrate this with the packet filter?

I think you should look at the RDMA protocols and protocols that use them.

Others have already solved these problems. :-) And there are a lot of
details to get right.

> Linux with its splice/vmsplice/tee system calls comes close to this, but if I
> haven't overlooked something, they have problems splicing sockets to sockets,
> the mechanism isn't 0-copy when receiving data from the network, and they
> still lack completion signalling in the case of vmsplice [so that the
> application knows when it's safe to modify the data].
>
> I'm aware of potential issues such as possible data corruption if the
> application modifies data before it has been written, possible security
> problems if the kernel's "control data" has to be allocated together
> with the bulk data in the shared region, etc.  The first case is plainly
> an incorrectly written application; the 2nd case I'm willing to ignore for
> the moment.
>
> So, what I'm interested in is:
>
>   - is it possible at all in NetBSD?

No. Not now.

>   - would it require a comprehensive rewrite of IO/NET/VM subsystems (= is
>     it feasible to do in a time-frame of ca. 3-4 months[6])

I doubt all of this could be done in 4-6 mo.

The raw disk interface is available, so aio on raw disk devices would
work.
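
Roughly what that could look like with lio_listio(2), which batches several
requests much like the proposed readv(); the device name, transfer sizes,
and offsets below are only placeholders:

#include <sys/types.h>
#include <aio.h>
#include <err.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

#define NREQ    4
#define BUFSZ   65536           /* multiple of the sector size */

int
main(void)
{
        static struct aiocb cbs[NREQ];
        struct aiocb *list[NREQ];
        struct sigevent sev;
        int fd, i;

        /* /dev/rwd0d is just an example raw disk device. */
        fd = open("/dev/rwd0d", O_RDONLY);
        if (fd == -1)
                err(1, "open");

        for (i = 0; i < NREQ; i++) {
                memset(&cbs[i], 0, sizeof(cbs[i]));
                cbs[i].aio_fildes = fd;
                cbs[i].aio_buf = malloc(BUFSZ);
                cbs[i].aio_nbytes = BUFSZ;
                cbs[i].aio_offset = (off_t)i * BUFSZ;
                cbs[i].aio_lio_opcode = LIO_READ;
                list[i] = &cbs[i];
        }

        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGUSR1;      /* one signal for the whole batch */

        if (lio_listio(LIO_NOWAIT, list, NREQ, &sev) == -1)
                err(1, "lio_listio");

        /* ... wait for SIGUSR1, then check aio_error()/aio_return()
         * on each request ... */
        return 0;
}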

>   - has it already been tried (& failed :))
>   - criticisms of the idea :)

Some things are already available. For instance, we have zero-copy TCP
write using the vm system. And the vm system method is set up so that
writing to the data while it's piped-out for transmit will do the right
thing (trigger a COW).

> [6] I consider myself an expert C programmer, but I haven't coded anything for
> the NetBSD kernel yet (I did browse sources though; they seem much better
> organized, "cleaner" and better-documented than Linux sources, so I'd prefer
> to code this for NetBSD).  If I embark on this project, I'd rely on the help
> of the experts on the list when I get stuck.

My main suggestion is to step back and talk about what you really want to
do. There may well be easier ways to hook everything together to get there
both more quickly and in a more-reusable manner.

Take care,

Bill
