Subject: Re: Request for comments: sharing memory between user/kernel space
To: Zeljko Vrba <zvrba@globalnet.hr>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 03/20/2007 15:48:19
On Tue, Mar 20, 2007 at 09:00:21PM +0100, Zeljko Vrba wrote:
>
> Hi! I was thinking of employing the following scheme for user-kernel
> communication. The scheme is targeted towards writing highly concurrent,
> asynchronous applications. Roughly, the kernel would expose to
> applications an interface similar to the DMA ring buffer/IRQ scheme used
> by much of PCI hardware. You might also be reminded of IO-Lite.
>
> I want to write a device that supports only mmap() and ioctl() operations.
> When loaded, the device would allocate and lock (I believe the correct BSD
> term is "wire"?) a contiguous[1] chunk of physical memory. The chunk's
> size would be specified at module load time; to keep things simple, I am
> not considering resizing for now. A process[2] may mmap() the region into
> its virtual address space as read/write. This establishes the shared
> memory area between the processes and the kernel.
Actually, what are you really trying to do? You've dived into a number of
implementation details, and I think there are other ways to do what you
really want.
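(For concreteness, here is roughly what the userland side of the scheme
you describe would look like. This is only a sketch: /dev/shbuf and the
SHBUF_* ioctls are names I made up, not an existing NetBSD interface.)

/*
 * Userland sketch: map the wired region and carve a chunk from it.
 * The device name and ioctl numbers are hypothetical.
 */
#include <sys/ioctl.h>
#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define SHBUF_SIZE	(4 * 1024 * 1024)	/* region size, fixed at load */
#define SHBUF_ALLOC	_IOWR('S', 1, size_t)	/* in: chunk size, out: offset */
#define SHBUF_FREE	_IOW('S', 2, size_t)	/* in: chunk offset */

int
main(void)
{
	size_t off = 65536;			/* request a 64 KB chunk */
	char *base, *chunk;
	int fd;

	if ((fd = open("/dev/shbuf", O_RDWR)) == -1)
		err(1, "open");

	/* Map the wired region shared between the kernel and the process. */
	base = mmap(NULL, SHBUF_SIZE, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		err(1, "mmap");

	/* Allocate a chunk from the region; the kernel returns its offset. */
	if (ioctl(fd, SHBUF_ALLOC, &off) == -1)
		err(1, "ioctl");
	chunk = base + off;

	/* ... fill chunk, writev() it, wait for the completion signal ... */

	if (ioctl(fd, SHBUF_FREE, &off) == -1)
		err(1, "ioctl");
	return 0;
}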
> [1] I intend to use this area for DMA to/from hardware. I'm not sure
>     whether contiguity is required, but it seems that it might make life
>     simpler. (Otherwise, what is contiguous in virtual memory might not
>     be contiguous in physical memory, so how would one DMA such data?
>     Does the kernel already handle this?)
The kernel can already handle DMA to non-contiguous physical memory; the
bus_dma(9) interface translates a buffer into a list of physical segments
for scatter/gather transfers.
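A rough kernel-side sketch of that, not a complete driver
(program_descriptor() is a stand-in for whatever your device's descriptor
setup would actually be):

/*
 * Load a possibly non-contiguous buffer for scatter/gather DMA.
 */
#include <sys/param.h>
#include <machine/bus.h>		/* <sys/bus.h> on newer systems */

#define MAXSEGS	16

extern void program_descriptor(bus_addr_t, bus_size_t);	/* hypothetical */

int
example_dma_load(bus_dma_tag_t dmat, void *buf, bus_size_t len)
{
	bus_dmamap_t map;
	int error, i;

	/* A map that may describe up to MAXSEGS physical segments. */
	error = bus_dmamap_create(dmat, len, MAXSEGS, len, 0,
	    BUS_DMA_NOWAIT, &map);
	if (error)
		return error;

	/* Translate the buffer into physically contiguous segments. */
	error = bus_dmamap_load(dmat, map, buf, len, NULL, BUS_DMA_NOWAIT);
	if (error) {
		bus_dmamap_destroy(dmat, map);
		return error;
	}

	/* Flush CPU caches before the device reads the buffer. */
	bus_dmamap_sync(dmat, map, 0, len, BUS_DMASYNC_PREWRITE);

	/* Point the hardware at each contiguous piece, e.g. one
	 * DMA descriptor per segment. */
	for (i = 0; i < map->dm_nsegs; i++)
		program_descriptor(map->dm_segs[i].ds_addr,
		    map->dm_segs[i].ds_len);

	return 0;
}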
> [2] It might be wise to restrict this functionality to, e.g., root-owned
>     processes.
>
> I want to use this area for all network and raw disk I/O to/from the
> application. Use cases:
As above, why?
> 1. Application wants to write data to disk[3]/network. The application
> uses ioctl() to allocate a chunk of memory from the shared region. It
> fills the chunk with data and calls writev() on the appropriate file
> descriptor to send the data to disk/network. writev() would be
> non-blocking and return immediately. When the transfer is complete, the
> kernel sends a completion notification[4] to the application. The
> application is now free to reuse the buffer or free it by ioctl().
>
> The writev() call would do its usual function, minus data copying/VM
> manipulations.
You're mostly describing aio. Why not just do aio?
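For comparison, here is the submit side of your use case 1 done with
POSIX aio. The signal number and the use of sival_ptr as a per-request
tag are choices of this sketch, not requirements of the API:

#include <sys/types.h>

#include <aio.h>
#include <signal.h>
#include <string.h>

/*
 * Queue an asynchronous write and ask for a signal on completion.
 * SIGUSR1 is arbitrary; a real-time signal (SIGRTMIN+n) is preferable
 * where available, since those are queued.
 */
int
submit_write(int fd, void *buf, size_t len, off_t off)
{
	static struct aiocb cb;		/* must stay live until completion */

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;

	/* sival_ptr tags this particular request, so the handler can
	 * tell which of many outstanding operations completed. */
	cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
	cb.aio_sigevent.sigev_signo = SIGUSR1;
	cb.aio_sigevent.sigev_value.sival_ptr = &cb;

	return aio_write(&cb);		/* returns immediately */
}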
> [3] I would drop the issue of synchronization with FS buffers for now.
>     This would be used to access only raw disk devices.
>
> [4] I thought of using real-time signals. They are a) queued, and b) can
>     carry additional info. I'm still thinking about a way to "tag"
>     individual readv()/writev() requests (or even individual elements of
>     the iovec array), since the file descriptor alone is not enough to
>     identify which of multiple operations completed.
As above, this is almost aio. Why reinvent the wheel?
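The tagging problem in your [4] is exactly what aio's sigevent solves:
with SA_SIGINFO, the queued signal hands the tag back to you. A sketch of
the receive side, pairing with submit_write() above:

#include <sys/types.h>

#include <aio.h>
#include <signal.h>
#include <stddef.h>

/* The kernel delivers the sival_ptr we attached to the aiocb, which
 * identifies exactly which request completed. */
static void
on_completion(int sig, siginfo_t *si, void *ctx)
{
	struct aiocb *cb = si->si_value.sival_ptr;

	if (aio_error(cb) == 0) {
		ssize_t n = aio_return(cb);	/* bytes transferred */
		/* ... mark cb's buffer reusable or free it ... */
		(void)n;
	}
}

static void
install_handler(int signo)
{
	struct sigaction sa;

	sa.sa_sigaction = on_completion;
	sa.sa_flags = SA_SIGINFO;	/* required for si_value delivery */
	sigemptyset(&sa.sa_mask);
	sigaction(signo, &sa, NULL);
}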
> "Completion" would be protocol-specific. In case of eg. TCP connecti=
on,
> this would correspond to ACK received from the other end.
>
> 2. Application wants to read data from disk. Similarly to 1., it
> allocates buffers by ioctl(), fills the iovec with the buffers'
> addresses, and calls readv(), which would also return immediately. The
> application would receive a real-time signal on completion. The intent is
> to read data from disk directly into the application-specified buffer.
> [Although mmap() offers similar functionality, access to an mmap()'d file
> may block, and there's no completion notification.]
>
> 3. Receiving data from the network. It can be arranged similarly to the
> disk case. If the application isn't expecting to read any data (= no
> buffers allocated, no readv() issued), the kernel would just drop packets
> destined[7] for it. An alternative is to modify[5] the readv() call as
> follows:
> - on input, ignore the contents of the iovec argument, and interpret the
>   count as the number of free places in the iovec array
> - when some data arrives, fill the iovec array and send a signal to the
>   application
> - the kernel would allocate network buffers in the shared area, in the
>   usual way (as is done now), and drop packets if it can't allocate memory
>
> [5] It might be best not to overload writev()/readv() with different
>     semantics, but to just use ioctl() for everything.
>
> [7] How does one decide what is destined for the application? How hard
>     would it be to integrate this with the packet filter?
I think you should look at the RDMA protocols and protocols that use them.
Others have already solved these problems. :-) And there are a lot of
details to get right.
> Linux with its splice/vmsplice/tee system calls comes close to this, but
> if I haven't overlooked something, they have problems splicing sockets to
> sockets, the mechanism isn't zero-copy when receiving data from the
> network, and they still lack completion signalling in the case of
> vmsplice [so that the application knows when it's safe to modify the
> data].
>
> I'm aware of potential issues such as possible data corruption if the
> application modifies data before it has been written, possible security
> problems if the kernel's "control data" has to be allocated together with
> the bulk data in the shared region, etc. The first case is simply an
> incorrectly written application; the second I'm willing to ignore for the
> moment.
>
> So, what I'm interested in is:
>
> - is it possible at all in NetBSD?
No. Not now.
> - would it require a comprehensive rewrite of the I/O/net/VM subsystems
>   (= is it feasible to do in a time frame of ca. 3-4 months[6])?
I doubt all of this could be done in 4-6 months.
The raw disk interface is available, so aio on raw disk devices would
work.
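Putting the sketches above together against a raw device would look
something like this. /dev/rwd0d is just an example name (the raw
whole-disk partition letter varies by port), install_handler() is from
the earlier sketch, and submit_read() is the obvious aio_read()
counterpart of submit_write():

#include <err.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

int
main(void)
{
	static char buf[65536];		/* raw I/O wants sector alignment */
	int fd;

	install_handler(SIGUSR1);

	if ((fd = open("/dev/rwd0d", O_RDONLY)) == -1)
		err(1, "open");

	/* Queue the read; it returns at once, and the completion
	 * signal arrives later, carrying the request's tag. */
	if (submit_read(fd, buf, sizeof(buf), 0) == -1)
		err(1, "submit_read");

	pause();	/* real code would do useful work here instead */
	return 0;
}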
> - has it already been tried (& failed :))
> - criticisms of the idea :)
Some things are already available. For instance, we have zero-copy TCP
write using the VM system. And the VM system method is set up so that
writing to the data while it's piped out for transmit will do the right
thing (trigger a COW).
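From userland, nothing special is needed to hit that path; a large,
page-aligned write is the pattern that can get loaned out instead of
copied. A sketch (the exact loaning thresholds are internal to the
kernel, so treat this as illustrative):

#include <sys/mman.h>

#include <err.h>
#include <string.h>
#include <unistd.h>

/*
 * A large, page-aligned write on a TCP socket.  The kernel may loan
 * these pages to the protocol instead of copying; writing to them while
 * they are still queued for transmit triggers a COW, so the data on the
 * wire stays consistent.
 */
void
send_block(int sock, size_t len)
{
	char *buf;

	/* Anonymous mmap() gives page-aligned memory. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (buf == MAP_FAILED)
		err(1, "mmap");

	memset(buf, 'x', len);		/* payload */

	if (write(sock, buf, len) == -1)
		err(1, "write");

	buf[0] = 'y';	/* safe even if pages are still loaned: COW */
}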
> [6] I consider myself an expert C programmer, but I haven't coded
>     anything for the NetBSD kernel yet (I did browse the sources, though;
>     they seem much better organized, "cleaner", and better documented
>     than the Linux sources, so I'd prefer to code this for NetBSD). If I
>     embark on this project, I'd rely on the help of the experts on the
>     list when I get stuck.
My main suggestion is to step back and talk about what you really want to
do. There may well be easier ways to hook everything together to get there
both more quickly and in a more reusable manner.
Take care,
Bill