Subject: Request for comments: sharing memory between user/kernel space
To: None <tech-kern@netbsd.org>
From: Zeljko Vrba <zvrba@globalnet.hr>
List: tech-kern
Date: 03/20/2007 21:00:21

Hi!  I am thinking of employing the following scheme for user/kernel
communication.  The scheme is targeted at writing highly concurrent,
asynchronous applications.  Roughly, the kernel would expose to applications
an interface similar to the DMA ring buffer/IRQ scheme used by much PCI
hardware.  It may also remind you of IO-Lite.

I want to write a device that supports only the mmap() and ioctl() operations.
When loaded, the device would allocate and lock (I believe the correct BSD
term is "wire"?) a contiguous[1] chunk of physical memory.  The chunk's size
would be specified at module load time, and to keep things simple, I am not
planning on supporting resizing for now.  A process[2] may mmap() the region
into its virtual address space as read/write.  This establishes the shared
memory area between the processes and the kernel.

[1] I intend to use this area for DMA to/from hardware.  I'm not sure whether
    contiguity is required, but it seems that it might make life simpler.
    (Otherwise, what is contiguous in virtual memory might not be contiguous
    in physical memory, so how would such data be DMAed?  Does the kernel
    already handle this?)  A rough kernel-side sketch follows footnote [2]
    below.

[2] It might be wise to restrict this functionality to, e.g., root-owned
    processes only.
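
To make the allocation step concrete, here is a minimal kernel-side sketch,
assuming NetBSD's bus_dma(9) framework; the softc layout, function name and
SHMRING_SIZE constant are made up for illustration.  Requesting a single
segment from bus_dmamem_alloc() asks for physically contiguous, wired memory.
On the other hand, bus_dmamap_load() can build a scatter/gather list from a
non-contiguous buffer, so strict contiguity may not be needed at all.

#include <sys/param.h>
#include <sys/bus.h>		/* <machine/bus.h> on older releases */

#define SHMRING_SIZE	(4 * 1024 * 1024)	/* example size: 4 MB */

struct shmring_softc {
	bus_dma_tag_t		sc_dmat;	/* parent DMA tag */
	bus_dmamap_t		sc_dmamap;
	bus_dma_segment_t	sc_seg;
	void			*sc_kva;	/* kernel mapping of the region */
};

/* Allocate one wired, physically contiguous segment and map it. */
static int
shmring_alloc(struct shmring_softc *sc)
{
	int rseg, error;

	error = bus_dmamem_alloc(sc->sc_dmat, SHMRING_SIZE, PAGE_SIZE, 0,
	    &sc->sc_seg, 1, &rseg, BUS_DMA_NOWAIT);
	if (error)
		return error;

	error = bus_dmamem_map(sc->sc_dmat, &sc->sc_seg, rseg, SHMRING_SIZE,
	    &sc->sc_kva, BUS_DMA_NOWAIT | BUS_DMA_COHERENT);
	if (error) {
		bus_dmamem_free(sc->sc_dmat, &sc->sc_seg, rseg);
		return error;
	}

	/* The map is what would eventually be handed to the hardware. */
	error = bus_dmamap_create(sc->sc_dmat, SHMRING_SIZE, 1, SHMRING_SIZE,
	    0, BUS_DMA_NOWAIT, &sc->sc_dmamap);
	if (error) {
		bus_dmamem_unmap(sc->sc_dmat, sc->sc_kva, SHMRING_SIZE);
		bus_dmamem_free(sc->sc_dmat, &sc->sc_seg, rseg);
		return error;
	}
	return 0;
}

The device's mmap() entry point could then hand the same segment back to
userland, presumably via bus_dmamem_mmap().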

I want to use this area for all network and raw disk I/O to/from the
application.  Use cases:

1. The application wants to write data to disk[3]/network.  The application
uses ioctl() to allocate a chunk of memory from the shared region.  It fills
the chunk with data and calls writev() on the appropriate file descriptor to
send the data to disk/network.  writev() would be non-blocking and return
immediately.  When the transfer is complete, the kernel sends a completion
notification[4] to the application.  The application is then free to reuse
the buffer or release it with ioctl().  (A rough userland sketch follows
footnote [4] below.)

The writev() call would perform its usual function, minus the data copying/VM
manipulation.

[3] I would set aside the issue of synchronization with FS buffers for now.
    This would be used to access raw disk devices only.

[4] I thought of using real-time signals.  They are a) queued, and b) able to
    carry additional information.  I'm still thinking about a way to "tag"
    individual readv()/writev() requests (or even individual elements of the
    iovec array), since the file descriptor alone is not enough to identify
    which of several outstanding operations has completed.

    "Completion" would be protocol-specific.  In the case of, e.g., a TCP
    connection, it would correspond to the ACK received from the other end.
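
Here is a minimal userland sketch of the flow in case 1.  The SHMIOC_* ioctl
numbers and struct shm_alloc are entirely hypothetical; they only illustrate
the proposed interface, not an existing one.

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

struct shm_alloc {		/* hypothetical ioctl argument */
	size_t	sa_len;		/* in: requested chunk size */
	off_t	sa_off;		/* out: chunk's offset within the region */
};

#define SHMIOC_ALLOC	_IOWR('S', 0, struct shm_alloc)	/* hypothetical */
#define SHMIOC_FREE	_IOW('S', 1, struct shm_alloc)	/* hypothetical */

/* 'region' is the mmap()'d shared area obtained from the device. */
int
write_one_chunk(int shmfd, int datafd, void *region)
{
	struct shm_alloc sa = { .sa_len = 65536 };
	struct iovec iov;

	if (ioctl(shmfd, SHMIOC_ALLOC, &sa) == -1)
		return -1;

	/* Fill the chunk in place; writev() will not copy it. */
	iov.iov_base = (char *)region + sa.sa_off;
	iov.iov_len = sa.sa_len;
	memset(iov.iov_base, 'x', iov.iov_len);

	/*
	 * Returns immediately; the chunk must not be reused until the
	 * completion signal for this request arrives (footnote [4]).
	 */
	if (writev(datafd, &iov, 1) == -1) {
		(void)ioctl(shmfd, SHMIOC_FREE, &sa);
		return -1;
	}
	return 0;
}

The region itself would be mapped once at startup, roughly
region = mmap(NULL, region_size, PROT_READ | PROT_WRITE, MAP_SHARED, shmfd, 0).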

2. The application wants to read data from disk.  As in case 1, it allocates
buffers via ioctl(), fills the iovec with the buffers' addresses, and calls
readv(), which would also return immediately.  The application would receive a
real-time signal on completion.  The intent is to read data from disk directly
into the application-specified buffers.  [Although mmap() offers similar
functionality, access to an mmap()'d file may block, and there is no
completion notification.]
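
A sketch of the completion handler, assuming the notification carries the
request tag in si_value.  SIGUSR1 is only a stand-in; a queued real-time
signal (footnote [4]) would be the real choice, subject to what the signal
implementation supports.  request_completed() is a hypothetical bookkeeping
routine.

#include <signal.h>
#include <string.h>

/* Hypothetical per-request bookkeeping, implemented elsewhere. */
extern void request_completed(int tag);

static void
io_done(int sig, siginfo_t *info, void *ctx)
{
	/* The tag names the readv()/writev() request that completed. */
	int tag = info->si_value.sival_int;

	/* Only async-signal-safe work belongs here. */
	request_completed(tag);
	(void)sig;
	(void)ctx;
}

int
install_completion_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = io_done;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	return sigaction(SIGUSR1, &sa, NULL);
}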

3. Receiving data from the network.  This can be arranged similarly to the
disk case.  If the application isn't expecting to read any data (= no buffers
allocated, no readv() issued), the kernel would just drop packets destined[7]
for it.  An alternative is to modify[5] the readv() call as follows (see the
sketch after footnote [7]):
  - on input, ignore the contents of the iovec argument and interpret the
    count as the number of free slots in the iovec array
  - when some data arrives, fill the iovec array and send a signal to the
    application
  - the kernel would allocate network buffers in the shared area, in the
    usual way (as is done now), and drop packets if it can't allocate memory

[5] It might be best not to overload writev()/readv() with different
    semantics, but to just use ioctl() for everything.

[7] How to decide what is destined for the application?  How hard would it
    be to integrate this with the packet filter?
    
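A sketch of the ioctl-based variant from footnote [5].  Everything here is
hypothetical; to let the kernel fill in the iovec slots after the call has
already returned, the receive descriptor is assumed to live inside the shared
region, and only its offset within the region is passed down.

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/uio.h>

#define NSLOTS	16

struct shm_recv {			/* lives inside the shared region */
	int		sr_tag;		/* echoed back in si_value */
	int		sr_nslots;	/* in: free slots; out: slots filled */
	struct iovec	sr_iov[NSLOTS];	/* filled with kernel-allocated buffers */
};

#define SHMIOC_RECV	_IOW('S', 2, off_t)	/* hypothetical: descriptor offset */

/* 'sr' must point into the mmap()'d shared area starting at 'region'. */
int
post_receive(int sockfd, void *region, struct shm_recv *sr, int tag)
{
	off_t off = (char *)sr - (char *)region;

	sr->sr_tag = tag;
	sr->sr_nslots = NSLOTS;

	/*
	 * Returns immediately.  When packets arrive, the kernel allocates
	 * buffers in the shared area, records them in sr_iov, and raises
	 * the completion signal; with no descriptor posted, packets for
	 * this socket are simply dropped.
	 */
	return ioctl(sockfd, SHMIOC_RECV, &off);
}
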
Linux, with its splice/vmsplice/tee system calls, comes close to this, but
unless I have overlooked something, they have problems splicing sockets to
sockets, the mechanism isn't zero-copy when receiving data from the network,
and vmsplice still lacks completion signalling [so that the application knows
when it is safe to modify the data].

I'm aware of potential issues, such as possible data corruption if the
application modifies data before it has been written, possible security
problems if the kernel's "control data" has to be allocated together with the
bulk data in the shared region, etc.  The first case is simply an incorrectly
written application; the second I'm willing to ignore for the moment.

So, what I'm interested in is:

  - is it possible at all in NetBSD?
  - would it require a comprehensive rewrite of the I/O, network, or VM
    subsystems (= is it feasible in a time frame of ca. 3-4 months[6])?
  - has it already been tried (& failed :))
  - criticisms of the idea :)

[6] I consider myself an expert C programmer, but I haven't coded anything for
the NetBSD kernel yet (I did browse the sources, though; they seem much better
organized, "cleaner", and better documented than the Linux sources, so I'd
prefer to code this for NetBSD).  If I embark on this project, I'd rely on the
help of the experts on this list when I get stuck.

Thanks.