Subject: Re: M:N and blocking ops without SA, AIO
To: Matthew Mondor <mm_lists@pulsar-zone.net>
From: jonathan@dsg.stanford.edu
List: tech-kern
Date: 03/01/2007 13:03:40
In message <20070301151926.26e72ae0@hal.xisop>,
Matthew Mondor writes:


>Exactly, and this would be especially suited for latency-critical
>single-threaded applications using non-blocking I/O with kqueue for
>sockets, but which also need to do some disk I/O occasionally...  

Yes, that matches what I was thinking of.

>And as
>previously mentioned this might enhance the efficiency of an M:N
>model's userland scheduler.

Except we don't have an M:N userland scheduler in -current, not since
SAs went away.


>> Well, the typical way to implement AIO is to have a pool of kernel
>> threads. Grab a kernel thread, issue the i/o, using the kernel thread
>> as the thread which blocks until the I/O is complete.  Then the kernel
>> thread posts completion to the AIO subsystem, which passes appropriate
>> status, signal info, etc. to the requesting thread.  Hmmmm, continuations :-/
>
>If kernel threads are really needed for this, it makes me wonder if a
>pool of LWPs handled by a userland M:N scheduler especially for disk I/O
>would be as efficient...  wouldn't there be a way for the kernel to do
>this more asynchronously without the need for a kthread per blocking
>I/O call?  Or would this require a multithreading revamp of various
>subsystems? 

Indeed, there are POSIX-API-based, userland-only, threads-based
implementations of AIO. If memory serves, the earlier Linux glibc AIO
implementation was based on POSIX threads, and remained that way until
ASE was integrated into stock Linux kernels.
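
For reference, the usual shape of such a userland implementation is a
small pool of POSIX threads doing blocking pread()/pwrite() on behalf
of the aio_*() callers.  A stripped-down sketch of the idea only, not
glibc's code: one thread per request instead of a bounded pool, and no
result or signal delivery:

#include <aio.h>
#include <pthread.h>
#include <unistd.h>

static void *
aio_worker(void *arg)
{
	struct aiocb *cb = arg;

	/* Block in the worker thread, not in the caller. */
	(void)pread(cb->aio_fildes, (void *)cb->aio_buf,
	    cb->aio_nbytes, cb->aio_offset);
	/* ...stash result/errno where aio_return()/aio_error() look... */
	return NULL;
}

int
my_aio_read(struct aiocb *cb)	/* stand-in for aio_read(3) */
{
	pthread_t t;

	if (pthread_create(&t, NULL, aio_worker, cb) != 0)
		return -1;
	pthread_detach(t);
	return 0;
}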

But in *BSD kernels, the last time I had to implement AIO-like ops and
kqueue (or something kqueue-like), disk I/O happened through the buffer
subsystem and "struct buf"; and the only "I/O done" notification
provided was either to wake up the process(es) sleeping on the buffer
address, or to call a b_iodone callback. Except *(b_iodone)() is for
internal use by the buffered I/O system, and the callback isn't run in
a context where you can necessarily do what you'd want for, say, a
sendfile()-style operation.  And indeed, what an app really wants is,
arguably, a filesystem-level notification.
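
To make that concrete, the shape of the thing is roughly as follows.
This is only a sketch with invented names (my_aio_req, my_aio_biodone,
my_aio_enqueue_done), and the struct buf member used to carry the
pointer is approximate, since field names vary between releases:

struct my_aio_req {
	struct aiocb	*ar_ucb;	/* userland aiocb (user VA) */
	int		 ar_error;
	size_t		 ar_resid;
	/* ... queue linkage, owning proc, etc. ... */
};

static void
my_aio_biodone(struct buf *bp)
{
	struct my_aio_req *ar = bp->b_private;	/* member name approximate */

	/*
	 * Runs from the buffer-I/O completion path: no sleeping, no
	 * copyout().  Just record status and hand off to a worker.
	 */
	ar->ar_error = bp->b_error;
	ar->ar_resid = bp->b_resid;
	my_aio_enqueue_done(ar);	/* invented helper: wakes a kthread */
}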

So the last time I had to implement sendfile(), I added a kcont-like
thing into the UFS layer, and used *that* to do deferred processing.
But that required API changes, or struct additions, all the way up to
the VFS layer, at least.
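
The "kcont-like thing" was essentially a little completion record, a
function pointer plus an argument, carried down with the I/O and
invoked when it finished.  Roughly, with invented names rather than
the actual kcont interface:

struct deferred_op {
	void	(*do_fn)(void *, int);	/* continuation to run */
	void	 *do_arg;		/* e.g. sendfile() state */
};

static void
deferred_complete(struct deferred_op *op, int error)
{
	(*op->do_fn)(op->do_arg, error);
}

Threading that pointer down to where the I/O actually completes is
what forced the API and struct changes up through VFS.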

Half the time, I think what we'd really want is a hook at the
filesystem layer, analogous to the sorwakeup()/sowwakeup() calls on
sockets.  But it's almost exactly the lack of such a hook that
prevents us from doing select() on disk files the way we can on
sockets [1]. :-/  Note that sorwakeup()/sowwakeup() are wrappers for
sowakeup(), and that's where poll(), select(), wakeup(), SIGIO for
asynchronous I/O, etc. [2], are all handled.
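
To illustrate what I mean by "hook", something with roughly the shape
below.  This is entirely invented (there is no fnotify/fowakeup); the
selnotify()/fownsignal()/wakeup() calls are only meant to indicate the
kind of work sowakeup() does, and their exact signatures differ across
versions:

struct fnotify {
	struct selinfo	fn_sel;		/* select()/poll() waiters */
	pid_t		fn_pgid;	/* SIGIO recipient, if any */
	int		fn_flags;	/* FN_ASYNC: app asked for SIGIO */
};

static void
fowakeup(struct fnotify *fn)
{
	selnotify(&fn->fn_sel, 0);		/* poll()/select() */
	if (fn->fn_flags & FN_ASYNC)		/* asynchronous SIGIO */
		fownsignal(fn->fn_pgid, SIGIO, 0, 0, NULL);
	wakeup(fn);				/* plain sleepers */
}

The filesystem (or buffer layer) would call that wherever a file's
data or space becomes ready, the way the socket layer calls
sorwakeup()/sowwakeup().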

So in summary: in your shoes, I'd keep on with kthreads for now.
There's plenty to be done with getting aio_cancel() right, and doing a
safe implementation of AIO handles.

The POSIX userland API for aio_cancel() uses a user VA. Typical
implementations include a cookie in the userland struct aiocb, which
maps, somehow, to the KVA state for the I/O operation. Using a raw
kernel-space pointer is tempting, but it leads to problems making the
implementation safe against userland code that hands back stale or
malformed cookies.  (NetBSD has extra issues with 32-bit userland
emulation on 64-bit kernels, but let's skip that for now.) Most
implementations I've looked at closely have had serious security
problems with AIO state and/or aio_cancel() at one time or another.
That's where I'd put my time initially.  But of course you are free
to spend your time elsewhere.
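
If it helps, the usual safe pattern for the cookie is something like
the sketch below (names invented): the value stored in the user's
aiocb is a small slot number plus a generation count, looked up in a
per-process table, so a stale or forged cookie fails the lookup
instead of being dereferenced.  And since the cookie is just an
integer, it doesn't care about userland pointer width.

struct aio_job {
	u_int	aj_gen;		/* bumped each time the slot is reused */
	/* ... the rest of the per-request state ... */
};

struct aio_handle_table {
	struct aio_job	**aht_jobs;	/* slot -> in-flight job, or NULL */
	u_int		  aht_nslots;
	/* per-process; protected by a lock, omitted here */
};

static struct aio_job *
aio_job_lookup(struct aio_handle_table *t, uint64_t cookie)
{
	u_int slot = cookie & 0xffffffff;
	u_int gen  = cookie >> 32;

	if (slot >= t->aht_nslots || t->aht_jobs[slot] == NULL ||
	    t->aht_jobs[slot]->aj_gen != gen)
		return NULL;		/* stale or bogus cookie */
	return t->aht_jobs[slot];
}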


[1] Well, modulo availability/allocation of buffers, as discussed
earlier in this thread.


[2] The "etc." in sowakeup() is the so_upcall function, used by NFS,
the SMB-client code, and also (mis)used by my kcont/splice(2) patch
from some years ago.