Subject: Re: CVS commit: src
To: Mindaugas Rasiukevicius <rmind@netbsd.org>
From: Andrew Doran <ad@netbsd.org>
List: tech-kern
Date: 04/30/2007 23:13:18
Hi,

On Mon, Apr 30, 2007 at 02:44:31PM +0000, Mindaugas Rasiukevicius wrote:

> Modified Files:
> 	src/distrib/sets/lists/comp: mi
> 	src/include: Makefile
> 	src/lib/librt: Makefile
> 	src/sys/conf: files
> 	src/sys/ddb: db_command.c db_interface.h db_xxx.c
> 	src/sys/kern: kern_exit.c kern_fork.c syscalls.master
> 	src/sys/sys: Makefile proc.h unistd.h
> Added Files:
> 	src/include: aio.h
> 	src/lib/librt/sys: Makefile.inc
> 	src/sys/kern: vfs_aio.c
> 	src/sys/sys: aio.h
> 
> Log Message:
> Import of POSIX Asynchronous I/O.
> Seems to be quite stable. Some work still left to do.
> 
> Please note, that syscalls are not yet MP-safe, because
> of the file and vnode subsystems.

Cool! I have some more comments & suggestions now that this is in the tree.

o It belongs in sys_aio.c for a couple of reasons. The file implements only
  syscalls and direct support for them, and it's not a VFS component as
  such. It should be possible to do AIO on other types of file descriptor.

o I'm still not fond of copying out the state to userspace only for the
  kernel to copy it back in later when asked for it. I'm not familiar with
  the ins and outs of the POSIX AIO stuff, but I'd prefer to see some kind
  of job control block stick around in the kernel until userspace calls in
  to collect it. Would that be possible?
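
  For the sake of argument, here's roughly the kind of record I'm picturing,
  kept on a per-process list until userspace calls in to collect it (all of
  the names below are invented for illustration; TAILQ_ENTRY comes from
  <sys/queue.h>):

	/* Completed-job record that stays in the kernel until collected. */
	struct aio_result {
		TAILQ_ENTRY(aio_result)	ar_list;	/* process's done list */
		struct aiocb		*ar_uaiocb;	/* userspace aiocb answered */
		ssize_t			ar_retval;	/* result of the I/O */
		int			ar_error;	/* error code, or 0 */
	};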

o Holding p->p_mutex across mutex_init and pool_init is dicey since they can
  sleep in order to allocate memory.  There are a number of other things that
  depend on p_mutex, so if an LWP sleeps here it could tie other LWPs up in
  knots waiting. To fix that, I suggest a flag in p_flag and perhaps another
  condvar that says "AIO is being initialized". Other threads that end up in
  aio_init would wait for that flag to clear and then return:

	mutex_enter(&p->p_mutex);
	if ((p->p_flag & PK_AIOINIT) != 0 || p->p_aio != NULL) {
		/* Already initializing or initialized. */
		while ((p->p_flag & PK_AIOINIT) != 0)
			cv_wait(&p->p_somecv, &p->p_mutex);
		if (p->p_aio != NULL) {
			mutex_exit(&p->p_mutex);
			return;
		} else {
			/*
			 * Another thread's initialization attempt
			 * may have failed: retry.
			 */
		}
	}

o Testing whether p->p_aio is non-NULL appears to be unsafe, as it can
  currently be non-NULL while AIO is not fully initialized. To fix that
  the worker thread could also check for the "AIO is initializing" flag in
  p_flag, and wait until it clears. Then, we could set p_aio as the last
  action in aio_init, like this:

	/* No longer initializing. */
	mutex_enter(&p->p_mutex);
	p->p_aio = aio;
	p->p_flag &= ~PK_AIOINIT;
	cv_broadcast(&p->p_somecv);
	mutex_exit(&p->p_mutex);

o In aio_exit, it's better to cache the value of p_aio in a local variable;
  otherwise the compiler will generate a lot more machine code. If you don't
  believe me, try it both ways and use objdump -d to check :-)
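
  That is, something along these lines (the struct name is my guess):

	struct aioproc *aio = p->p_aio;		/* read p_aio only once */

	if (aio == NULL)
		return;
	/* ... then use 'aio' below instead of re-reading p->p_aio ... */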

o In aio_worker, we should check for return values other than EINTR too
  (e.g. ERESTART), so != 0 may be a better choice here:
			if (cv_wait_sig(&aio->aio_worker_cv,
			    &aio->aio_mtx) == EINTR) {

o In aio_process, it looks like this will leak file::f_slock:

		if (auio.uio_resid > SSIZE_MAX) {
			error = EINVAL;
			goto done;
		}
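
  One way to avoid that (assuming f_slock is taken further up in aio_process
  and that 'fp' is the struct file pointer - both guesses on my part) is to
  drop the lock on the error path before bailing out:

		if (auio.uio_resid > SSIZE_MAX) {
			/* Don't leave with f_slock still held. */
			simple_unlock(&fp->f_slock);
			error = EINVAL;
			goto done;
		}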

o In aio_enqueue_job, this check doesn't make a lot of sense, since e.g.
  copyin() can block between this check and the point where the job is
  actually enqueued, so several threads can slip past the limit at once.

	/* Check for the limit */
	if (aio_jobs_count + 1 > AIO_MAX) /* XXXSMP */
		return EAGAIN;

  Really you should check the value and increment it in one go. With the
  atomic ops API that would look something like this, but on a UP system
  the idea is the same: do the check and allocate the resource at the
  same time.

	u_int count;

	do {
		count = aio_jobs_count;
		if (count + 1 > AIO_MAX)
			return EAGAIN;
	} while (atomic_cas_uint(&aio_jobs_count, count, count + 1) != count);

o In aio_enqueue_job:

	/*
	 * Look for already existing job.  If found - the job is in-progress.
	 * According to POSIX this is invalid, so return the error.
	 */

  This doesn't make sense because we unlock immediately after the check, but
  without reserving the job ID. More generally, the subsequent calls to e.g.
  aio_init or copyout can block, and a job with the same ID can be enqueued
  in the meantime. Traversing the list will also get really expensive, so
  some kind of hash would be good.
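
  For example, something along these lines, keyed on the userspace aiocb
  pointer (member and struct names are invented here; I'm assuming an array
  of LIST_HEADs gets added to the per-process AIO structure, and LIST_* are
  the usual <sys/queue.h> macros):

	#define AIO_HASH_SIZE	64	/* arbitrary power of two */
	#define AIO_HASH(cb)	(((uintptr_t)(cb) >> 4) & (AIO_HASH_SIZE - 1))

	/* Find a queued job by its userspace aiocb pointer, or NULL. */
	static struct aio_job *
	aio_job_lookup(struct aioproc *aio, const void *uaiocb)
	{
		struct aio_job *job;

		KASSERT(mutex_owned(&aio->aio_mtx));
		LIST_FOREACH(job, &aio->aio_hash[AIO_HASH(uaiocb)], a_hashlist)
			if (job->a_uaiocb == uaiocb)
				return job;
		return NULL;
	}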

o In sys_aio_cancel, copyout is being called with aio_mtx held. That's valid,
  but it will choke up anything else in the process that tries to use AIO if
  the thread blocks while copying out the structure.
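
  For example (hand-waving the surrounding code; 'a_job' and the member and
  argument names are made up), the data could be snapshotted under the lock
  and copied out only after aio_mtx has been dropped:

	struct aiocb cb;

	mutex_enter(&aio->aio_mtx);
	/* ... locate the job and record what needs to go to userspace ... */
	cb = a_job->aiocb;
	mutex_exit(&aio->aio_mtx);

	/* Safe to block here without holding up other AIO users. */
	error = copyout(&cb, SCARG(uap, aiocbp), sizeof(cb));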

Cheers,
Andrew