Subject: Re: FreeBSD Bus DMA (was Re: AdvanSys board support)
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Justin T. Gibbs <gibbs@plutotech.com>
List: tech-kern
Date: 06/12/1998 15:31:45
>Err, I think maybe I miscommunicated what I meant...
>
>The callback doesn't say "Ok, now run this specific job", but rather "Hey,
>driver: Go run your queue!  You have resources now."

How many queued-up clients does it wake up?  How do you ensure fairness
between clients, since the first one you wake could have queued more
requests in the time since its deferment?
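
To make the fairness concern concrete, here is a minimal sketch (the
dma_client structure and the queue names are hypothetical, not anything
in the current code) of waking deferred clients in strict FIFO order so
a freshly-woken client can't starve the ones behind it:

/*
 * Hypothetical sketch: clients that defer for lack of mapping
 * resources park themselves on a FIFO.  When resources are freed,
 * only the head of the line is woken; if it defers again it goes
 * back to the tail, so no one client can monopolize freed resources.
 */
struct dma_client {
	TAILQ_ENTRY(dma_client)	  dc_links;	/* shortage queue glue */
	void			(*dc_wakeup)(void *);
	void			 *dc_arg;
};

static TAILQ_HEAD(, dma_client) dma_shortage_q =
    TAILQ_HEAD_INITIALIZER(dma_shortage_q);

/* Called whenever mapping resources are returned to the pool. */
void
dma_resources_available()
{
	struct dma_client *dc;

	if ((dc = TAILQ_FIRST(&dma_shortage_q)) != NULL) {
		TAILQ_REMOVE(&dma_shortage_q, dc, dc_links);
		(*dc->dc_wakeup)(dc->dc_arg);
	}
}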

> > I certainly see the merits of preemptive kernel threads, but I'm hoping you
> > can clarify how you intend them to be used.  For instance, the CAM SCSI
> > layer currently uses an SWI to do mid-layer routing and command completion.
> > I envision this being replaced by a thread per CPU in the system to allow
> > parallel completion processing.  Even in this situation, I don't want these
> > threads to block if it can be avoided.  There are certain situations having
> > to do with bus, target, or lun rescanning where blockage for memory 
> > resources can occur, but this is such a rare event that it would be foolish
> > to optimize for it.  Bus dma operations, however, occur all the time.  I
> > don't want my driver thread to block on a bus dma operation if this means
> > it cannot service other asynchronous tasks such as a command completion.
> > I've heard of people solving this by giving each I/O its own thread 
> > context, but this seems like a recipe for an unscalable system.  I would
> > expect that the best approach would be a combination of multiple threads
> > and callbacks, so that threads can be statically allocated for given tasks
> > and additional thread allocation to avoid blockage on deferments becomes
> > unnecessary.
>
>There wouldn't be a "thread for SCSI per CPU"... there would be a
>"thread per instance of a SCSI driver".  Also, with kernel threads,
>the need for software interrupts goes completely away; all you do
>is wake up the thread you want to run.  (Software interrupts still
>have the problem that they run in interrupt context; I want interrupt
>context to largely go away.)

This was the reason I suggested replacing the CAM SWI mechanism with
a thread per CPU.  More on this later.

>I.e. if you have 4 BusLogics in your system:
>
>	bha0	has its own thread
>	bha1	has its own thread
>	bha2	has its own thread
>	bha3	has its own thread
>
>...these threads can run on any CPU.  And since whenever bha driver code
>runs, it will be running in its own context, it can always block and
>never defer.  Even while asleep (blocking), the upper level will be
>able to queue jobs.

Are you talking about having two threads per card?  Otherwise, how would
you avoid a blocked incoming request preventing the service of a command
completion?  The completing command may well be holding the resource you
are blocked on.
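
If the answer is two threads, I'd expect the second to look roughly like
this (a sketch only; sc_done_q, sc_done_slock, xs_dmamap, and friends are
hypothetical names): a per-instance completion thread that does nothing
but reap finished commands and unload their maps, so the resources a
sleeping bus_dmamap_load() is waiting on can still be freed.

/*
 * Hypothetical sketch of a per-instance completion thread.
 * Because it never loads maps itself, it can always make progress
 * and free the resources the submit thread may be sleeping on.
 * (The wakeup/sleep race is elided for brevity.)
 */
void
bha_completion_thread(sc)
	struct bha_softc *sc;
{
	struct scsipi_xfer *xfer;

	for (;;) {
		simple_lock(&sc->sc_done_slock);
		while ((xfer = TAILQ_FIRST(&sc->sc_done_q)) != NULL) {
			TAILQ_REMOVE(&sc->sc_done_q, xfer, xs_links);
			simple_unlock(&sc->sc_done_slock);

			/* Unloading the map may unblock the submit side. */
			bus_dmamap_unload(sc->sc_dmat, xfer->xs_dmamap);
			scsipi_done(xfer);

			simple_lock(&sc->sc_done_slock);
		}
		simple_unlock(&sc->sc_done_slock);
		(void) tsleep(&sc->sc_done_q, PRIBIO, "bhadone", 0);
	}
}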

>The algorithm might look like this:
>
>/*
> * bha_run_queue:
> *
> *	Run our job queue.  Called when our thread is awakened by the
> *	upper level SCSI code.
> */
>void
>bha_run_queue(sc)
>	struct bha_softc *sc;
>{
>	struct scsipi_queue *scsiq = &sc->sc_link.scsi_queue;
>	struct scsipi_xfer *xfer;
>	int error;
>
> again:
>	/* Grab job off queue. */
>	simple_lock(&scsiq->scq_slock);
>	xfer = TAILQ_FIRST(&scsiq->scq_jobs);
>	simple_unlock(&scsiq->scq_slock);
>
>	/* No work to do, just return. */
>	if (xfer == NULL)
>		return;
>
>	/* ... */
>
>	/* Map the transfer. */
>	if ((error = bus_dmamap_load(sc->sc_dmat, map, xfer->xs_buf,
>	    xfer->xs_buflen, xfer->xs_proc, BUS_DMA_WAITOK)) != 0) {
>		/*
>		 * Since we can block, this truly is an error, not
>		 * just a resource shortage.
>		 */
>		xfer->xs_error = error;
>		thread_wakeup(xfer->xs_waiter);
>		goto again;
>	}
>
>	/* Start the job. */
>	...
>
>	/* Look for more work. */
>	goto again;
>}

This works for handling the conversion of the incoming command into card
actions, but you also need a task (or a task per CPU, for maximum
concurrency) to run the mid-level SCSI layer and the cross-device
scheduling that goes on there.  You don't want to drive that from a
blocked process context; otherwise you can't cleanly handle things like
async I/O without a thread-per-I/O context.
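
As a sketch of what I mean (the queue, lock, and routing-function names
here are hypothetical, not current CAM code), each CPU would run
something like:

/*
 * Hypothetical sketch of a per-CPU mid-layer thread.  Each one
 * drains a shared completion queue, so routing and cross-device
 * scheduling can proceed on every CPU in parallel without
 * borrowing a blocked process context.
 */
static TAILQ_HEAD(, ccb_hdr) cam_doneq = TAILQ_HEAD_INITIALIZER(cam_doneq);
static struct simplelock cam_doneq_slock;

void
cam_midlayer_thread(arg)
	void *arg;
{
	struct ccb_hdr *ccb_h;

	for (;;) {
		simple_lock(&cam_doneq_slock);
		while ((ccb_h = TAILQ_FIRST(&cam_doneq)) != NULL) {
			TAILQ_REMOVE(&cam_doneq, ccb_h, sim_links.tqe);
			simple_unlock(&cam_doneq_slock);

			/* Route the completion, run device scheduling. */
			cam_process_done(ccb_h);

			simple_lock(&cam_doneq_slock);
		}
		simple_unlock(&cam_doneq_slock);
		(void) tsleep(&cam_doneq, PRIBIO, "cammid", 0);
	}
}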

>Note that in my world, the driver would never be invoked directly
>by the upper-level SCSI code.  That code merely locks the driver's
>queue, puts the job on the end, unlocks it, and wakes up the driver's
>thread.

Sure.  Sure.

>In the event the driver is blocking on the map load, the thread_wakeup()
>by the upper level won't actually wake it up, which has the effect
>of freezing the driver's queue, thus enforcing the necessary ordering.

My concern here is that there may be other things this thread could be
doing if it didn't have to block in this way.  Blocking a thread is one
way to handle deferments, but I'm not convinced it is the most efficient
way in all cases.
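
For comparison, here is roughly the shape of a callback-style deferment
(a sketch under assumptions: the bus_dmamap_load_callback() variant and
the bha_* helpers shown are illustrative, not an existing interface).
The submitting thread never sleeps; if resources are short, the load is
queued and the callback fires when they free up, leaving the thread free
to service completions in the meantime:

/*
 * Hypothetical callback, run when a deferred mapping completes
 * (or fails).  The S/G list is handed to the card here rather
 * than in the submit path.
 */
static void
bha_dma_cb(arg, segs, nseg, error)
	void *arg;
	bus_dma_segment_t *segs;
	int nseg, error;
{
	struct scsipi_xfer *xfer = arg;

	if (error != 0) {
		xfer->xs_error = error;
		bha_done(xfer);		/* complete with error */
		return;
	}
	bha_start_ccb(xfer, segs, nseg);
}

	/* In the submit path: may defer, never blocks. */
	error = bus_dmamap_load_callback(sc->sc_dmat, map, xfer->xs_buf,
	    xfer->xs_buflen, bha_dma_cb, xfer, 0);
	if (error == EINPROGRESS)
		return;		/* bha_dma_cb will run later */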

>(When the driver instance is idle, obviously it will be sleeping on
>some well-known address, so that both the upper-level and the interrupt
>stub can wake it up.)
>
>Since the driver instance always runs in its own context, it knows that
>it can always block if it has to, and never has to defer any requests.

So long as you dedicate a thread to each task that you want to occur in
parallel, yes, this works.  For your example, it seems you would need an
interrupt-processing thread too.

>This has other benefits, too... since all driver instances are scheduled,
>based on varying priority (just like regular processes), you won't encounter
>livelock conditions when you're being pounded with interrupts from
>e.g. your gigabit ethernet interfaces.

An interesting concern about interrupt handlers is that, on some hardware,
deferring the clearing of the interrupt is problematic.  In other words,
the device won't stop interrupting you until you acknowledge the
interrupt.  Doesn't this cause problems for an "interrupt thread"?
In the CAM layer, all of the interrupt handlers are extremely short anyway:
dequeue the completed command from the hardware, queue it to the mid-layer,
fire the SWI (or wake the thread).  In this situation, I don't see that
anything other than additional latency is added by having a dedicated
thread do interrupt processing.  This is especially true if you must
dequeue the work in interrupt context anyway in order to clear the
interrupt condition and allow normal kernel threads to run.
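
Concretely, the handlers look roughly like this (a sketch; the register
names, bha_next_done() helper, and sc_done_q are hypothetical):

/*
 * Hypothetical sketch of the kind of interrupt handler I mean.
 * All it does is silence the card, move completed work to a
 * queue, and kick the mid-layer; a dedicated interrupt thread
 * would add a context switch without removing any of this work.
 */
int
bha_intr(arg)
	void *arg;
{
	struct bha_softc *sc = arg;
	struct scsipi_xfer *xfer;

	/* Ack first: the card keeps interrupting until we do. */
	bus_space_write_1(sc->sc_iot, sc->sc_ioh, BHA_INTR_REG,
	    BHA_INTR_CLEAR);

	/* Dequeue completed commands from the mailbox. */
	while ((xfer = bha_next_done(sc)) != NULL) {
		simple_lock(&sc->sc_done_slock);
		TAILQ_INSERT_TAIL(&sc->sc_done_q, xfer, xs_links);
		simple_unlock(&sc->sc_done_slock);
	}

	/* Fire the SWI, or wake the completion thread. */
	wakeup(&sc->sc_done_q);
	return (1);
}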

>Jason R. Thorpe                                       thorpej@nas.nasa.gov
>NASA Ames Research Center                            Home: +1 408 866 1912
>NAS: M/S 258-5                                       Work: +1 650 604 0935
>Moffett Field, CA 94035                             Pager: +1 650 428 6939

--
Justin