Subject: Re: Machine-independent bus DMA interface proposal
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Justin T. Gibbs <gibbs@freefall.freebsd.org>
List: tech-kern
Date: 09/22/1996 23:08:25
>On Sun, 22 Sep 1996 21:02:20 -0700 
> "Justin T. Gibbs" <gibbs@freefall.freebsd.org> wrote:
>
> > My main problem with this proposal is that it doubles the space required
> > for sg entries and forces a copy of the bus_dma_segment_t information into
> > the private format of the driver.  This seems an enormous penalty
>
>...and there is a _very_ good reason for this, namely hardware-format
>scatter/gather descriptors should _never_ be accessed as structures
>in a portable driver.  This just loses completely.  The fact that NetBSD
>drivers currently do this type of access is a bug, which will have to
>be fixed if those drivers want to be used on an architecture with more
>strict structure alignment and packing restrictions.

I think we're missing each other here.  Forcing the device driver to do the
copy from a bus_dma_segment_t into its own private format doesn't solve the
structure alignment problem.  In fact it forces each individual driver to
handle it much the same as is done (or not done) now.  The only difference
is that instead of handling the mapping (and packing if necessary) inline,
you get an intermediate format in there first.  Having a family of
functions that can be ported to different architectures and know how to
deal with the SG formats takes the knowledge of how a given arch aligns its
data out of the driver completely.  My driver passes in a buffer to hold X
number of SG segments to one of these routines, and the routine, not my
driver, ensures that the packing is conformant with that SG format.
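
For illustration only, one member of such a family might look roughly like
the sketch below.  Every name in it (sg_format_t as an enum, struct
phys_seg, sg_pack) is hypothetical, and the address-then-count word order
and little-endian byte order are assumptions, not something either
proposal specifies.  The routine writes bytes rather than storing through
a struct, so host alignment and padding rules never enter into it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; nothing here comes from the posted API. */
typedef enum { SG_ADDR32_CNT32_LE } sg_format_t;

/* One physical segment as the mapping code would have computed it. */
struct phys_seg {
	uint32_t addr;
	uint32_t len;
};

/* Store a 32-bit value one byte at a time, little-endian (assumed). */
static void
put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * Pack nsegs segments into the driver-supplied buffer in the requested
 * hardware format.  Returns the number of bytes written, or 0 if the
 * format is unknown or the buffer is too small.
 */
static size_t
sg_pack(sg_format_t fmt, const struct phys_seg *segs, size_t nsegs,
    uint8_t *buf, size_t buflen)
{
	size_t i;

	assert(segs != NULL && buf != NULL);
	if (fmt != SG_ADDR32_CNT32_LE || nsegs > buflen / 8)
		return 0;
	for (i = 0; i < nsegs; i++) {
		put_le32(buf + 8 * i, segs[i].addr);
		put_le32(buf + 8 * i + 4, segs[i].len);
	}
	return nsegs * 8;
}
```

A driver would hand sg_pack() its own SG buffer and never touch the
layout itself; a port to another arch would only have to revisit this
one routine.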

>In terms of performance penalty, in general, I'm not completely
>convinced that
>
>	a) it's really going to be significantly more expensive, and
>
>	b) that the (probably marginal) performance win is worth
>	   the architectural compromise.

I haven't been convinced that there is an architectural compromise.  Right
now I'm looking at more code, more memory usage (multiple K), and an extra
copy of usually 16 u_int32_t per transaction in an inefficient for
loop at the driver level.  If my device doesn't need bouncing, what
have I gained?
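
To make the cost concrete, the per-driver translation step being objected
to would look something like the sketch below.  struct mi_seg is only a
stand-in for the machine-independent segment type (its real layout is not
something a driver may assume), and drv_translate and the two-words-per-
segment private format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the MI segment type; the real layout is not assumed. */
struct mi_seg {
	uint32_t addr;
	uint32_t len;
};

/*
 * Copy the intermediate segment list into a driver-private list of
 * 32-bit words (address, count pairs).  Returns the number of words
 * written, or 0 if the private list is too small.
 */
static size_t
drv_translate(const struct mi_seg *segs, size_t nsegs,
    uint32_t *hw, size_t hwwords)
{
	size_t i;

	assert(segs != NULL && hw != NULL);
	if (nsegs > hwwords / 2)
		return 0;
	for (i = 0; i < nsegs; i++) {
		hw[2 * i] = segs[i].addr;
		hw[2 * i + 1] = segs[i].len;
	}
	return 2 * nsegs;
}
```

With eight segments this loop moves sixteen 32-bit words per transaction,
and every driver carries its own copy of it.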

>At the same time, you want the access to the software scatter/gather
>lists to be sane from the programmer's perspective.

I don't see why allowing the user to specify the target SG format and
provide a target buffer makes this interface any harder for the programmer
to deal with.  You can still put whatever information you need in order
to be "sane" in the dma_handle since it is an opaque type.

>Also, the point of this interface is to be device-, bus-, and
>machine-independent.  In other words, the fact that 3 PC scsi cards
>use the same scatter/gather list format is completely irrelevant in
>the scope of designing such a DMA interface.

I would guess that there are quite a few more devices that use the
32-bit addr, 32-bit count format.  They also are not PC scsi cards;
they are PCI, EISA, and ISA SCSI cards that can be used on multiple
architectures.  Furthermore, the interface is not device, bus or
machine dependent, it is simply SG format dependent.  The SG format
dependencies have to be dealt with *somewhere*.  My point is only
that by handling them in the interface itself, you gain efficiency
and code reuse.

> > correspond directly to the private driver format.  I would much rather see
> > a family of functions that handle different sg formats which allows the
> > code to be shared among drivers (e.g. the ahb, bt, and aic7xxx, with the
> > exception of length of list, have the same format) and kept in one place so
> > that they are easy to port among archs and update if the API changes.
> > This removes the need for translation code in each driver and no extra
> > mapping space is needed.
>
>It's not clear to me that this addresses the concern of bus- and
>machine-independent DMA.  I.e. what you're suggesting would
>basically require machine-dependent portions for this "scsi card dma
>scatter/gather descriptor" function.  "Yuck."  Besides, to address
>the DMA mapping problem, you'd _still_ need an interface like this
>one, so your suggestion would actually be _more_ expensive.

If the code to copy the bus_dma_segment_t entries into the private
driver format cannot be accomplished in an MI way, then you may need
to have MD versions of each of these SG routines.  If an MD solution
is required, you'll need it anyway, it will simply be at the driver
level where it cannot be shared among drivers.

I fail to see how this can be more expensive.  You basically add some more
information to your request: an sg_format_t, and a target buffer.  If an
arch needs to store additional information to do DMA mapping, it can be
tacked onto the opaque handle that is returned.

> > If you don't have a family of functions, the interface has to be enhanced
> > to deal with the restrictions of the different SG formats.  I don't see
> > a per SG size limit in the API and this varies from device to device.
>
>A limit on the DMA segment size is a good suggestion ... (That's why I
>posted this now :-)  As such, I've added a "maxsegsz" argument to
>bus_dmamap_create() which specifies the maximum number of bytes
>that may be transferred by any given DMA segment.
>
>I don't understand how the interface needs to change to deal with
>different scatter/gather formats...

If it doesn't, you pay a performance and code reuse penalty.
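
As a sketch of what the per-segment size limit mentioned in the quoted
text implies for whoever builds the list: a physically contiguous range
longer than the limit has to be cut into several segments.  The names
below (struct seg, split_range) are hypothetical and the code is not
taken from any implementation; only the idea of a "maxsegsz" bound comes
from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment record used only for this illustration. */
struct seg {
	uint32_t addr;
	uint32_t len;
};

/*
 * Split one physically contiguous range into segments of at most
 * maxsegsz bytes each.  Returns the number of segments written, or 0
 * if maxsegsz is zero or more than maxsegs segments would be needed.
 * (A zero-length range also yields 0 segments.)
 */
static size_t
split_range(uint32_t addr, uint32_t len, uint32_t maxsegsz,
    struct seg *out, size_t maxsegs)
{
	size_t n = 0;

	assert(out != NULL);
	if (maxsegsz == 0)
		return 0;
	while (len > 0) {
		uint32_t chunk = len < maxsegsz ? len : maxsegsz;

		if (n == maxsegs)
			return 0;	/* device cannot take this many segments */
		out[n].addr = addr;
		out[n].len = chunk;
		n++;
		addr += chunk;
		len -= chunk;
	}
	return n;
}
```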

>Under _NO_ circumstances should
>a driver make any assumptions about the size and layout of the
>bus_dma_segment_t ...

With my proposal, the driver could only make assumptions about its
own SG format and wouldn't even have to know how to build it.

>I also don't see how this proposal (which has the effect of ripping
>the vtophys() and kvtop() calls out of the drivers) is really more
>of a lose than what we currently have.

It's slower.  You get the vtophys crap out of the drivers with my approach
too.  You also get all of the structure packing dependencies out of there.

> > I would also like to see the interface deal with DMA transactions that span
> > multiple contiguous KVAs (aka buffers).  I should be able to read tapes
> > from an SGI that use a 256k block size even if my tape drive is hanging
> > off an old 1542 (16 SG segments).  Heck, I should be able to read tapes
> > written on a NetBSD system with a 256k block size too. I don't see any
> > of this happening unless we can circumvent the MAXBSIZE/MAXPHYS limits
> > by spanning buffers in a single I/O transaction.
>
>Hmm ... interesting idea ... However, this means changing ... a number
>of things... For example, how does one have multiple bufs in the first
>place?  This could potentially mean changing the interface to
>a device's strategy routine (unless I'm missing something totally
>obvious)... Seems beyond the scope of this proposal.

Yes, you'd have to change a lot, but we shouldn't add yet another obstacle
to removing these limits.  I think that you could get 75% of the way there by
chaining buffers and modifying the strategy routines to traverse the list,
but fixing the buffer interface is something for another discussion.
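
Purely as an illustration of the chaining idea, a traversal might look
like the sketch below.  struct chained_buf and chain_total are
hypothetical; this is not the kernel's struct buf, and a real strategy
routine would map each buffer for DMA rather than just add up lengths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chained buffer; not the kernel's struct buf. */
struct chained_buf {
	void *data;			/* start of this buffer's KVA range */
	size_t count;			/* bytes in this buffer */
	struct chained_buf *next;	/* next buffer in the same transaction */
};

/*
 * Walk the chain the way a strategy routine might, returning the
 * total transfer size and storing the number of buffers in *nbufs.
 */
static size_t
chain_total(const struct chained_buf *bp, size_t *nbufs)
{
	size_t total = 0, n = 0;

	assert(nbufs != NULL);
	for (; bp != NULL; bp = bp->next) {
		total += bp->count;
		n++;
	}
	*nbufs = n;
	return total;
}
```

Four chained buffers of 64k each (an assumed per-buffer limit) would then
describe a single 256k tape block.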

> -- save the ancient forests - http://www.bayarea.net/~thorpej/forest/ -- 
>Jason R. Thorpe                                       thorpej@nas.nasa.gov
>NASA Ames Research Center                               Home: 408.866.1912
>NAS: M/S 258-6                                          Work: 415.604.0935
>Moffett Field, CA 94035                                Pager: 415.428.6939
>

--
Justin T. Gibbs
===========================================
  FreeBSD: Turning PCs into workstations
===========================================