Subject: Re: Machine-independent bus DMA interface proposal
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Justin T. Gibbs <gibbs@freefall.freebsd.org>
List: tech-kern
Date: 09/23/1996 22:21:54
>The basis of Justin's arguments is that it's somehow significantly
>more expensive.  However, he has _not_ provided any evidence that
>the additional overhead of copying the bus_dma_segment_t's address
>and length parameters into a device's scatter/gather list is anything
>more than negligible.
>
>As an exercise, I spent some time actually measuring the expense of
>doing it "my way" (a) vs. "his way" (b).
>
>Basically, I wrote 2 small C programs ... Each of them calls a function
>to fill an array of segments.  (a) then loops through the array of segments
>copying them manually to a new array of segments.

I'd need to see the source to judge whether this is a valid comparison.
Did you use malloc/free in your version?  What about the effect of the
extra memory consumption?  Any driver that wants to work well will want
to allocate the map once for each "transaction" (e.g. an SCB) instead of
paying that cost repeatedly.
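
For reference, the comparison as described in the quoted text can be
sketched roughly as below.  This is a reconstruction under assumptions,
not the actual test source (which was not posted): the segment layout,
the function names, and the page-sized fill pattern are all
hypothetical.  The intermediate array is preallocated, so the sketch
measures only the copy and not any malloc/free cost.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SEGS 1024

/* Hypothetical segment layout: 32-bit bus address and 32-bit length. */
struct seg {
    uint32_t addr;
    uint32_t len;
};

/* Stand-in for the mapping code: describe n page-sized segments. */
static void fill_segments(struct seg *segs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        segs[i].addr = (uint32_t)(0x100000u + i * 4096u);
        segs[i].len = 4096u;
    }
}

/*
 * (a): fill an intermediate array, then copy each entry into the
 * device's list.  Returns the entry count, or 0 if n is too large.
 */
static size_t build_with_copy(struct seg *dev, size_t n)
{
    static struct seg tmp[MAX_SEGS];    /* allocated once, reused */

    if (n > MAX_SEGS)
        return 0;
    fill_segments(tmp, n);
    for (size_t i = 0; i < n; i++) {
        dev[i].addr = tmp[i].addr;
        dev[i].len = tmp[i].len;
    }
    return n;
}

/* (b): fill the device's list directly, with no intermediate copy. */
static size_t build_direct(struct seg *dev, size_t n)
{
    if (n > MAX_SEGS)
        return 0;
    fill_segments(dev, n);
    return n;
}
```

Both variants leave the same contents in the device list; (a) simply
touches twice as much memory to get there.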

>For a 1024 segment loop (which is a corner case, but addresses the
>concern Justin has about large DMAs), the amount of time spent in (a)
>was not distinguishable from the amount of time spent in (b).  That
>is, it was so small that gmon could not measure it.
>
>a:
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
>  0.0       0.00     0.00        1     0.00     0.00  _foo [10]
>  0.0       0.00     0.00        1     0.00     0.00  _main [11]
>
>b:
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
>  0.0       0.00     0.00        1     0.00     0.00  _foo [10]
>  0.0       0.00     0.00        1     0.00     0.00  _main [11]

What kind of machine is this?  Certainly the profiler should be able
to discern the difference between one program copying 8k and the other
not.  You must have phenomenal memory bandwidth to achieve this result.
Either that or your profiler is broken.
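
The 8k figure is just the size of the list being copied.  A minimal
sketch of the arithmetic, assuming each segment is a 32-bit address
plus a 32-bit length (the struct and function names here are
hypothetical, not taken from either proposal):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed segment layout: 4 bytes of address, 4 bytes of length. */
struct seg {
    uint32_t addr;
    uint32_t len;
};

/* Bytes moved when a list of nsegs segments is copied once. */
static size_t sg_copy_bytes(size_t nsegs)
{
    return nsegs * sizeof(struct seg);
}
```

With that layout, sg_copy_bytes(1024) is 8192 bytes, i.e. 8k per call.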

>So, to simulate running over time, I wrapped the 1024 segment loop
>inside another loop that ran 10,000 times.  Here, you see the two
>diverge a little:
>
>a:
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
> 56.2       6.78     6.78        1  6780.00 12000.00  _main [2]
> 43.3      12.00     5.22    10000     0.52     0.52  _foo [3]
>
>b:
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
> 99.4       5.35     5.35    10000     0.54     0.54  _foo [3]
>  0.2       5.36     0.01        1    10.00  5360.00  _main [2]
>
>
>While "my way" is clearly a bit more expensive (6.64 extra seconds for
>10,000 corner-case calls)

Your version is more than twice as slow (12.00 vs. 5.36 seconds).

>, I don't see the relative cheapness of "his way"
>as a compelling argument to implement machine-dependent portions of
>otherwise totally machine-independent drivers, while the need for
>machine-independent bus DMA mapping still does not go away.  The alpha
>needs it, the ARC needs it, the pmax needs it, and even the i386 needs it.

I think it's time to review the proposals, just to make sure that we
are clear.

The current proposal is for an MI interface, implemented in a
machine-dependent fashion, to set up DMA transfers.  This interface
has the following features:

a) It forces each port to represent the SG segments in an MI format
that will certainly not match the format required by any driver and
in many cases will be neither adequate nor the most efficient
representation of the mapping for a particular port.  For example,
the x86 doesn't need any of this extra mapping for EISA or PCI, and
when doing bounce buffering it will probably stuff the parent buffer
into another hidden member in the bus_dma_segment_t.

b) It consumes on average double the amount of SG list space used by
other methods.

c) It forces an inefficient in-kernel copy of the SG list by each
individual driver.

My proposal:

a) Moves the SG format creation/packing stuff out of the drivers into,
most likely, an MI header file of inline functions.  This allows the
SG format code to be shared among drivers that share common formats.

b) Does not require each port to produce an interim SG format that
may be of no immediate use for either tracking the DMA mapping by
the bus_dma interfaces or for the actual DMA performed by the driver.

c) Gives each port full freedom to define its own data structures
for implementing these features.

d) Still provides an MI interface.

e) Promotes code re-use.
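
As a rough illustration of point (a), an MI header could carry one
inline packer per SG format, and the port's mapping code would write
each chunk straight into the driver's list.  Everything named below is
hypothetical (the struct, the packer, the page-size constant, and the
port-side loop); it is a sketch of the idea under the simplifying
assumption of a physically contiguous buffer, not an interface from
either proposal.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the sketch */

/* Hypothetical device SG format: 32-bit address, then 32-bit length. */
struct sg_addr_len {
    uint32_t addr;
    uint32_t len;
};

/* MI inline packer, shared by every driver that uses this format. */
static inline void sg_pack_addr_len(struct sg_addr_len *list, size_t i,
                                    uint32_t addr, uint32_t len)
{
    list[i].addr = addr;
    list[i].len = len;
}

/*
 * Port side: split [paddr, paddr + size) at page boundaries and pack
 * each chunk directly into the driver's list, with no interim format.
 * Returns the number of entries written, or 0 if the list is too small.
 */
static size_t port_load_addr_len(struct sg_addr_len *list, size_t max,
                                 uint32_t paddr, uint32_t size)
{
    size_t n = 0;

    while (size > 0) {
        /* Bytes remaining in the current page. */
        uint32_t chunk = SKETCH_PAGE_SIZE -
            (paddr & (SKETCH_PAGE_SIZE - 1));

        if (chunk > size)
            chunk = size;
        if (n == max)
            return 0;
        sg_pack_addr_len(list, n++, paddr, chunk);
        paddr += chunk;
        size -= chunk;
    }
    return n;
}
```

A second driver with the same hardware format would reuse
sg_pack_addr_len() unchanged; only a new format needs a new packer.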

>This is actually somewhat expensive ... you're talking about jumping
>through a function pointer every time you translate a kva to a bus
>physical address.  That also creates some unneeded spaghetti.

Most definitely.  I was thinking of a simple switch statement over the
formats a particular port supports, followed by an inline function call.
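
Something along these lines, as a minimal sketch.  The format tags,
struct layouts, and function names are all made up for illustration;
a real port would list only the formats it actually supports.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags for the SG formats this port supports. */
enum sg_format {
    SG_FMT_ADDR_LEN,    /* address word first, then length */
    SG_FMT_LEN_ADDR     /* length word first, then address */
};

struct sg_addr_len { uint32_t addr; uint32_t len; };
struct sg_len_addr { uint32_t len; uint32_t addr; };

static inline void sg_pack_addr_len(void *list, size_t i,
                                    uint32_t addr, uint32_t len)
{
    struct sg_addr_len *sg = list;

    sg[i].addr = addr;
    sg[i].len = len;
}

static inline void sg_pack_len_addr(void *list, size_t i,
                                    uint32_t addr, uint32_t len)
{
    struct sg_len_addr *sg = list;

    sg[i].len = len;
    sg[i].addr = addr;
}

/*
 * Dispatch on the format with a switch rather than a function pointer,
 * so the compiler can inline the packer.  Returns 0 on success, -1 if
 * the port does not support the requested format.
 */
static inline int sg_pack(enum sg_format fmt, void *list, size_t i,
                          uint32_t addr, uint32_t len)
{
    switch (fmt) {
    case SG_FMT_ADDR_LEN:
        sg_pack_addr_len(list, i, addr, len);
        return 0;
    case SG_FMT_LEN_ADDR:
        sg_pack_len_addr(list, i, addr, len);
        return 0;
    }
    return -1;
}
```

When the format is a compile-time constant at the call site, the switch
should fold away and leave just the inlined stores.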

> -- save the ancient forests - http://www.bayarea.net/~thorpej/forest/ -- 
>Jason R. Thorpe                                       thorpej@nas.nasa.gov
>NASA Ames Research Center                               Home: 408.866.1912
>NAS: M/S 258-6                                          Work: 415.604.0935
>Moffett Field, CA 94035                                Pager: 415.428.6939

--
Justin T. Gibbs
===========================================
  FreeBSD: Turning PCs into workstations
===========================================