Subject: Re: Machine-independent bus DMA interface proposal
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Dennis Ferguson <dennis@jnx.com>
List: tech-kern
Date: 09/24/1996 09:51:10
> On Mon, 23 Sep 1996 14:53:40 -0700 
>  Dennis Ferguson <dennis@jnx.com> wrote:
> 
>  > What I think Justin is complaining about is that the machine-independent
>  > format used for parameter passing becomes "much more" expensive (relatively)
>  > for buses which would otherwise require no bus-dependent state be kept
>  > at all, as it requires you to allocate something when you could get
>  > by with nothing (since allocating something is always many times more
>  > expensive than allocating nothing).  Since buses which require no
>  > bus-dependent mapping state are quite common, this does cost.
> 
> The basis of Justin's arguments is that it's somehow significantly
> more expensive.  However, he has _not_ provided any evidence that
> the additional overhead of copying the bus_dma_segment_t's address
> and length parameters into a device's scatter/gather list is anything
> more than negligible.
> 
> As an exercise, I spent some time actually measuring the expense of
> doing it "my way" (a) vs. "his way" (b).

Note that what you're measuring doesn't include what I indicated was
"expensive".  I don't think parameter-passing through an array is
all that "expensive", but what I do think is expensive is allocating
and freeing the array for the sole purpose of passing the parameters.
If you include a malloc() and free() in your loop you might get a
better estimate of the "expensive" case.

Again, for some buses, for example a PCI bus on a PC, there is no need
to carry bus-dependent state through the DMA.  This means you are
allocating memory solely for the purpose of passing the scatter/gather
list up to the device driver.  You need to include the cost of allocating
and freeing that memory as part of the cost of passing the list this
way.
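
To make the suggestion concrete, here is a minimal user-space sketch of
what one iteration of the measured loop might look like with the
allocation included.  The names (struct sg_seg, pass_by_array) are
hypothetical and not taken from the proposal, and the user-space
malloc()/free() stand in for whatever allocator a kernel would use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical address/length pair; not the proposal's bus_dma_segment_t. */
struct sg_seg {
    unsigned long addr;
    unsigned long len;
};

/*
 * One transfer done the array-passing way, with the cost of the
 * temporary array included: allocate, fill, copy to the device list,
 * free.  Returns 0 on success, -1 if the allocation fails.
 */
static int pass_by_array(struct sg_seg *dev, size_t n)
{
    struct sg_seg *tmp;
    size_t i;

    assert(n > 0);
    tmp = malloc(n * sizeof(*tmp));   /* exists only to pass parameters */
    if (tmp == NULL)
        return -1;

    for (i = 0; i < n; i++) {         /* stand-in for the s/g generator */
        tmp[i].addr = 0x1000UL * i;
        tmp[i].len = 0x1000UL;
    }
    for (i = 0; i < n; i++)           /* driver copies into its own list */
        dev[i] = tmp[i];

    free(tmp);
    return 0;
}
```

Timing this in a loop, rather than the fill and copy alone, would be
closer to the case described above.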

>  > Given that the machine-independent state is only needed for parameter
>  > passing, it seems to me that there are other possibilities which
>  > eliminate the machine-independent state altogether.  For example,
>  > in bus_dmamap_load() allow the driver to specify two additional
>  > arguments, an opaque buffer into which the driver-dependent data
>  > will be formatted and a procedure handle (into the driver code)
>  > which is called with each address/length generated by the s/g code.
> 
> This is actually somewhat expensive ... you're talking about jumping
> through a function pointer every time you translate a kva to a bus
> physical address.  That also creates some unneeded spaghetti.

Umm, "somewhat expensive" compared to what (this after criticism of
another's failure to measure his assertion)?  In keeping with the "my way"
versus "his way" I compared program `a', which calls a routine to
fill in a 1024 element scatter/gather list and then copies it to a
second list, looping 10,000 times, to `b', which calls the routine
with a function pointer and fills in the second array directly (without
the copy) by multiple function calls.  Compiled them with -O0 -pg.
The results:

a:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 70.7       2.70     2.70        1  2700.00  3820.00  _main [2]
 29.3       3.82     1.12    10000     0.11     0.11  _foo [3]

b:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 64.5       4.18     4.18                             mcount (8)
 19.0       5.41     1.23    10000     0.12     0.23  _foo [2]
 16.5       6.48     1.07 10240000     0.00     0.00  _bar [4]
  0.0       6.48     0.00        1     0.00  2300.00  _main [3]

That is, it's about 40% faster to call the function and avoid the
copying (2.30s versus 3.82s total for main).  Don't let the profiling
overhead fool you; here are the unprofiled -O0 results:

skank% time ./a
3.797s real  3.776s user  0.019s system  99% ./a
skank% time ./b
2.492s real  2.481s user  0.010s system  99% ./b

More than this, I've underestimated `a's cost by also not including the
malloc() and free() of the first array.
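
The two programs are not reproduced in this message, so what follows is
only a sketch of their shape, assuming a simple address/length pair per
element.  All names here (struct seg, variant_a, variant_b, store_seg)
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NSEGS 1024

/* Hypothetical scatter/gather element. */
struct seg {
    unsigned long addr;
    unsigned long len;
};

/* Variant `a': the generator fills an intermediate list ... */
static void fill_list(struct seg *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        list[i].addr = 0x1000UL * i;
        list[i].len = 0x1000UL;
    }
}

/* ... and the caller then copies it into the device's own list. */
static void variant_a(struct seg *tmp, struct seg *dev, size_t n)
{
    size_t i;

    fill_list(tmp, n);
    for (i = 0; i < n; i++)
        dev[i] = tmp[i];
}

/* Variant `b': the generator hands each element to a callback. */
typedef void (*seg_cb)(void *arg, size_t i,
                       unsigned long addr, unsigned long len);

static void fill_cb(seg_cb cb, void *arg, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        cb(arg, i, 0x1000UL * i, 0x1000UL);
}

/* Callback that writes straight into the device list, with no copy. */
static void store_seg(void *arg, size_t i,
                      unsigned long addr, unsigned long len)
{
    struct seg *dev = arg;

    assert(i < NSEGS);
    dev[i].addr = addr;
    dev[i].len = len;
}

static void variant_b(struct seg *dev, size_t n)
{
    fill_cb(store_seg, dev, n);
}
```

Both variants leave the same contents in the device list; they differ
only in whether an intermediate array sits between the generator and
the driver.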

And this is the corner case.  As you pointed out in another message:

> Now, if we take a look at the more real-world situations, transferring
> a file system block (typically 4k or 8k), on the i386 we're talking
> about a max of 2 scatter/gather descriptors, which in the case of
> 32-bit address/32-bit length, totals out to 16 bytes.

so to avoid two "somewhat expensive" function calls you are not only doing
a slightly more expensive copy, but also calling malloc() to fetch a 16 byte
chunk of memory to write the results into, and free() to free the memory when
you're done (for buses where you wouldn't otherwise have to allocate
anything to arrange for a transfer)?

In any case, let me restate my problem more succinctly.  Some buses on
some machines require you to carry bus-dependent mapping state through
a DMA, some do not.  For the latter type, the only useful thing that the
four or six function calls you make to do a DMA will accomplish is
virtual-to-physical address translation, so that you can form a
device-dependent scatter/gather request.  What I am suggesting is that
the cost of doing this with your current proposal is going to be largely
dominated by the need to malloc() a buffer for nothing other than
parameter passing, and to free() it afterwards.

I think a better machine-independent DMA interface should not force
implementations for buses which don't require the allocation of bus
resources to allocate and free memory across these calls, nor should
they have to hold preallocated chunks of memory for parameter-passing
purposes.  For buses for which the operations are extraneous it should
be possible to implement everything other than bus_dmamap_load() as
a NOP.  If you don't like the function pointer, maybe you can think
up something else which accomplishes the same goal.
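
As a rough illustration of the function-pointer variant for a bus that
keeps no mapping state, consider the sketch below.  Nothing in it is
from the proposal: sketch_dmamap_load(), sketch_vtophys(), the
descriptor layout, and the 4096-byte page size are all assumptions, and
the identity translation merely stands in for a real virtual-to-physical
lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */
#define SKETCH_MAX_DESC  4

/* Driver-supplied routine, called once per address/length generated. */
typedef void (*sketch_seg_fn)(void *cookie, uintptr_t busaddr, size_t len);

/* Stand-in for virtual-to-physical translation (identity here). */
static uintptr_t sketch_vtophys(uintptr_t va)
{
    return va;
}

/*
 * Hypothetical load routine: walks the buffer page by page and hands
 * each piece to the driver.  No memory is allocated or freed.
 * Returns the number of segments generated.
 */
static int sketch_dmamap_load(uintptr_t va, size_t len,
                              sketch_seg_fn fn, void *cookie)
{
    int nsegs = 0;

    while (len > 0) {
        size_t chunk = SKETCH_PAGE_SIZE - (size_t)(va % SKETCH_PAGE_SIZE);

        if (chunk > len)
            chunk = len;
        fn(cookie, sketch_vtophys(va), chunk);
        va += chunk;
        len -= chunk;
        nsegs++;
    }
    return nsegs;
}

/* Hypothetical device descriptor: 32-bit address, 32-bit length. */
struct sketch_desc {
    uint32_t addr;
    uint32_t len;
};

struct sketch_sglist {
    struct sketch_desc desc[SKETCH_MAX_DESC];
    int used;
};

/* Driver callback: formats each segment directly in device layout. */
static void sketch_add_desc(void *cookie, uintptr_t busaddr, size_t len)
{
    struct sketch_sglist *sg = cookie;

    assert(sg->used < SKETCH_MAX_DESC);
    sg->desc[sg->used].addr = (uint32_t)busaddr;
    sg->desc[sg->used].len = (uint32_t)len;
    sg->used++;
}

/* On such a bus the remaining operations could be empty. */
static void sketch_dmamap_unload(void)
{
}
```

A driver would keep its struct sketch_sglist wherever it already keeps
per-transfer state and pass its address as the cookie, so the load path
touches no allocator at all.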

Dennis Ferguson