Subject: Re: FreeBSD Bus DMA (was Re: AdvanSys board support)
To: Jonathan Stone <jonathan@DSG.Stanford.EDU>
From: Justin T. Gibbs <gibbs@plutotech.com>
List: tech-kern
Date: 06/11/1998 11:30:37
>
>Justin,
>
>I don't see how this answers the specific case I asked about at all.
>Let me try again.  Suppose we have the following fictional case, closely
>based on a real one with some names changed to something you may
>be more familiar with.
>
>Suppose we have an Alpha CPU with a physical address space greater
>than 4 Gbytes.  The Alpha CPU has a PCI bus attachment.  To DMA any
>datum over the PCI bus, any datum *at all*, we must set a hardware
>mapping register _in the bus controller_. The PCI bridge has mapping
>registers for each 8K page of PCI memory space. Each register maps an
>8K page of bus-address-space used for the request (on the PCI bus) to
>a specific system memory address.  If that mapping isn't set up ahead
>of time, we get a bus error and/or a machine check.  This bus-adaptor
>mapping is independent of and orthogonal to any "scatter/gather"
>mapping that goes on in the host.

Sure.

>The real example I'm asking about is even worse:
>the topology looks like
>
>    CPU <-- bridge 1 --> 32-bit bus <-- bridge 2 --> 16-bit bus
>
>where there are two independent mappings, with different pagesizes,
>one on each of the bridges.  Imagine that there's an ISA bridge on the
>far side of the PCI bus, and that ISA bridge has its *own* mapping
>registers, for 4k pages, for its entire 16meg DMA space. Again, no DMA
>to or from ISA space is possible without setting up a mapping from ISA
>address to PCI address, and from that PCI address to system addresses.

No problem.

>On systems like this, you can't _ever_ get away with using a
>"default map".

For transactions to targets on that bus, yes, this is true.  Is it the case
for all pathways to all devices on all busses that use bus dma in the 
system?  Why preclude the optimization in cases that can use it if it 
doesn't prevent you from handling the other cases?  This seems like a 
simple software design principle to me.

>Perhaps the special case of a linear, no-op mapping from bus addresses
>to system memory addresses comes up often enough that it's worth
>optimizing for.

On x86 platforms it happens in all cases except for an ISA bus-mastering
device in a system with more than 16MB of memory.

>But as far as I can see, changing the API to assume
>there's a "default object" and to use callbacks, just doesn't work.

Where does the API indicate that a "default object" is in use?  The point
is that the objects are opaque to the client.  They could receive a shared
object or an object specially tailored to the individual client's needs.
I only indicated how one implementation for a particular type of I/O path
was able to take advantage of this "opaqueness" to reduce resource 
utilization and simplify the implementation.
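
To make the opaqueness concrete, a rough sketch (abbreviated; the exact
declarations in the tree may differ):

    /* The client only ever sees opaque handles. */
    typedef struct bus_dma_tag  *bus_dma_tag_t;  /* client's constraints */
    typedef struct bus_dmamap   *bus_dmamap_t;   /* one mapping */

    /*
     * What sits behind these pointers is the implementation's
     * business: it may be one shared, system-wide object, or an
     * object tailored to this client's constraints.  The client
     * cannot tell, and does not need to.
     */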

You still haven't pointed out how a callback precludes you from performing
a mapping in any case.  I have pointed out how it saves memory resources,
and I still maintain that it does nothing to damage the portability of
the interface.

>>> Am I misunderstanding what you mean by
>>>hierarchy?
>
>>Most likely.  Perhaps my example above clears this up.
>
>No, if anything it makes me more confident that I'm not misunderstanding.
>One basic problem I keep coming back to here is bus adaptors which
>require DMA-mapping setup. I haven't seen you address those cases.
>Am I missing something?

Yes.  You are confusing API and implementation.  How you set up the
hardware to make the mapping occur is irrelevant to the client so long as
you honor the restrictions it imposed when creating its tag.

>>In order to cut the space, you must move to a callback to win in all cases.
>
>Yes. But there are machines where the above _has_ to be done, for all
>transfers, even for devices where (in your worldview) there 
>
>	   "is no S/G map"
>
>and so the information to construct the relevant mappings simply
>doesn't exist.

We're talking around each other.  Let's try this again.

No matter what type of mapping operation you perform, there is always an
S/G map provided to the client.  In some cases, that map may contain only
a single entry, but it is still information that the client requires in
order to get its job done.

The bus dma implementation is provided an object (buffer, mbuf, etc.) to
map into the client's world view, from which it constructs an S/G list
indicating the mapping of that buffer.  It is also provided a tag and dma
map object that the implementation constructed to match the client's
specified constraints.  During tag construction, map construction, or load
operations, the implementation is free to allocate and store whatever data
it sees fit in these opaque objects.  It can tweak mapping registers.  It
can bounce data.  It can do whatever it needs to do in order to satisfy
the client's request.
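
In code, the client's side looks roughly like this (a sketch; the
argument values and the softc field names are made up, and the real
prototypes may differ in detail):

    static void mydev_dmamap_cb(void *arg, bus_dma_segment_t *segs,
                                int nseg, int error);

    /* Once, at attach time: describe this device's constraints. */
    error = bus_dma_tag_create(parent_tag,
                /* alignment */ 1, /* boundary */ 0,
                /* lowaddr   */ BUS_SPACE_MAXADDR_32BIT,
                /* highaddr  */ BUS_SPACE_MAXADDR,
                /* filter    */ NULL, NULL,
                /* maxsize   */ MAXBSIZE,
                /* nsegments */ 256,
                /* maxsegsz  */ BUS_SPACE_MAXSIZE_32BIT,
                /* flags     */ 0, &sc->dmat);

    /* ...and get an opaque dma map to go with the tag. */
    error = bus_dmamap_create(sc->dmat, 0, &sc->dmamap);

    /*
     * Per transaction: hand the buffer to the implementation.
     * Behind this call it may program bridge mapping registers,
     * set up bounce pages, or do nothing at all; the client
     * neither knows nor cares.
     */
    error = bus_dmamap_load(sc->dmat, sc->dmamap, buf, buflen,
                            mydev_dmamap_cb, sc, /* flags */ 0);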

When the mapping request is satisfied, the client's callback function is
called with an S/G list of one or more elements.  Although the
implementation may store this list in a tag, dma map, or other static
object, the client is not allowed to assume this.  If the client needs
this address information outside the lifetime of its callback, it must
retain it in a client-specified manner.
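
So a client that needs the addresses beyond the callback simply copies
them out, e.g. (the hw_sg format here is a made-up stand-in for
whatever the hardware actually wants):

    static void
    mydev_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
    {
            struct mydev_softc *sc = arg;
            int i;

            if (error != 0) {
                    /* fail the transaction */
                    return;
            }
            /*
             * 'segs' is only guaranteed valid for the duration of
             * this callback, so convert it now into the driver's
             * own (often hardware-defined) S/G format.
             */
            for (i = 0; i < nseg; i++) {
                    sc->hw_sg[i].addr = segs[i].ds_addr;
                    sc->hw_sg[i].len  = segs[i].ds_len;
            }
            sc->hw_sg_count = nseg;
    }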

>I see two problems here:
>
>1) In your world, (again, if I understand it correctly) the device
>driver can decide that since _it_ doesn't need a map for S/G purposes,
>it needn't construct one at all.  As above, this just doesn't work.

Not at all.  The device driver always constructs a map; it simply doesn't
know what that object contains.  It supplies the map and tag it was given
by the appropriate bus dma allocation routines to the APIs that require
them.  What the implementation stuffs into these opaque objects is the
implementation's business.
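
For instance, one implementation for a bridged I/O path might stash
something like this behind the opaque map (purely illustrative; every
field here is invented for the example):

    /* Private to one bus dma implementation; clients never see inside. */
    struct bus_dmamap {
            /* bounce buffer state, if this path has to bounce */
            void            *bounce_kva;
            bus_addr_t       bounce_busaddr;
            /* bridge mapping-register bookkeeping, if any */
            int              first_mapreg;
            int              nmapregs;
            /* ...whatever else this particular path requires... */
    };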

>I don't see any way to square that with the original claim, that (in
>comparison to your reworked system), the NetBSD interface
>
>    }sacrifices speed and memory resources for absolutely no gain in
>    }portability.
>
>Am I missing something here?

Yes.

>2) I just don't think the callbacks really cut it.  You're trading
>   space for time, on the assumption that most of the time, the
>   address mapping required by the bus adaptor (e.g., host bridge) is
>   the identity mapping (system memory addresses and bus addresses for
>   DMA are the same).  That may be a good assumption for x86es, but
>   it's just not a valid assumption for the machines NetBSD runs on.

No.  You've completely missed the point.  My assumption is that, in the
common case of mapping a buffer or mbuf into bus space, the S/G list will
be on the order of 1 element per page of mapped data.  This adds up to
a significant amount of space (2k of S/G list for a 1MB mapping, assuming
a 4k page size and a 32-bit address and count) for each transaction, in a
format that most clients must convert into a more convenient one anyway.
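
The arithmetic:

    1MB mapping / 4k pages                 = 256 S/G elements
    256 elements * 8 bytes (addr + count)  = 2k of S/G list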

>>The
>>AdvanSys controllers, for instance, simply PIO their S/G list directly to
>>the card (not a great design, but that's what it is) so no static storage
>>of any type is wanted.
>
>On an x86, perhaps. But there are other machines where static storage
>_is_ needed, because even this kind of device *Just Wont Work* unless
>you also set up mapping registers in the CPU-to-IObus bridge, or
>bridges, with the DMA address used by the transfer. And, possibly,
>tear them down when it's done.
>
>(Sorry to keep hitting on this, but I did say it before, and the point
>seems not to have gotten through.)

You are confusing the lifetime of the mapping with the lifetime of the
data structure that exports that mapping in the format of an S/G list.
The two are completely independent.

>> If you are willing to force a single S/G copy, you
>>would have to export the S/G list format in some way into the MI code so
>>that it could be constructed properly.  This could turn nasty.
>
>Yes.  As I keep saying, there are systems where you have *no choice*
>but to do this. And it could indeed turn very nasty, especially if the
>I/O topology is such that you need to walk the dmamap more than
>once. I know of two or three systems that _need_ that, just off the top
>of my head.  I think in your design, that just doesn't work at all
>(due to the "lifetime" restrictions of the callback).  Is that right,
>or did I read too fast?
>
>So, I think the right way to avoid the "nastiness" is to live with the
>MI representation of a dmamap.

Uhh.  The dma map in NetBSD is not MI.  A portion of it, containing the
S/G list and some other info, is MI, but the rest is at the discretion of
the particular implementation for that bus dma path.
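
NetBSD's bus_dma(9) reflects this split; roughly (a sketch of the shape
only, since the exact fields vary by port):

    struct bus_dmamap {
            /* MI portion: what clients may look at */
            bus_size_t              dm_mapsize;
            int                     dm_nsegs;
            bus_dma_segment_t       dm_segs[1];     /* variable size */
            /*
             * MD portion: implementation-private fields follow,
             * conventionally prefixed with _dm_.
             */
    };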

I think you are really confused about what I'm trying to say.

>(BTW, the dmamap isn't a "S/G". It can be _used_ for that, but it may
>also be needed for setting mapping registers in a host bus adaptor.
>Which seem to be outside your definition of when an "S/G" is
>necessary.  Maybe I'm wrong, but could using the "S/G" terminology be
>clouding some of these issues?)

You somehow believed that the FreeBSD implementation wasn't allocating
dma maps.  The client always asks for a dma map to be allocated on its
behalf.  What the dma map represents is up to the implementation.

>>The CAM SCSI layer is quite paranoid about keeping the order of
>>transactions the same as that specified by the client.  
>
>So the SCSI CAM layer blocks?  Good for it.  But my question was
>specifically asking about network interfaces and the network
>subsystem.  If the SCSI drivers or NIC drivers block, they could use
>the WAITOK flag to NetBSD's bus_dma interface and so serialize their
>memory requirements.  That could be achieved inside the bus-dma layer
>for a given host/bus combination. What is it I'm missing here?

You're missing requests that occur from an interrupt context.  This
happens all the time in both the networking code and CAM.
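
A client in interrupt context cannot sleep waiting for map resources.
This is exactly where the callback pays off: the load can complete
asynchronously instead.  A sketch (reusing the hypothetical names from
above):

    /* Called from an interrupt handler; sleeping is not an option. */
    error = bus_dmamap_load(sc->dmat, sc->dmamap, buf, buflen,
                            mydev_dmamap_cb, sc, /* flags */ 0);
    if (error == EINPROGRESS) {
            /*
             * Resources (e.g. bounce pages) are short right now.
             * The implementation has queued the request and will
             * invoke mydev_dmamap_cb later, when they free up,
             * without this client ever having to block.
             */
            return;
    }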

>Aside: 
>
>I'm a bit puzzled about the 256k for a 1Mbyte transfer with a 1542,
>and the claim that FreeBSD cuts that in half.  I get 8 bytes of DMAmap
>per 4k page, which at 256 entries for 1 Mbyte, comes to 64k
>bytes. That doesn't seem like a horrendous overhead, for that big a
>transfer, on top of what the controller itself needs.  (And note that
>in this case the dmamap needn't be in <= 16M memory.)

I misspoke earlier.  I was referencing the ahc driver, which uses an
8-byte S/G entry (currently; it may move to 64-bit pointers for the cards
that can handle dual address PCI cycles), and noted that mapping a 1MB
buffer would require 256 entries on a system with a 4k page size.  This
would mean 2k in the dma map object and 2k in the ahc driver's own data
structures per transaction.  Multiply that out by 255 possible concurrent
transactions (pretty easy to achieve with CAM), and you have 512k of S/G
space in FreeBSD vs. 1MB of S/G space in NetBSD.  Mapping large objects
will become more and more critical as network and disk subsystems
increase in speed, so wanting to map a 1MB buffer is not that outlandish.
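
Worked out:

    256 elements * 8 bytes                 =   2k per 1MB mapping
    255 transactions * 2k (driver copy)    = ~512k  (FreeBSD)
    255 transactions * (2k map + 2k copy)  =  ~1MB  (NetBSD)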

--
Justin