tech-net archive


Re: tlp(4) DMA synchronization



dyoung%pobox.com@localhost wrote:

> > These ops are required on systems which don't handle BUS_DMA_COHERENT.
> 
> According to the documentation, we cannot count on BUS_DMA_COHERENT to
> do anything, so the ops are always required. :-)

Yes, we should always call sync ops after touching DMA descriptors.
But in fact only a few drivers do it properly, which means
most drivers rely on BUS_DMA_COHERENT (or cache-coherent hardware).

For example, wm(4) and re(4) won't work without BUS_DMA_COHERENT
on sgimips O2, even though the latter may have all necessary
bus_dmamap_sync(9) calls (I believe ;-).

> > But strictly speaking, on such systems we'd have to use chained mode
> > (not ring mode) with proper padding between DMA descriptors,
> > i.e. one descriptor per cacheline to avoid host vs DMA race.
> 
> You're right, of course.  I have seen these races occur on architectures
> that we aim to support, such as ARM.

Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?

> I think that in principle, the host can use ring mode if it does not reuse
> a descriptor until after the NIC has relinquished every other descriptor
> in the same cacheline.

Consider the following scenario:

(1) rxdescs[0].td_status in rxintr is polled and cached
(2) the received packet for rxdescs[0] is handled
(3) rxdescs[0] data in cacheline is updated for the next RX op
    in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
(4) rxdescs[0] data in the cacheline is written back and invalidated
    by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()

If the cacheline size is larger than sizeof(rxdescs[0])
(i.e. the same cacheline also holds rxdescs[1])
and the device updates rxdescs[1] (to clear TDSTAT_OWN)
between (1) and (4), that update will be lost when the
stale cacheline is written back at (4).
We can put a PREREAD sync op before (3), but the race can still
happen between (3) and (4), because the write at (3) allocates
the cacheline again.
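
To make the lost update concrete, here is a toy replay of steps (1)-(4)
in plain C. It is only a sketch: "mem" stands in for main memory as the
device sees it and "cache" for the host's copy of one cacheline. The
16-byte descriptor, the 32-byte cacheline, and all toy_* / TOY_* names
are assumptions for illustration, not the tlp(4) definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_DESC_SIZE 16u         /* assumed: four 32-bit words */
#define TOY_LINE_SIZE 32u         /* assumed: two descriptors per line */
#define TOY_OWN       0x80000000u /* stand-in for TDSTAT_OWN */
#define TOY_DONE      0x00000300u /* made-up "packet received" status */

struct toy_desc {
	uint32_t status;
	uint32_t ctl;
	uint32_t bufaddr1;
	uint32_t bufaddr2;
};

/* Adjacent descriptors that share one cacheline. */
struct toy_line {
	struct toy_desc d[TOY_LINE_SIZE / TOY_DESC_SIZE];
};

/*
 * Replay (1)-(4) with the device completing rxdescs[1] in between.
 * Returns the status of rxdescs[1] as found in memory afterwards.
 */
uint32_t
toy_replay_race(void)
{
	struct toy_line mem;   /* main memory, as seen by the device */
	struct toy_line cache; /* host's cached copy of the same line */

	memset(&mem, 0, sizeof(mem));
	mem.d[0].status = TOY_DONE; /* desc 0 already handed back */
	mem.d[1].status = TOY_OWN;  /* desc 1 still owned by device */

	/* (1) host polls desc 0: the whole line is fetched */
	cache = mem;

	/* device receives the next packet and clears OWN on desc 1 */
	mem.d[1].status = TOY_DONE;

	/* (3) host re-initializes desc 0 in the cache; line is dirty */
	cache.d[0].status = TOY_OWN;

	/* (4) writeback: the entire line goes out, stale desc 1 too */
	mem = cache;

	return mem.d[1].status;
}
```

After the replay, rxdescs[1] in memory has the OWN bit set again, so
the host would never notice the packet the device already completed.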

I think this could happen in normal operation because
the next RX packet is likely to arrive right after
the RX interrupt caused by the previous packet.

> We may be able to use cacheline-aligned descriptors in ring mode if the
> chip respects the Descriptor Skip Length (DSL) field of the Bus Mode
> Register.  According to the datasheet, the DSL field "Specifies the
> number of longwords to skip between two unchained descriptors."
> 
> What do you think?

In a perfect world, it should work ;-)
(though I have not checked the datasheet yet)
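
If the DSL field behaves as the quoted text says, the skip length for
one descriptor per cacheline could be computed roughly as below. This
is a sketch only: the 16-byte descriptor size and the toy_* name are
assumptions, and it does not check the result against the width of
the real DSL field, which I have not verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DESC_SIZE 16u /* assumed: four 32-bit words per descriptor */

/*
 * Number of 32-bit longwords to skip after each descriptor so that
 * the next one starts on its own cacheline.
 * Returns -1 if cacheline is not a power of two.
 */
int
toy_dsl_longwords(size_t cacheline)
{
	if (cacheline == 0 || (cacheline & (cacheline - 1)) != 0)
		return -1;
	if (cacheline <= TOY_DESC_SIZE)
		return 0; /* descriptors already start on line boundaries */
	return (int)((cacheline - TOY_DESC_SIZE) / sizeof(uint32_t));
}
```

For a 32-byte cacheline this gives 4 longwords, and for a 64-byte
cacheline 12 longwords.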

In the real world, we need the following changes:

- prepare a new MI API which returns the maximum cache line size
  for each architecture, at least on ports which have bus_dma(9)
- store the cache line size value in tlp_softc
- calculate the size of the whole TX/RX descriptor area (and the memory
  for setup packets) dynamically on attach
  (we can't use static sizeof(struct tulip_control_data))
- replace all macros which use offsetof(struct tulip_control_data, x)
  to get the region of each descriptor, including all sync ops
- for the tlp(4)-specific problem, make sure all dumb clones properly
  support chained mode or the DSL field
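
A rough sketch of what the dynamic layout could look like, as plain C
that can be tried in userland. All toy_* names are hypothetical and
this is not the tlp(4) or iee(4) code. The cacheline size is assumed
to come from the new MI API and to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Runtime replacement for offsetof(struct tulip_control_data, x). */
struct toy_layout {
	size_t stride;   /* bytes from one descriptor to the next */
	size_t txoff;    /* offset of TX descriptor 0 */
	size_t rxoff;    /* offset of RX descriptor 0 */
	size_t setupoff; /* offset of the setup packet */
	size_t total;    /* size of the whole control data area */
};

/* Round x up to a multiple of align (align must be a power of two). */
static size_t
toy_roundup(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Compute the layout once, at attach time. */
void
toy_layout_init(struct toy_layout *l, size_t cacheline, size_t descsize,
    size_t ntx, size_t nrx, size_t setupsize)
{
	assert(cacheline != 0 && (cacheline & (cacheline - 1)) == 0);

	l->stride = toy_roundup(descsize, cacheline);
	l->txoff = 0;
	l->rxoff = l->txoff + ntx * l->stride;
	l->setupoff = l->rxoff + nrx * l->stride;
	l->total = l->setupoff + toy_roundup(setupsize, cacheline);
}

/* Offset of TX descriptor i within the control data area. */
size_t
toy_txdesc_off(const struct toy_layout *l, size_t i)
{
	return l->txoff + i * l->stride;
}

/* Offset of RX descriptor i within the control data area. */
size_t
toy_rxdesc_off(const struct toy_layout *l, size_t i)
{
	return l->rxoff + i * l->stride;
}
```

The offset from toy_rxdesc_off() and the stride would then be the
offset and len arguments of bus_dmamap_sync(9) for a single
descriptor, so each sync op covers exactly one cacheline-sized slot.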

One example of such an implementation is in sys/dev/ic/i82596.c for iee(4),
but I'm afraid it's much easier to implement machine dependent
uncached mapping support for BUS_DMA_COHERENT if hardware supports it.
I guess that's the reason why many drivers still lack proper
bus_dmamap_sync(9) calls for DMA descriptors even nowadays.

(note that iee(4), which uses direct DMA with complex sync ops, seems
 slower on hp700 than the old ie(4), which uses a fixed DMA buffer and copies)
---
Izumi Tsutsui

