tech-net archive

Re: tlp(4) DMA synchronization



dyoung%pobox.com@localhost wrote:

> > > According to the documentation, we cannot count on BUS_DMA_COHERENT to
> > > do anything, so the ops are always required. :-)
> > 
> > Yes, we should always call sync ops after touching DMA descriptors.
> > But in fact a few drivers do it properly, and it means
> > most drivers rely on BUS_DMA_COHERENT (or cache coherent hardware).
> 
> The drivers rely on BUS_DMA_COHERENT, and BUS_DMA_COHERENT cannot be
> relied on.  It is a sad state of affairs. :-/

Yes. The problem is a tradeoff among hardware cost,
software complexity, and overall performance, but
it looks like non-coherent DMA systems often run into trouble.
For example:
http://mail-index.NetBSD.org/port-sgimips/2000/06/29/0006.html

> > Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?
> 
> I don't know.  You mentioned wm(4) and re(4).  Do all of the ports where
> those drivers will break without BUS_DMA_COHERENT provide the uncached
> mappings?

On ports whose cache systems don't handle DMA (by bus snooping etc.), yes.
At least there were problems on the O2:
http://mail-index.NetBSD.org/port-sgimips/2008/01/20/msg000022.html
http://mail-index.NetBSD.org/source-changes/2006/10/20/msg176308.html
(though the problem there was the cacheline size vs. descriptor size
 issue mentioned below, not the drivers themselves)

> > > I think that in principle, the host can use ring mode if it does not reuse
> > > a descriptor until after the NIC has relinquished every other descriptor
> > > in the same cacheline.
> > 
> > Consider the following scenario:
> > 
> > (1) rxdescs[0].td_status in rxintr is polled and cached
> > (2) the received packet for rxdescs[0] is handled
> > (3) rxdescs[0] data in cacheline is updated for the next RX op
> >     in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
> > (4) rxdescs[0] data in the cacheline is written back and invalidated
> >     by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()
> > 
> > If the cachelinesize is larger than sizeof rxdescs
> > (i.e. the same cacheline also fetches rxdescs[1])
> > and rxdescs[1] for the next descriptor is being updated
> > (to clear TDSTAT_OWN) by the device between (1) and (4),
> > the updated data will be lost by the writeback op at (4).
> > We can put a PREREAD sync op before (3), but race could still
> > happen between (3) and (4) by write allocate at (3).
> 
> That is just the scenario that I had in mind.  I think that we can use
> ring mode and avoid that scenario, if we postpone step (3) until the NIC
> is finished with the rest of the Rx descriptors in the same cacheline,
> rxdescs[1] through rxdescs[descs_per_cacheline - 1].

Hmm, it might work for RX, which uses one descriptor per packet.
On the other hand, TX packets might use multiple descriptors to handle
fragmentation (and the count would not be a multiple of
descs_per_cacheline), so I'm not sure we can handle that properly.
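As a minimal sketch of the ring-mode idea above (not the actual tlp(4)
code): hand a cacheline's worth of Rx descriptors back to the NIC only
once the NIC has relinquished every descriptor in that line.
CACHE_LINE_SIZE, rxline_reclaim(), and the TDSTAT_OWN value here are
illustrative assumptions, not taken from the driver:

```c
#include <stdint.h>

#define CACHE_LINE_SIZE	32		/* assumed for this sketch */
#define NRXDESC		64

struct tulip_desc {			/* 16-byte Tulip descriptor */
	uint32_t td_status;
	uint32_t td_ctl;
	uint32_t td_bufaddr1;
	uint32_t td_bufaddr2;
};

#define DESCS_PER_LINE	(CACHE_LINE_SIZE / sizeof(struct tulip_desc))
#define TDSTAT_OWN	0x80000000U	/* NIC owns the descriptor */

static struct tulip_desc rxdescs[NRXDESC];

/*
 * Re-initialize the whole cacheline starting at descriptor `first'
 * only when the NIC has cleared TDSTAT_OWN on every descriptor in it;
 * return 1 if the line was handed back, 0 if the NIC still owns part
 * of it.  In a real driver, a single PREREAD bus_dmamap_sync(9) of
 * the whole line would precede the check and a single PREWRITE sync
 * would follow the update, so the CPU never dirties a line the NIC
 * is still writing.
 */
static int
rxline_reclaim(unsigned int first)
{
	unsigned int i;

	for (i = 0; i < DESCS_PER_LINE; i++)
		if (rxdescs[first + i].td_status & TDSTAT_OWN)
			return 0;	/* NIC still owns part of the line */
	for (i = 0; i < DESCS_PER_LINE; i++)
		rxdescs[first + i].td_status = TDSTAT_OWN;
	return 1;
}
```

This only avoids the race for RX, where descriptors retire in order;
the TX fragmentation concern above remains.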

> > In real world, we need the following changes:
> > 
> > - prepare a new MI API which returns maximum cache line size
> >   for each architecture, at least on ports which have bus_dma(9)
> 
> I think that the *maximum* cacheline size could be a compile-time
> MI constant.  Then we can avoid a lot of complicated code by using either
> something like this,
> 
> struct tulip_desc {
>       /* ... */
> } __packed __aligned(MAX_CACHE_LINE_SIZE);
> 
> or something like this,
> 
> struct proto_tulip_desc {
>       /* ... descriptor fields ... */
>       uint8_t td_pad;
> };
> 
> struct tulip_desc {
>       /* ... descriptor fields ... */
>       uint8_t td_pad[MAX_CACHE_LINE_SIZE -
>                      offsetof(struct proto_tulip_desc, td_pad)];
> } __packed __aligned(4);
> 
> Either way we do it, I think that it avoids the complexity of the
> following, what do you think?

Hmm, I put similar code in sys/arch/cobalt/stand/boot/tlp.c,
but for MI drivers there is one concern: how large the possible
MAX_CACHE_LINE_SIZE is.
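For what it's worth, the __aligned() approach quoted above can be
checked at compile time; this sketch uses C11 _Static_assert rather
than the kernel's CTASSERT(), and MAX_CACHE_LINE_SIZE = 128 is an
assumed worst case, not a value from any header:

```c
#include <stdint.h>

#define MAX_CACHE_LINE_SIZE 128	/* assumed worst case for this sketch */

struct tulip_desc {
	uint32_t td_status;
	uint32_t td_ctl;
	uint32_t td_bufaddr1;
	uint32_t td_bufaddr2;
} __attribute__((__aligned__(MAX_CACHE_LINE_SIZE)));

/*
 * sizeof is rounded up to the alignment, so no two descriptors in an
 * array can ever share a cacheline.
 */
_Static_assert(sizeof(struct tulip_desc) == MAX_CACHE_LINE_SIZE,
    "each descriptor must own a full cacheline");
```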

On mips, the cacheline size can be 128 bytes, while most systems
use 32 bytes. Wasting 128 bytes of DMA-safe memory per 16-byte
descriptor might be problematic on some ports, because such memory
can be a limited resource and bus_dmamem_alloc(9) might fail for
overly large segments, especially when attaching devices on a
running system, which could have less physically contiguous memory
than at boot time.

ex(4) uses even more DMA memory for descriptors, even without
alignment padding (IIRC it's >64KB), and it already has a problem
on hotswap:
http://www.NetBSD.org/cgi-bin/query-pr-single.pl?number=10734

In the tlp(4) case, NTXDESC is 1024 (== 64 * 16) and NRXDESC is 64,
so using 128 bytes per descriptor consumes >128KB of DMA-safe memory.
(We could use non-contiguous pages in chained mode, though.)

> > (note iee(4) which uses direct DMA with the complex sync ops seems
> >  slower than old ie(4) which uses fixed DMA buffer and copies on hp700)
> 
> Just wondering aloud: will performance improve on all architectures if
> we avoid such cacheline interference as leads to the dangerous race
> condition on the non-coherent architectures?  For example, will i386
> benefit?  IIUC, an i386 CPU watches the bus for bus-master access to
> memory regions covered by active cache lines, and it writes back a dirty
> line or discards a clean line ahead of bus-master access to it.  If the
> CPU writes to lines where the bus-master still owns some descriptors,
> then there will be more write-backs than if the driver is programmed
> never to write those lines.  Those write-backs come with a cost in
> memory bandwidth; if the bus-master has to retry its access, there may
> be additional latency, too.

Well, I don't have evidence about which operations
(cache flush ops, descriptor handling in software, bus arbitration
in hardware, etc.) could be the bottleneck. (Modern hardware is fast
and optimized enough.)

Nowadays most hardware designers consider only x86 systems, which
don't have DMA coherency issues (those are handled by the CPU or
chipset), so I guess few people actually think about the performance
of bus snooping or arbitration versus DMA descriptor alignment.
Anyway, we need proper benchmarks for each implementation.

---
Izumi Tsutsui
