Re: tlp(4) DMA synchronization

To: tech-net%NetBSD.org@localhost
Subject: Re: tlp(4) DMA synchronization
From: David Young <dyoung%pobox.com@localhost>
Date: Fri, 28 Aug 2009 12:34:45 -0500

On Fri, Aug 28, 2009 at 08:29:21PM +0900, Izumi Tsutsui wrote:
> dyoung%pobox.com@localhost wrote:
> 
> > > These ops are required on systems which don't handle BUS_DMA_COHERENT.
> > 
> > According to the documentation, we cannot count on BUS_DMA_COHERENT to
> > do anything, so the ops are always required. :-)
> 
> Yes, we should always call sync ops after touching DMA descriptors.
> But in fact a few drivers do it properly, and it means
> most drivers rely on BUS_DMA_COHERENT (or cache coherent hardware).

The drivers rely on BUS_DMA_COHERENT, and BUS_DMA_COHERENT cannot be
relied on.  It is a sad state of affairs. :-/

> > > But strictly speaking, on such system we'd have to use chained mode
> > > (not ring mode) with proper padded between each DMA descriptor,
> > > i.e. one descriptor per cacheline to avoid host vs DMA race.
> > 
> > You're right, of course.  I have seen these races occur on architectures
> > that we aim to support, such as ARM.
> 
> Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?

I don't know.  You mentioned wm(4) and re(4).  Do all of the ports where
those drivers will break without BUS_DMA_COHERENT provide the uncached
mappings?

> > I think that in principle, the host can use ring mode if does not reuse
> > a descriptor until after the NIC has relinquished every other descriptor
> > in the same cacheline.
> 
> Consider the following scenario:
> 
> (1) rxdescs[0].td_status in rxintr is polled and cached
> (2) the received packet for rxdescs[0] is handled
> (3) rxdescs[0] data in cacheline is updated for the next RX op
>     in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
> (4) rxdescs[0] data in the cacheline is written back and invalidated
>     by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()
> 
> If the cachelinesize is larger than sizeof rxdescs
> (i.e. the same cacheline also fetches rxdescs[1])
> and rxdescs[1] for the next descriptor is being updated
> (to clear TDSTAT_OWN) by the device between (1) and (4),
> the updated data will be lost by the writeback op at (4).
> We can put a PREREAD sync op before (3), but race could still
> happen between (3) and (4) by write allocate at (3).

That is just the scenario that I had in mind.  I think that we can use
ring mode and avoid that scenario, if we postpone step (3) until the NIC
is finished with the rest of the Rx descriptors in the same cacheline,
rxdescs[1] through rxdescs[descs_per_cacheline - 1].

> I think this could happen in usual operations because
> the next RX packet will soon be received right after
> RX interrupt caused by the previous packet.

Oh, I know from experience that it does! :-)

> > We may be able to use cacheline-aligned descriptors in ring mode if the
> > chip respects the Descriptor Skip Length (DSL) field of the Bus Mode
> > Register.  According to the datasheet, the DSL field "Specifies the
> > number of longwords to skip between two unchained descriptors."
> > 
> > What do you think?
> 
> In the perfect world, it should work ;-)
> (though I have not checked the datasheet yet)
> 
> In real world, we need the following changes:
> 
> - prepare a new MI API which returns maximum cache line size
>   for each architecture, at least on ports which have bus_dma(9)

I think that the *maximum* cacheline size could be a compile-time
MI constant.  Then we can avoid a lot of complicated code by using either
something like this,

struct tulip_desc {
        /* ... */
} __packed __aligned(MAX_CACHE_LINE_SIZE);

or something like this,

struct proto_tulip_desc {
        /* ... descriptor fields ... */
        uint8_t td_pad;
};

struct tulip_desc {
        /* ... descriptor fields ... */
        uint8_t td_pad[MAX_CACHE_LINE_SIZE -
                       offsetof(struct proto_tulip_desc, td_pad)];
} __packed __aligned(4);

Either way we do it, I think that it avoids the complexity of the
following, what do you think?

> - store the cache line size value into tlp_softc
> - calculate size of whole TX/RX descriptors (and memory for setup packets)
>   dynamically on attach
>   (we can't use static sizeof(struct tulip_control_data))
> - replace all macro which use offsetof(struct tulip_control_data, x)
>   to get region of each descriptor, including all sync ops
> - for tlp(4) specifc problem, make sure all dumb clones support
>   chained mode or DSL field properly

> (note iee(4) which uses direct DMA with the complex sync ops seems
>  slower than old ie(4) which uses fixed DMA buffer and copies on hp700)

Just wondering aloud: will performance improve on all architectures if
we avoid such cacheline interference as leads to the dangerous race
condition on the non-coherent architectures?  For example, will i386
benefit?  IIUC, an i386 CPU watches the bus for bus-master access to
memory regions covered by active cache lines, and it writes back a dirty
line or discards a clean line ahead of bus-master access to it.  If the
CPU writes to lines where the bus-master still owns some descriptors,
then there will be more write-backs than if the driver is programmed
never to write those lines.  Those write-backs come with a cost in
memory bandwidth; if the bus-master has to retry its access, there may
be additional latency, too.

Dave

-- 
David Young             OJC Technologies
dyoung%ojctech.com@localhost      Urbana, IL * (217) 278-3933

Follow-Ups:
- Re: tlp(4) DMA synchronization
  - From: David Laight
- Re: tlp(4) DMA synchronization
  - From: Izumi Tsutsui

References:
- Re: tlp(4) DMA synchronization
  - From: David Young
- Re: tlp(4) DMA synchronization
  - From: Izumi Tsutsui

Prev by Date: Re: tlp(4) DMA synchronization
Next by Date: Re: tlp(4) DMA synchronization
Previous by Thread: Re: tlp(4) DMA synchronization
Next by Thread: Re: tlp(4) DMA synchronization
Indexes:

Home | Main Index | Thread Index | Old Index