
avoiding bus_dmamap_sync() costs



At $DAYJOB I am working on a team that is optimizing wm(4).

In an initial pass over the driver, we found that on x86,
bus_dmamap_sync(9) calls issued some unnecessary LOCK-prefix
instructions, and those instructions were expensive.  Some of the
locked instructions were redundant---that is, there were effectively
two in a row---and others were just unnecessary.  What we found by
reading the AMD & Intel processor manuals is that bus_dmamap_sync() can
be a no-op unless you're doing a _PREREAD or _PREWRITE operation[1].
_PREREAD and _PREWRITE operations need to flush the store buffer[2].
The cache-coherency mechanism will take care of the rest.  We will
feed back a patch with these changes and others as soon as our local
NetBSD-current tree compiles[3].
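
As a concrete illustration, here is a minimal sketch of the fast path
we have in mind.  It assumes cache-coherent DMA with no bounce
buffers, and x86_mfence() is a hypothetical stand-in for whatever
serializing instruction the port uses; this is not the actual x86
bus_dma(9) code:

        void
        sketch_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map,
            bus_addr_t offset, bus_size_t len, int ops)
        {
                if ((ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE))
                    != 0) {
                        /* Drain the store buffer so prior CPU writes
                         * are globally visible before the device looks
                         * at the region, and before any subsequent CPU
                         * read of it. */
                        x86_mfence();
                }
                /* _POSTREAD/_POSTWRITE: no-ops; the cache-coherency
                 * mechanism covers them. */
        }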

In a second pass over the driver, a team member noted that even with
the bus_dmamap_sync(9) optimizations already in place, some of the
LOCK-prefix instructions were still unnecessary.  Just for example, take
this sequence in wm_txintr():

                status =
                    sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
                if ((status & WTX_ST_DD) == 0) {
                        WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
                            BUS_DMASYNC_PREREAD);
                        break;
                }

Here we are examining the status field of a Tx descriptor and, if we
find that the descriptor still belongs to the NIC, we synchronize
the descriptor.  The code is correct and reads plausibly; however,
the x86 implementation will issue a locked instruction that is
unnecessary under these particular circumstances.

In general, it is necessary on x86 to flush the store buffer on a
_PREREAD operation so that if we write a word to a DMA-able address and
subsequently read the same address again, the CPU will not satisfy the
read with store-buffer content (i.e., the word that we just wrote), but
with the last word written at that address by any agent.

In these particular circumstances, however, we do not modify the
DMA-able region, so flushing the store buffer is not necessary.
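
To make the hazard concrete, here is a hedged sketch of the
write-then-read case where the flush does matter.  The descriptor
field name and the variables are illustrative, not wm(4)'s actual
layout:

        /* Hand a descriptor back to the NIC... */
        desc->status = 0;                       /* CPU store */

        /* Without a flush, the load below can be satisfied by store
         * forwarding (our own 0) even after the device has written a
         * new status word to memory.  The _PREREAD sync drains the
         * store buffer so the load reads current memory instead. */
        bus_dmamap_sync(t, map, off, len, BUS_DMASYNC_PREREAD);
        status = desc->status;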

Let us consider another processor architecture.  On some ARM variants,
the _PREREAD operation is necessary to invalidate the cacheline
containing the descriptor whose status we just read so that if we come
back and read it again after a DMA updates the descriptor, content from
a stale cacheline does not satisfy our read, but actual descriptor
content does.
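
For comparison, a sketch of what the _PREREAD path has to do on such
an ARM variant.  cpu_dcache_inv_range() is the existing NetBSD ARM
cache primitive, but the surrounding code is only an assumption:

        if ((ops & BUS_DMASYNC_PREREAD) != 0) {
                /* Discard the (possibly stale) cache lines covering
                 * the descriptor, so the next CPU read is refilled
                 * from memory, where the DMA write will land. */
                cpu_dcache_inv_range(va, len);
        }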

One idea that I have for avoiding the unnecessary instruction on x86
is to add an MI hint to the bus_dmamap_sync(9) API, BUS_DMASYNC_CLEAN.
The hint tells the MD bus_dma(9) implementation that it may treat the
DMA region as if it has not been written (dirtied) by the CPU.  The
code above would change to this:

                status =
                    sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
                if ((status & WTX_ST_DD) == 0) {
                        WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
                            BUS_DMASYNC_PREREAD | BUS_DMASYNC_CLEAN);
                        break;
                }

And the x86 implementation of bus_dmamap_sync() would just skip the
locked instruction when BUS_DMASYNC_CLEAN is present in the flags.
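
In sketch form, again assuming the hypothetical x86_mfence() helper
from above:

        if ((ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE)) != 0 &&
            (ops & BUS_DMASYNC_CLEAN) == 0) {
                /* The caller may have dirtied the region; drain the
                 * store buffer. */
                x86_mfence();
        }
        /* With BUS_DMASYNC_CLEAN set, there is nothing to drain. */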

***

Here is another example where x86 will unnecessarily issue
a costly locked instruction.  This example comes from wm_start():

                /* Sync the descriptors we're using. */
                WM_CDTXSYNC(sc, sc->sc_txnext, txs->txs_ndesc,
                    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);

                /* Give the packet to the chip. */
                CSR_WRITE(sc, sc->sc_tdt_reg, nexttx);

In this case, we do need to flush the store buffer at the WM_CDTXSYNC()
call.  On x86, however, we can rely on the CSR_WRITE() itself to
provide the flush: an x86 IN or OUT instruction flushes the store
buffer, and so does memory-mapped I/O, because writes reach global
visibility in program order and a strong-ordering rule applies to
uncached accesses such as register reads/writes.  (I leave it as an
open question when the register write will hit the bus.)

For cases such as these, where it will suffice to sync a DMA
region before a subsequent register access, how about a new
MI routine, bus_dmamap_barrier(), that orders DMA synchronization
with register access:

void
bus_dmamap_barrier(bus_dma_tag_t dmat, bus_space_tag_t st, bus_dmamap_t dmam,
    bus_addr_t offset, bus_size_t len, int ops);

bus_dmamap_barrier() works like bus_dmamap_sync(), except that `ops'
specifies which synchronization must occur before a subsequent register
access via `st'.  `ops' is one of:

        BUS_DMA_BARRIER_BEFORE_REGRD(dmaops)    // before register read
        BUS_DMA_BARRIER_BEFORE_REGWR(dmaops)    // before register write
        BUS_DMA_BARRIER_BEFORE_REGRW(dmaops)    // before register read or write

where `dmaops' is any valid combination of bus_dmamap_sync() operations.

If the register access indicated by the `ops' argument won't
perform the DMA synchronization (`dmaops') as a side effect,
then bus_dmamap_barrier() just has to do the equivalent of a
bus_dmamap_sync(..., dmaops).
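
To show how this might look in wm_start(), here is a hypothetical
rewrite of the earlier sequence.  The sc fields come from wm(4), but
the offset and length expressions are only indicative:

                /* Order the descriptor sync with the doorbell write.
                 * On x86 the uncached register write already drains
                 * the store buffer, so this can be a no-op there. */
                bus_dmamap_barrier(sc->sc_dmat, sc->sc_st,
                    sc->sc_cddmamap, WM_CDTXOFF(sc->sc_txnext),
                    txs->txs_ndesc * sizeof(wiseman_txdesc_t),
                    BUS_DMA_BARRIER_BEFORE_REGWR(BUS_DMASYNC_PREREAD |
                        BUS_DMASYNC_PREWRITE));

                /* Give the packet to the chip. */
                CSR_WRITE(sc, sc->sc_tdt_reg, nexttx);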

So that's my current thinking about bus_dma(9).  Please let me know
your thoughts.

Dave

[1] Or bounce buffers are involved.

[2] Store buffer is Intel terminology.  Write buffer is AMD terminology
    for the same thing.

[3] RUMP is unpopular at $DAYJOB for various reasons.  One reason
    is that there is no MKRUMP option for disabling it, so it is
    necessary to wait for it to build and install even if it isn't
    wanted.  Another reason is that sometimes changes made to the kernel
    have to be replicated in RUMP, and having to double any effort
    is both expensive and demoralizing.  Please don't read this as a
    criticism of RUMP overall, just a wish for some improvements in
    modularity and code sharing.

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981

