tech-kern archive


Re: avoiding bus_dmamap_sync() costs



On Thu, Jul 12, 2012 at 09:05:10PM -0500, David Young wrote:
> At $DAYJOB I am working on a team that is optimizing wm(4).
> 
> In an initial pass over the driver, we found that on x86,
> bus_dmamap_sync(9) calls issued some unnecessary LOCK-prefix
> instructions, and those instructions were expensive.  Some of the
> locked instructions were redundant---that is, there were effectively
> two in a row---and others were just unnecessary.  What we found by
> reading the AMD & Intel processor manuals is that bus_dmamap_sync() can
> be a no-op unless you're doing a _PREREAD or _PREWRITE operation[1].
> _PREREAD and _PREWRITE operations need to flush the store buffer[2].
> The cache-coherency mechanism will take care of the rest.  We will
> feed back a patch with these changes and others just as soon as our local
> NetBSD-current tree is compilable[3].
> 
> In a second pass over the driver, a team member noted that even with
> the bus_dmamap_sync(9) optimizations already in place, some of the
> LOCK-prefix instructions were still unnecessary.  Just for example, take
> this sequence in wm_txintr():
> 
>                 status =
>                     sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
>                 if ((status & WTX_ST_DD) == 0) {
>                         WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
>                             BUS_DMASYNC_PREREAD);
>                         break;
>                 }
> 
> Here we are examining the status field of a Tx descriptor and, if we
> find that the descriptor still belongs to the NIC, we synchronize
> the descriptor.  The code is correct and persuasive; however, the x86
> implementation will issue a locked instruction that is unnecessary under
> these particular circumstances.
> 
> In general, it is necessary on x86 to flush the store buffer on a
> _PREREAD operation so that if we write a word to a DMA-able address and
> subsequently read the same address again, the CPU will not satisfy the
> read with store-buffer content (i.e., the word that we just wrote), but
> with the last word written at that address by any agent.
> 
> In these particular circumstances, however, we do not modify the
> DMA-able region, so flushing the store buffer is not necessary.
> 
> Let us consider another processor architecture.  On some ARM variants,
> the _PREREAD operation is necessary to invalidate the cacheline
> containing the descriptor whose status we just read so that if we come
> back and read it again after a DMA updates the descriptor, content from
> a stale cacheline does not satisfy our read, but actual descriptor
> content does.
> 
> One idea that I have for avoiding the unnecessary instruction on x86
> is to add an MI hint to the bus_dmamap_sync(9) API, BUS_DMASYNC_CLEAN.
> The hint tells the MD bus_dma(9) implementation that it may treat the
> DMA region as if it has not been written (dirtied) by the CPU.  The code
> above would change to this code:
> 
>                 status =
>                     sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
>                 if ((status & WTX_ST_DD) == 0) {
>                         WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
>                             BUS_DMASYNC_PREREAD);

Oops, line should be:

>                             BUS_DMASYNC_PREREAD|BUS_DMASYNC_CLEAN);
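
To make the intent concrete, here is a rough sketch of how an x86 MD
implementation might decide whether a sync needs to flush the store
buffer at all.  It is only an illustration under assumptions: the flag
values are placeholders, BUS_DMASYNC_CLEAN is the proposed hint and
does not exist yet, and x86_sync_needs_fence() is a hypothetical helper,
not a function in the tree.  Keeping the flush for _PREWRITE even when
_CLEAN is set is a conservative choice of this sketch, not part of the
proposal.

```c
#include <stdbool.h>

/* Placeholder flag values for illustration only. */
#define BUS_DMASYNC_PREREAD	0x01
#define BUS_DMASYNC_POSTREAD	0x02
#define BUS_DMASYNC_PREWRITE	0x04
#define BUS_DMASYNC_POSTWRITE	0x08
#define BUS_DMASYNC_CLEAN	0x10	/* proposed hint; value is hypothetical */

/*
 * Hypothetical helper: return true if the x86 implementation would
 * have to flush the store buffer (e.g. with a locked instruction)
 * for the given sync operations.
 */
static bool
x86_sync_needs_fence(int ops)
{
	/* _POSTREAD/_POSTWRITE alone: cache coherency does the work. */
	if ((ops & (BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE)) == 0)
		return false;

	/* _PREWRITE: pending stores must become visible to the device. */
	if ((ops & BUS_DMASYNC_PREWRITE) != 0)
		return true;

	/*
	 * _PREREAD only: flush unless the caller promises that the CPU
	 * has not written (dirtied) the region.
	 */
	return (ops & BUS_DMASYNC_CLEAN) == 0;
}
```

With this, the wm_txintr() case above
(BUS_DMASYNC_PREREAD|BUS_DMASYNC_CLEAN) would skip the locked
instruction, while a plain BUS_DMASYNC_PREREAD would still issue it.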

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981

