tech-kern archive
Re: avoiding bus_dmamap_sync() costs
On Thu, Jul 12, 2012 at 09:05:10PM -0500, David Young wrote:
> At $DAYJOB I am working on a team that is optimizing wm(4).
>
> In an initial pass over the driver, we found that on x86,
> bus_dmamap_sync(9) calls issued some unnecessary LOCK-prefix
> instructions, and those instructions were expensive. Some of the
> locked instructions were redundant---that is, there were effectively
> two in a row---and others were just unnecessary. What we found by
> reading the AMD & Intel processor manuals is that bus_dmamap_sync() can
> be a no-op unless you're doing a _PREREAD or _PREWRITE operation[1].
> _PREREAD and _PREWRITE operations need to flush the store buffer[2].
> The cache-coherency mechanism will take care of the rest. We will
> feed back a patch with these changes and others just as soon as our
> local NetBSD-current tree is compilable[3].
>
> In a second pass over the driver, a team member noted that even with
> the bus_dmamap_sync(9) optimizations already in place, some of the
> LOCK-prefix instructions were still unnecessary. Just for example, take
> this sequence in wm_txintr():
>
> 	status =
> 	    sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
> 	if ((status & WTX_ST_DD) == 0) {
> 		WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
> 		    BUS_DMASYNC_PREREAD);
> 		break;
> 	}
>
> Here we are examining the status field of a Tx descriptor and, if we
> find that the descriptor still belongs to the NIC, we synchronize
> the descriptor. It's correct and persuasive code; however, the x86
> implementation will issue a locked instruction that is unnecessary under
> these particular circumstances.
>
> In general, it is necessary on x86 to flush the store buffer on a
> _PREREAD operation so that if we write a word to a DMA-able address and
> subsequently read the same address again, the CPU will not satisfy the
> read with store-buffer content (i.e., the word that we just wrote), but
> with the last word written at that address by any agent.
>
> In these particular circumstances, however, we do not modify the
> DMA-able region, so flushing the store buffer is not necessary.
>
> Let us consider another processor architecture. On some ARM variants,
> the _PREREAD operation is necessary to invalidate the cacheline
> containing the descriptor whose status we just read so that if we come
> back and read it again after a DMA updates the descriptor, content from
> a stale cacheline does not satisfy our read, but actual descriptor
> content does.
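
To make the stale-cacheline case concrete, here is a toy model of it in plain C. It is only a sketch: the toy_* names and TOY_ST_DD are made up for illustration and are not part of bus_dma(9) or wm(4), and a real non-coherent machine invalidates by address range rather than flipping one flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ST_DD 0x1u	/* hypothetical "descriptor done" bit */

/* One descriptor status word in memory plus the CPU's cached copy of it. */
struct toy_cache {
	uint32_t memory;	/* what the device's DMA writes into */
	uint32_t line;		/* the CPU's cached copy */
	bool valid;		/* may the cached copy satisfy a load? */
};

/* CPU load: served from the cache when valid, otherwise refilled. */
uint32_t
toy_cpu_read(struct toy_cache *c)
{
	if (!c->valid) {
		c->line = c->memory;
		c->valid = true;
	}
	return c->line;
}

/* Device DMA write: updates memory only; the cache is not snooped. */
void
toy_dma_write(struct toy_cache *c, uint32_t v)
{
	c->memory = v;
}

/* Stand-in for what a _PREREAD sync does on such a machine. */
void
toy_sync_preread(struct toy_cache *c)
{
	c->valid = false;
}

/* Poll once, optionally sync, let the device finish, then poll again. */
uint32_t
toy_poll_twice(bool sync_between)
{
	struct toy_cache c = { .memory = 0, .line = 0, .valid = false };

	assert((toy_cpu_read(&c) & TOY_ST_DD) == 0);	/* NIC still owns it */
	if (sync_between)
		toy_sync_preread(&c);
	toy_dma_write(&c, TOY_ST_DD);			/* NIC completes */
	return toy_cpu_read(&c) & TOY_ST_DD;
}
```

Without the sync between the two polls, the second read is satisfied from the stale line and never sees the DD bit; with it, the read goes back to memory.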
>
> One idea that I have for avoiding the unnecessary instruction on x86
> is to add an MI hint to the bus_dmamap_sync(9) API, BUS_DMASYNC_CLEAN.
> The hint tells the MD bus_dma(9) implementation that it may treat the
> DMA region as if it has not been written (dirtied) by the CPU. The code
> above would change to this code:
>
> 	status =
> 	    sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
> 	if ((status & WTX_ST_DD) == 0) {
> 		WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
> 		    BUS_DMASYNC_PREREAD);
Oops, line should be:
> 		    BUS_DMASYNC_PREREAD|BUS_DMASYNC_CLEAN);
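
For illustration, the decision an x86 MD implementation might make with the hint could look roughly like the sketch below. Everything in it is an assumption: the SK_* names and their values are stand-ins for this example (BUS_DMASYNC_CLEAN is only a proposal and has no assigned value), and letting _PREWRITE override the hint is my own guess, since a _PREWRITE implies that the CPU did write the region.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the BUS_DMASYNC_* flags. */
enum {
	SK_PREREAD   = 0x01,
	SK_POSTREAD  = 0x02,
	SK_PREWRITE  = 0x04,
	SK_POSTWRITE = 0x08,
	SK_CLEAN     = 0x10	/* hypothetical value for the proposed hint */
};

/*
 * Return true if the sync would have to flush the store buffer
 * (that is, issue the expensive locked instruction) on x86.
 */
bool
sk_x86_needs_store_flush(int ops)
{
	const int all = SK_PREREAD | SK_POSTREAD | SK_PREWRITE |
	    SK_POSTWRITE | SK_CLEAN;

	assert((ops & ~all) == 0);	/* reject unknown flags */

	/* _POSTREAD and _POSTWRITE: cache coherency does the work. */
	if ((ops & (SK_PREREAD | SK_PREWRITE)) == 0)
		return false;

	/* Assumption: _PREWRITE means the CPU wrote; ignore the hint. */
	if ((ops & SK_PREWRITE) != 0)
		return true;

	/* _PREREAD alone: flush unless the caller says nothing is dirty. */
	return (ops & SK_CLEAN) == 0;
}
```

With this policy, a plain _PREREAD still pays for the flush, while _PREREAD|_CLEAN, as in the corrected line above, does not.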
Dave
--
David Young
dyoung%pobox.com@localhost Urbana, IL (217) 721-9981