Subject: Re: Network driver receive path
To: Maen Suleiman <maen.suleiman@gmail.com>
From: jonathan@dsg.stanford.edu
List: tech-net
Date: 03/14/2007 10:28:45
In message <9c1cad6e0703141024n6c8385dei3058e73a73a0e696@mail.gmail.com>,
"Maen Suleiman" writes:
>Hi,
>
>I am trying to tune our giga driver performance,

Is a "giga" a gigabit Ethernet interface?

>I have noticed that
>the system spends 57% of the time on interrupts when we do a
>receive-oriented test, while the system spends only 20% of the time
>on interrupts when we do a send-oriented test.
>
>From the profiler results, we understood that the main reason for
>spending this time in the RX interrupt was MGETHDR, MCLGET and
>bus_dmamap_load, and mainly the bus_dmamap_load function.

Are your tests sustaining the same (or closely comparable) throughput?
If so, then your driver is DMA-mapping roughly the same amount of data
for both transmit and receive.  Again, if so, that'd tend to suggest
the problem is interrupt rate on the receive side, rather than the
transmit side.  The fix for *that* is to use interrupt mitigation, if
your hardware supports it.
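
I don't know your hardware, so the names below are invented, but most
gigabit MACs can be told to interrupt only after N received frames or
after a timeout, whichever comes first.  Something along these lines,
done once at init time rather than per packet:

    /*
     * Sketch only: GIGA_WRITE() and the register names/units are made
     * up, since I don't know your "giga" chip; its datasheet will have
     * the real coalescing registers.
     */
    #define GIGA_RX_COAL_FRAMES   32    /* hypothetical: intr after 32 frames */
    #define GIGA_RX_COAL_USECS    100   /* hypothetical: ...or after 100us    */

    static void
    giga_set_rx_mitigation(struct giga_softc *sc)
    {
            GIGA_WRITE(sc, GIGA_RX_COAL_FRAME_REG, GIGA_RX_COAL_FRAMES);
            GIGA_WRITE(sc, GIGA_RX_COAL_TIME_REG, GIGA_RX_COAL_USECS);
    }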

On the other hand, if you are confident in your profile data pointing
to bus_dmamap_load, perhaps the DMA map for receive data really is
significantly more expensive (per packet) than for TX data.  At a
wild guess, perhaps RX incurs more work than TX (e.g., forcing lines
of cached data from the CPU cache out into main memory, and
invalidating them, before the device writes into the buffer?).
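
If you want to check that, compare the bus_dmamap_sync() operations
the two paths end up doing; sc_dmat, rxmap and txmap below are just
placeholder names for your tag and maps:

    /*
     * RX: the PREREAD/POSTREAD pair.  On a write-back cache this
     * typically means invalidating (and writing back partially dirty)
     * cache lines covering the whole receive buffer.
     */
    bus_dmamap_sync(sc->sc_dmat, rxmap, 0, rxmap->dm_mapsize,
        BUS_DMASYNC_PREREAD);
    /* ... descriptor handed to the chip; RX interrupt arrives ... */
    bus_dmamap_sync(sc->sc_dmat, rxmap, 0, rxmap->dm_mapsize,
        BUS_DMASYNC_POSTREAD);

    /*
     * TX: the PREWRITE/POSTWRITE pair.  PREWRITE is usually just a
     * write-back of the data the CPU dirtied, which can be noticeably
     * cheaper per packet.
     */
    bus_dmamap_sync(sc->sc_dmat, txmap, 0, txmap->dm_mapsize,
        BUS_DMASYNC_PREWRITE);
    /* ... TX-complete interrupt ... */
    bus_dmamap_sync(sc->sc_dmat, txmap, 0, txmap->dm_mapsize,
        BUS_DMASYNC_POSTWRITE);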

>The problem is that we couldn't find an alternative to allocating
>mbufs and calling bus_dmamap_load in the RX interrupt!
>
>Will using a task to do the mbuf handling help ?

Nope, not at this time.  And in general, probably not for any
single-CPU system: you're doing the same work, plus adding some
context-switch overhead.

[... reordered...]

>Is there a way to allocate a constant physical memory block for the RX
>DMA, and then use this block for the mbufs that will be delivered
>to the stack?  In this case I must know when the TCP stack has
>finished handling the mbuf, and then I will re-use the same physical
>memory space!

Not really, not in any MI way in NetBSD.  bus_dma(9) does include a
BUS_DMA_COHERENT flag, but it's documented as being a "hint" to the
(machine-dependent) implementations of bus_dma(9); portable NetBSD
drivers still have to issue the appropriate bus_dmamap_sync() calls.
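
For what it's worth, the hint is requested when the memory is mapped;
something like this (RX_AREA_SIZE is a made-up constant, and the exact
type of the kva pointer varies between releases):

    bus_dma_segment_t seg;
    int rseg, error;
    void *kva;

    /* one physically contiguous area for all RX buffers */
    error = bus_dmamem_alloc(sc->sc_dmat, RX_AREA_SIZE, PAGE_SIZE, 0,
        &seg, 1, &rseg, BUS_DMA_NOWAIT);
    if (error == 0)
        error = bus_dmamem_map(sc->sc_dmat, &seg, rseg, RX_AREA_SIZE,
            &kva, BUS_DMA_NOWAIT | BUS_DMA_COHERENT);

    /*
     * Even if the machine-dependent code honours BUS_DMA_COHERENT, a
     * portable driver still brackets each DMA transfer with
     * bus_dmamap_sync(); on a truly coherent mapping those calls are
     * expected to be cheap or no-ops.
     */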

>Is there a way to tell the TCP stack to give me back the mbuf that was
>delivered to it, and then I can re-use the same mbufs without calling
>bus_dmamap_load?

Not for mbufs, not really.

For mbuf *clusters* you could implement a driver-private mbuf cluster
pool, backed by normal DMA mechanisms.  You _could_ then attempt some
machine-dependent violations of the machine-independent API, based on
your own knowledge of your CPU and private memory pool; but such a
driver wouldn't work on other ports of NetBSD to other CPU
architectures (e.g., those which have IOMMUs and therefore rely on
drivers following the documented bus_dma(9) API for correct
operation).
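
Very roughly, the private-pool idea looks like the following.
giga_buf_get()/giga_buf_put(), GIGA_BUFSZ and the softc fields are
hypothetical: the helpers hand out fixed-size buffers carved from one
area set up with bus_dmamem_alloc()/bus_dmamem_map() at attach time.
Check the exact MEXTADD()/ext_free prototypes in sys/mbuf.h for your
NetBSD version before copying any of this.

    /*
     * Hypothetical free hook: return the cluster to the driver pool.
     * (Check the real ext_free prototype, and whether it is also
     * expected to free the mbuf itself, in sys/mbuf.h/uipc_mbuf.c.)
     */
    static void
    giga_rxbuf_free(struct mbuf *m, void *buf, size_t size, void *arg)
    {
            struct giga_softc *sc = arg;

            giga_buf_put(sc, buf);
    }

    /* RX refill: attach driver-owned storage instead of MCLGET() */
    static int
    giga_add_rxbuf(struct giga_softc *sc, int idx)
    {
            struct mbuf *m;
            void *buf;

            MGETHDR(m, M_DONTWAIT, MT_DATA);
            if (m == NULL)
                    return ENOBUFS;
            if ((buf = giga_buf_get(sc)) == NULL) {
                    m_freem(m);
                    return ENOBUFS;
            }
            MEXTADD(m, buf, GIGA_BUFSZ, M_DEVBUF, giga_rxbuf_free, sc);
            m->m_pkthdr.len = m->m_len = GIGA_BUFSZ;

            /*
             * Because the buffer's bus address is already known from
             * the bus_dmamem_alloc() segment (seg.ds_addr + offset),
             * the per-packet bus_dmamap_load() becomes simple
             * arithmetic plus the usual sync/cache maintenance --
             * which is exactly the machine-dependent shortcut
             * described above.
             */
            sc->sc_rxsoft[idx].rxs_mbuf = m;    /* hypothetical field */
            return 0;
    }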


If possible, a better approach might be to extend the bus_dma(9)
implementation and the mbuf-cluster machinery to cache and reuse more
state, to avoid (for example) repeated KVA-to-physical translations
when you reuse the same physical addresses.  That's likely to be a
big undertaking, and I'd suggest some close discussion with Jason
Thorpe before going down that route.
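
As a very rough illustration of the sort of caching I mean (not of the
real bus_dma(9) internals, which also have to cope with multi-segment
buffers, bouncing, IOMMUs and invalidation on unmap), think of a small
translation cache in front of pmap_extract(); N_PCACHE is a made-up
size:

    struct paddr_cache_ent {
            vaddr_t pce_va;         /* kernel virtual page */
            paddr_t pce_pa;         /* its physical address */
    };

    static struct paddr_cache_ent pcache[N_PCACHE];

    /*
     * Remember the physical address of a kernel virtual page the
     * first time it is looked up, so reloading a recycled cluster
     * skips the pmap walk.  No invalidation is shown here.
     */
    static paddr_t
    cached_kvtop(vaddr_t va)
    {
            vaddr_t pg = trunc_page(va);
            u_int slot = (pg >> PGSHIFT) % N_PCACHE;
            paddr_t pa;

            if (pcache[slot].pce_va != pg) {
                    if (!pmap_extract(pmap_kernel(), pg, &pa))
                            panic("cached_kvtop");
                    pcache[slot].pce_va = pg;
                    pcache[slot].pce_pa = pa;
            }
            return pcache[slot].pce_pa + (va - pg);
    }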

But my guess is, you really need to find, and discuss options with,
someone who understands both the bus_dma(9) backend for your CPU
(ARM?)  and your non-PCI "giga" device.