
Re: RFC: softint-based if_input



On Thu, Jan 28, 2016 at 12:17 AM, Taylor R Campbell
<campbell+netbsd-tech-kern%mumble.net@localhost> wrote:
>    Date: Wed, 27 Jan 2016 16:51:22 +0900
>    From: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
>
>    Here it is: http://www.netbsd.org/~ozaki-r/softint-if_input-ifqueue.diff
>
>    Results of performance measurements of it are also added to
>    https://gist.github.com/ozaki-r/975b06216a54a084debc
>
>    The results are good, but they bother me: the patch achieves better
>    performance than vanilla (and the 1st implementation) under high
>    load (IP forwarding). For fast forward, it also beats the 1st one.
>
>    I thought that holding splnet during ifp->if_input (splnet is needed
>    for the ifqueue operations, so the patch keeps it held) might affect
>    the results. So I tried releasing it during ifp->if_input, but the
>    results didn't change much (the result for IP forwarding is still
>    better than vanilla).
>
>    Anyone have any ideas?
>
> Here's a wild guess: with vanilla, each CPU does
>
>         wm_rxeof loop iteration
>         if_input processing
>         wm_rxeof loop iteration
>         if_input processing
>         ...
>
> back and forth.  With softint-rx-ifq, each CPU does
>
>         wm_rxeof loop iteration
>         wm_rxeof loop iteration
>         ...
>         if_input processing
>         if_input processing
>         ...
>
> because softint processing is blocked until the hardintr handler
> completes.  So vanilla might make less efficient use of the CPU cache,
> and vanilla might leave the rxq full for longer so that the device
> cannot fill it as quickly with incoming packets.

That might be true. If so, the real question may be why the old
implementation is less efficient than the new one.
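
In code, your guess corresponds roughly to the following (a sketch, not
the actual wm(4) or patch code; wm_rxeof_one() and the rxq_ifq/rxq_si
members are made up for illustration):

        /* vanilla: each packet goes up the stack from inside the
         * hardintr RX loop, so rxeof iterations and if_input
         * processing interleave */
        while ((m = wm_rxeof_one(rxq)) != NULL)
                (*ifp->if_input)(ifp, m);

        /* softint-rx-ifq: the hardintr only drains the HW ring
         * into an ifqueue, then schedules a softint */
        while ((m = wm_rxeof_one(rxq)) != NULL)
                IF_ENQUEUE(&rxq->rxq_ifq, m);
        softint_schedule(rxq->rxq_si);

        /* ...and the softint hands the whole batch to the stack */
        for (;;) {
                IF_DEQUEUE(&rxq->rxq_ifq, m);
                if (m == NULL)
                        break;
                (*ifp->if_input)(ifp, m);
        }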

>
> Another experiment that might be worthwhile is to bind the interrupt
> to a specific CPU, and then use splnet instead of WM_RX_LOCK to avoid
> acquiring and releasing a lock for each packet.

In the measurements, all interrupts are already delivered to CPU#0.
Removing the lock doesn't change the results. I guess acquiring and
releasing an uncontended lock is low overhead. Note that wm has an
RX lock per HW queue, so RX processing basically runs with no lock
contention.
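
For reference, what I compared looks roughly like this (a simplified
sketch; wm_rxeof_one() is again a made-up stand-in for one iteration
of the wm_rxeof loop):

        /* current code, simplified: a mutex per HW queue */
        WM_RX_LOCK(rxq);
        m = wm_rxeof_one(rxq);
        WM_RX_UNLOCK(rxq);

        /* experiment: with all interrupts bound to CPU#0, raising
         * the IPL is enough to exclude the hardintr handler, so the
         * per-packet lock can be dropped */
        s = splnet();
        m = wm_rxeof_one(rxq);
        splx(s);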

>  (On Intel >=Haswell,
> we should use transactional memory to avoid bus traffic for that
> anyway (and maybe invent an MD pcq(9) that does the same).  But the
> experiment with wm(4) is easier, and not everyone has transactional
> memory.)

How does transactional memory help?
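Do you mean wrapping the queue operations in a hardware transaction so
the uncontended fast path commits without LOCK-prefixed instructions?
E.g., with Intel RTM intrinsics (just my guess at what you mean;
pcq_put_plain() is a made-up non-atomic variant):

        #include <immintrin.h>  /* _xbegin()/_xend(); needs -mrtm */

        if (_xbegin() == _XBEGIN_STARTED) {
                /* the store sequence runs as a transaction;
                 * conflicts are detected through the cache
                 * coherence protocol, so the common case commits
                 * with no locked bus cycle */
                pcq_put_plain(q, m);    /* made-up variant */
                _xend();
        } else {
                /* aborted (conflict, capacity, RTM unavailable):
                 * fall back to the existing CAS-based pcq_put() */
                pcq_put(q, m);
        }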

  ozaki-r

