tech-net archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Deferring IP checksum for xennet(4)/xvif(4)


I've been looking on how xennet*/xvif* handles checksums, and how to
avoid computing the checksums in software, possibly deferring this for
the NIC hw to do.

Right now, the virtual network interfaces by default compute and
verify the IP/UDP/TCP checksums for every packet passed between Dom0
and DomU. There is a flag to do Tx offloading, which makes the code to
skip computing the checksum on the transmit side. This however does
not really save any processing, because the receiving side anyway
computes the checksum and then immediately verifies it. All this does
it moving the load from one side to another.

I've done some experiment on how to avoid the compute+verify of
checksum on receiving side. It works well if the packet for Dom0 <->
DomU, but doesn't work if the packet needs to be forwarded or
bridge(4)d, it causes bad checksums in those packets.

What I think happens is this:
0. Forward and bridge code explicitly clear csum_flags, which makes
the mbuf to lose the information that the csum was actually deferred.
1. ip_forward() calls ip_output() which csum_flags == 0, ip_output()
possibly offloads IPv4 checksum, but it won't touch UDP/TCP one.
Result is that packet will be forwarded with bad TCP/UDP checksum,
recipient will see bad checksum and drop the packet.
2. bridge_forward() simply sends the packet to the destination
interface(s) as-is with no checksum, due to csum_flags == 0 no hw
checksum is done, recipient will see bad checksum and drop the packet.

Curiously enough tap(4) leaves csum_flags intact. This is likely a bug.

My first attempt was to modify the code in if_bridge.c to simply pass
the csum_flags compatible with the destination interface, i.e. this:

--- if_bridge.c 24 Feb 2020 00:47:38 -0000 1.168
+++ if_bridge.c 17 Mar 2020 20:56:31 -0000
@@ -1871,7 +1871,11 @@ bridge_forward(struct bridge_softc *sc,
       * clear any in-bound checksum flags to prevent them from being
       * misused as out-bound flags.
-      m->m_pkthdr.csum_flags = 0;
+     m->m_pkthdr.csum_flags &= dst_if->if_csum_flags_tx;

This works, now I have domu communicating with external machina via
bridge0 using wm0 as long as I turn on hw offloading on wm0.

There are at least two problems with this:
1. If destination interface has no hw offloading, it will anyway send
the packet without checksum. I think this is actually OK, it's
responsibility of the admin to only enable the Tx/Rx for xennet if hw
device is hw offload capable.
2. If packet is forwarded from an interface supporting hw offloading
to another one supporting hw offloading, the destination interface
will compute and write the checksum to wrong place, overwriting
typically destination address. This happens because the hw offloading
relies on mbuf csum_data being set with proper checksum offset, and
this information is not passed from input interface. Also very
frequently drivers have no information whether the offload was done
for v4 or v6 packet, passing both M_CSUM_* flags. Destination
interface would then compute even _wrong_ checksum, depending on order
it checks for the flags.

Thus, all this needs to be really explicit. I propose this

1. Add new M_CSUM_BLANK flag, set it only in xennet code
2. Modify ip_forward() and bridge_forward() to do:
    if (__predict_false(m->m_pkthdr.csum_flags & M_CSUM_BLANK)) {
        KASSERT(m->m_pkthdr.csum_data > 0);
        /* If destination interface doesn't support the requested hw
offload, packet will
         * be sent with wrong checksum and dropped by recipient */
      m->m_pkthdr.csum_flags &= dst_if->if_csum_flags_tx;
   } else {
      m->m_pkthdr.csum_flags = 0;

Something like that will also need to happen for gre(4) and gif(4)
too, althrough there it will be necessary to fully undefer the
checksum (i.e. compute it) - there is no hw to do it for us.



P.S. anyone has idea how the vmx(4) offloading works? is it possible
it can benefit from this change too?

Home | Main Index | Thread Index | Old Index