Are there measurements supporting the idea that the BPF pcode engine
is a performance bottleneck?
I didn't do any measurements of the whole system myself because I
don't have access to any test [network], but the jitted code is 3-5
times faster in my benchmark. In absolute terms it's IIRC 30ns per
packet on my amd64 and 150-200ns on my Tegra-250 ARM.
Then it comes down to: is saving 30ns worth the code size and complexity
(and thus exposure)? I don't know how large a fraction of the cost of
packet handling 30ns is; I suspect it's small enough that I don't
consider it worth the costs. (To pick a number out of thin air: is
handling a packet in 970ns instead of 1us, a saving of roughly 3%,
worth it? I think it's not, and I suspect 1us to handle a packet is
highly optimistic, though I haven't measured that either.)