tech-net archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

BPF64: proposal of platform-independent hardware-friendly backwards-compatible eBPF alternative



Hello!

                                         We don't need ELF relocations!
                                           We want better loop control!
                                               No so little parameters,
				        Verifier! Leave our code alone!
                                                          -- Ping Floyd

I've recently had some experience with Linux's ePBF in it's XDP, and this left
quite negative impression. I was following via https://github.com/xdp-project/xdp-tutorial
and after 3rd lesson was trying to create a simple program for searching TCP
timestamp option and incrementing it by one. As you know, eBPF tool stack
consists of at least clang and eBPF verifier in the kernel, and after two dozen
tries eBPF verifier still didn't accept my code. I was digging into verifier
sources, and the abysses opened in front of me! Carefully and boringly going
via disassembler and verifier output, I've found that clang optimizer ignores
just checked register - patching one byte in assembler sources (and target .o)
did help. I've filed https://github.com/iovisor/bcc/issues/5062 with details
if one curious.

So, looking at eBPF ecosystem, I must say it's a Frankenstein. Sewn from good,
sometimes brilliant parts, it's a monster in outcome. Verifier is in it's own
right, compiler/optimizer is in it's own right... But at the end you even
don't have a high-level programming language! You must write in C, relatively
low-level C, and restricted subset of C. This requires very skilled
professionals - it's far from something not even user-friendly, but at least
sysadmin-friendly, like `ipfw` or `iptables` firewall rules.

Thus I looked at the foundation of eBPF architecture, with which presuppositions
in mind it was created with. In fact, it tries to be just usual programming
after checks - that is, with all that pointers. It's too x86-centric and
Linux-centric - number of registers was added just ten. So if you look at the
GitHub ticket above, when I tried to add debug to program - you know, just
specific `printf()`s - it failed verifier checks again because compiler now
had to move some variables between registers and memory, as there is limit on
just 5 arguments to call due to limit of 5 registers! And verifier, despite
being more than 20,000 lines of code, still was not smart enough to track info
between registers and stack.

So, if we'd started from beginning, what should we do? Remember classic BPF:
it has very simple validator due to it's Virtual Machine design - only forward
jumps, checks for packet boundaries at runtime, etc. You'd say eBPF tries for
performance if verifier's checks were passed? But in practice you have to toss
in as much packet boundary checks as near to actual access as possible, or
verifier may "forget" it, because of compiler optimizer. So this is not of
much difference for checking if access is after packet in classic BPF - the
same CMP/JUMP in JIT if buffer is linear, and if your OS has put packet in
several buffers, like *BSD or DPDK `mbuf`'s, the runtime check overhead is
negligible in comparison.

Ensuring kernel stability? Just don't allow arbitrary pointers, like original BPF.
Guaranteed termination time? It's possible if you place some restrictions. For
example, don't allow backward jumps but allow function calls - in case of
stack overflow, terminate program. Really need backward jumps? Let's analyze
for what purpose. You'll find these are needed for loops on packet contents.
Solve it but supporting loops in "hardware"-controlled loops, which can't be
infinite.

Finally, platforms. It's beginning of sunset of x86 era now - RISC is coming.
ARM is now not only on mobiles, but on desktops and servers. Moreover, it's
era of specialized hardware accelerators - e.g. GPU, neural processors. Even
general purpose ARM64 has 31 register, and specialized hardware can
implement much more. Then, don't tie to Linux kernel - BPF helpers are very
rigid interface, from ancient era, like syscalls.

So, let's continue *Berkeley* Packet Filter with Berkeley RISC design - having
register window idea, updated by SPARC and then by Itanium (to not waste
registers). Take NetBSD's coprocessor functions which set is passed with
a context, instead of hardcoded enums of functions - for example, BPF maps is
not something universal, both NetBSD and FreeBSD have their own tables in
firewall.

Add more features actually needed for *network* processor - e.g. 128-bit
registers for IPv6 (eBPF axed out even BPF_MSH!). And do all of this in fully
backwards-compatible way - new language should allow to run older programs
from e.g. `tcpdump` to run without any modifications, binary-compatible
(again, eBPF does not do this - it's incompatible with classiv BPF and uses
a translator from it).

Next, eBPF took "we are masquerading usual x86 programming" way not only just
in assembly language. They have very complex ELF infrastructure around it which
may be not suitable for every network card - having pc-addressed literals, as
in RISC processors allows for much simpler format: just BLOB of instructions.
BPF64 adds BPF_LITERAL "instruction" of varying length (it's interpreted by
just skipping over contents as if it was jump), which, if have special
signatures and format, allow for this BLOB of instructions to contain some
metadata about itself for loading, much simpler than ELF (esp. with DWARF).

Then, ecosystem. eBPF defines functions callable from user code like:

> enum bpf_func_id___x { BPF_FUNC_snprintf___x = 42 /* avoid zero */ };

That is, ancient syscall-like way of global constant, instead of context. A
"context" here is the structure passed with code to execution which contains
function pointers of what is available to this user code, in spirit of
NetBSD's `bpf_ctx_t` for their BPF_COP/BPF_COPX extensions. This is not only
provides better way than "set in stone" syscall-like number, but BPF64 goes
further and defines an "packages" in running kernel with namespaces to allow
e.g. Foo::Bar::baz() function to call Foo::quux() from another BPF program,
populating ("linking") it's context with needed function without relocations.
These "packages" expected to be available to admin in e.g. sysctl tree, with
descriptions, versioning and other attributes.

Some other quotes about how restricted eBPF is:

> First, a BPF program using bpf_trace_printk() has to have a GPL-compatible license.
> Another hard limitation is that bpf_trace_printk() can accept only up to 3 input arguments (in addition to fmt and fmt_size). This is quite often pretty limiting and you might need to use multiple bpf_trace_printk() invocations to log all the data. This limitation stems from the BPF helpers ability to accept only up to 5 input arguments in total.
> Previously, bpf_trace_printk() allowed the use of only one string (%s) argument, which was quite limiting. Linux 5.13 release lifts this restriction and allows multiple string arguments, as long as total formatted output doesn't exceed 512 bytes. Another annoying restriction was the lack of support for width specifiers, like %10d or %-20s. This restriction is gone now as well

> Helper function bpf_snprintf
> Outputs a string into the str buffer of size str_size based on a format string stored in a read-only map pointed by fmt.
>
> Each format specifier in fmt corresponds to one u64 element in the data array. For strings and pointers where pointees are accessed, only the pointer values are stored in the data array. The data_len is the size of data in bytes - must be a multiple of 8.
>
> Formats %s and %p{i,I}{4,6} require to read kernel memory. Reading kernel memory may fail due to either invalid address or valid address but requiring a major memory fault. If reading kernel memory fails, the string for %s will be an empty string, and the ip address for %p{i,I}{4,6} will be 0. Not returning error to bpf program is consistent with what bpf_trace_printk() does for now.
>
> Returns
>
> The strictly positive length of the formatted string, including the trailing zero character. If the return value is greater than str_size, str contains a truncated string, guaranteed to be zero-terminated except when str_size is 0.
>
> Or -EBUSY if the per-CPU memory copy buffer is busy.
>
> static long (* const bpf_snprintf)(char *str, __u32 str_size, const char *fmt, __u64 *data, __u32 data_len) = (void *) 165;

So. In summary: BPF64 has not only traditional ("firewall on network packets")
area of use but also suitable - and having goals to design in mind - for:

* non-network. e.g. syscall arguments checking
* network protocols passing untrusted code which can be run by other side in
  restricted sandbox (e.g. muSCTP custom (de)compression rules)
* an alternative to https://capnproto.org/rpc.html for multi-method
  calls on high-latency links (e.g. MQTT5-like and/or for environments
  when Cap’n Proto itself is not applicable, e.g. no direct connections)

I've put a sketch of design to https://github.com/nuclight/bpf64 with files:

* The `nuc_ts_prog_kern.c` (and it's include `nuc_ts_common_kern_user.h`) is
  XDP/eBPF program (for Linux 6.5) for parsing TCP packet and incrementing
  it's Timestamp option, if any, recording statisitics intop eBPF map.
* The `nuc_ts_incr.baw` (and it's include `nuc_ts_incr.bah`) is the equivalent
  program doing the same thing, but in a new BPF64 Assembler Wrapper language,
  not yet written and subject to change. Note this is a lower-level language
  than C, viewed as intermediate solution until BPF64 becomes stable, after
  which more higher-level language (higher than C) should be written, at least
  as expressible as `tcpdump` (`libpcap`) one.
* `bpf64spec.md` for draft of Specification, but currently it's more in a
  "a draft of draft" state :-)

I am requesting comments about this architecture before implementing,
especially from people knowledgeable in JIT compilers, because, though while
I see BPF64 much more simpler than eBPF (probably just several man-months to
implement), my sabbatical has ended - and design mistakes are much harder to
fix when implementation already exists.

-- 
WBR, @nuclight


Home | Main Index | Thread Index | Old Index