Re: Improving the data supplied by BPF

To: Arnaud Lacombe <lacombar%gmail.com@localhost>
Subject: Re: Improving the data supplied by BPF
From: Darren Reed <darrenr%netbsd.org@localhost>
Date: Sat, 27 Dec 2008 00:38:12 +1100

Arnaud Lacombe wrote:
> Hi,
> 
> On Thu, Dec 25, 2008 at 11:35 AM, Darren Reed <darrenr%netbsd.org@localhost> wrote:
>> Recently I've talked with a few different folks about packet capture
>> and have become aware of some of the problems that people face when
>> trying to use BPF vs other proprietary solutions that exist. While it
>> may be possible to capture data at a good rate with BPF, there is
>> important meta data that isn't provided.
>>
> could you detail what BPF is missing vs. other proprietary solutions?
> What can a heavy tcpdump user expect compared to the current one?

Notification about when packets are dropped, an indication of
whether or not the packet was going in or out... there are
additional characteristics, such as if the packet was an "error"
(i.e. bad ethernet CRC, runt, etc) but it appears BPF doesn't
see those anyway. Being able to easily find the start of the
packet, being told the complete size of the current record...

>> This set of diffs attempts to address that by introducing a new BPF
>>
> maybe your changes would be clearer if you only provided the diff made
> on BPF itself (about 10% of the whole diff), and a sample use-case.
> Everything else is only API change.

If you're sufficiently interested then I'm sure you can extract the
part that concerns you... but honestly, the code changes to BPF are
trivial. What's really important is what I included and what you've
commented on.

Simple use case? Say you've got a bridge port on your NetBSD box and
you capture packets on it. How do you know which packets were going
to boxes that are connected out that wire vs some other bridge port?
i.e. if you used tcpdump today, you've got a bunch of packets that
show a conversation between two hosts. How do you know from the
capture which packets were sent out the NIC vs which were received?

Say you've got two raw capture files from different interfaces that
have different media types. How do you merge them into one for easier
analysis? (Current pcap files encode the link type in the file header,
thus implying every packet has the same MAC type.)

>> The purpose of the sequence number is to provide the rolling counter
>> of the packets captured for the one in question. Thus if in successive
>> reads the count went from 2 to 5, you know 3 packets have been missed.
>>
> what if the count goes from 3 to... 3, ie. the seq number overflowed
> (for whatever reason) ?

So while the program was sleeping, 4 billion packets went through.
Well, I suppose that's only an hour or so of sleeping with line
rate on a 10G card. I think there's a chance that a sequence number
wrap will be noticed in those conditions.... not to mention that
grabbing the BPF statistics would show a very very large delta in
bs_drop. But left long enough on a fast NIC, even that will wrap.
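
To put rough numbers on the wrap, here is a small sketch (not part of the posted diff). The helper name and the packet rates fed to it are purely illustrative assumptions; it just divides 2^32 by a steady packet rate.

```c
#include <stdint.h>

/*
 * Hypothetical helper: seconds until a 32-bit packet counter wraps
 * when packets arrive at a steady rate. The rate is an input chosen
 * by the caller, not a measured figure.
 */
static double
seqno_wrap_seconds(double pkts_per_sec)
{
	return 4294967296.0 / pkts_per_sec;	/* 2^32 packets */
}
```

At 1,000,000 pps this gives about 4295 seconds (roughly 72 minutes); at 14,880,000 pps, the rate for minimum-size frames on 10G ethernet, it is closer to 289 seconds.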

>> /*
>>  * Enhanced BPF packet record structure
>>  */
>> typedef struct ebpf_rec_s {
>>       uint64_t        ebr_secs;       /* No more Y2k38 problem */
> why unsigned ? currently `tv_sec' is signed. Why not use time_t ?
> There is an obvious ABI breakage when we switch to a 64-bit time_t,
> but it would be a better type than a raw integer. The breakage is a
> different problem and should be dealt with separately.

I'd use "time_t" here but I don't want to risk that being
mistaken for a 32bit value. uint64_t allows me to be specific
about the size of the field. Why unsigned? Because unsigned
containers never influence the value that gets put in them.

>>       uint32_t        ebr_nsecs;
>>
> why do you want nanosecond precision if you are getting your information
> from a microsecond-precision variable? There is no information gain
> there, and your code reflects this (i.e. you just "* 1000" to get the
> nanosecond value from the microsecond value).

Let's see... with a 10Gb port, what do you think the spacing
is between packets when they're arriving at a rate of 10,000,000
per second? Finer than microsecond granularity can provide.
I don't know what the current line speed tests of NetBSD are
with 10G cards, but at Sun I've seen boxes forwarding at
greater than 50% of 10G line speed (>5,000,000 pps.)

The point of defining the field in this manner is to make it
easily possible for future code changes to take advantage of
the extra precision available.

> This field would have a meaning if you change the call to
> microtime() to nanotime() in bpf_tap()/bpf_deliver() and build a
> homegrown `struct timeval' in the non-extended capture format. You
> don't have any precision loss in that case.

I'm just trying to leverage off of existing code and make the
minimal amount of changes necessary to support a new format.

But by defining a new time format to use nano-seconds rather
than microseconds, I make the change you've described possible.
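
As a rough illustration of the two paths (a sketch only, not the code in the diff): the struct and function names below are hypothetical stand-ins for the ebr_secs/ebr_nsecs pair, filled either from a microsecond source, which is the "* 1000" case, or from a nanosecond source with no loss.

```c
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical stand-in for the ebr_secs/ebr_nsecs pair. */
struct ebpf_stamp {
	uint64_t secs;		/* fixed 64 bits, whatever time_t is */
	uint32_t nsecs;		/* 0 .. 999999999 */
};

/* Microsecond source: the low three decimal digits are always zero. */
static struct ebpf_stamp
stamp_from_timeval(const struct timeval *tv)
{
	struct ebpf_stamp s;

	s.secs = (uint64_t)tv->tv_sec;
	s.nsecs = (uint32_t)tv->tv_usec * 1000U;
	return s;
}

/* Nanosecond source: copied through without losing precision. */
static struct ebpf_stamp
stamp_from_timespec(const struct timespec *ts)
{
	struct ebpf_stamp s;

	s.secs = (uint64_t)ts->tv_sec;
	s.nsecs = (uint32_t)ts->tv_nsec;
	return s;
}
```

The record layout is the same either way, so a consumer does not need to change if the kernel side later moves to a finer-grained clock. At 10,000,000 pps the spacing between packets is 100 nanoseconds, which only the second form can express.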

For example, I don't know if nanotime() is designed to be called
1 million or more times a second... it may be the wrong thing to
use when it becomes necessary to deal with packets at that speed.
Even now, microtime isn't that fine-grained (it's rather chunky),
so I'm not trying to pretend that nano-second precision is
possible with the existing APIs but at the same time, if a change
is to be made then it needs to look forward and that means using
nanoseconds here.

> btw, why not just using a `struct timespec' ?

Because I don't want there to be any vagueness about the size of
the field to store seconds in.

For example, even -current on i386 defines time_t (which timespec
uses) as being a "long", so it would be 32bits. Again, if change
is to be made then we need to apply some amount of future-proofing.

I suppose that we could define it in terms of picoseconds if you
feel that nanoseconds is not enough and make it a 64bit field too?

>>       uint32_t        ebr_seqno;      /* sequence number in capture */
> how to detect wrap in sequence number ?

That's up to the consumer to decide. Whatever size field is used,
there's always going to be a "wrap problem", no matter what sort
of counter or wrap-counting counter is used.

I could almost be convinced to make this a 64bit counter but the
counter it pulls information from (bh_ccount) is only 32bits on
some platforms (it's a long in bpfdesc.h) so it's possibly a waste
of bits, anyway. Then again, maybe bh_{c,d,r}count should all be
forcibly bumped to 64bits and then this also...

> As we have timestamps, this can be used to order sequence numbers as
> done with TCP's PAWS I guess.

This field isn't there for sequencing, it's to provide the
consumer of the BPF data with knowledge about whether or not
there has been a dropped packet in the block of data received.
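
A consumer-side check could look something like the sketch below. It is an assumption on my part that adjacent records differ by exactly one in ebr_seqno when nothing was dropped; if the counter is defined differently, the "- 1" changes. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical consumer helper: number of records missing between two
 * records seen back to back. Unsigned 32-bit arithmetic is modular, so
 * a single wrap of the counter is handled without any special case.
 */
static uint32_t
ebpf_missed(uint32_t prev_seqno, uint32_t cur_seqno)
{
	return (uint32_t)(cur_seqno - prev_seqno - 1U);
}
```

With this, ebpf_missed(2, 3) is 0 and ebpf_missed(0xffffffff, 2) is 2 (records 0 and 1 were lost across the wrap). What it cannot tell you is how many times the counter wrapped, which is where the bs_drop statistics come in.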

>>       uint32_t        ebr_flags;
>>       uint32_t        ebr_rlen;       /* 16 bits is not enough for IPv6   */
>>       uint32_t        ebr_wlen;       /* Jumbograms, so we have to use    */
>>       uint32_t        ebr_clen;       /* 32 bits to represent all lengths */
>>       uint32_t        ebr_pktoff;
>>       uint16_t        ebr_type;       /* DLT_* type */
>>       uint16_t        ebr_subtype;
>> } ebpf_rec_t;
>>
>> /*
>>  * rlen = total record length (header + packet)
>>  * wlen = wire length of packet
>>  * clen = captured length of packet
>>  * pktoff = offset from ebr_secs to the start of the packet data (may not be
>>  *          the same as sizeof(ebpf_rec_t))
>>  *
>>  * flags are asa below:
> s/asa/as/ :)
> 
>>  */
>> #define        EBPF_OUT                0x00000001      /* Transmitted packet */
>>
> I guess there will also be EBPF_IN, do you foresee any other flag possible ?

How do you know if something is black?

If EBPF_OUT isn't set to indicate out, doesn't that
then imply that the packet is an "input" packet?
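
To show how a consumer might use rlen, pktoff and the single direction flag together, here is a minimal sketch. The struct and EBPF_OUT are copied from the proposal quoted above; the function name is hypothetical, and it assumes records sit back to back in the read buffer with ebr_rlen already covering any padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct ebpf_rec_s {
	uint64_t	ebr_secs;
	uint32_t	ebr_nsecs;
	uint32_t	ebr_seqno;
	uint32_t	ebr_flags;
	uint32_t	ebr_rlen;
	uint32_t	ebr_wlen;
	uint32_t	ebr_clen;
	uint32_t	ebr_pktoff;
	uint16_t	ebr_type;
	uint16_t	ebr_subtype;
} ebpf_rec_t;

#define	EBPF_OUT	0x00000001	/* Transmitted packet */

/*
 * Hypothetical consumer: walk a buffer of records, return how many
 * were transmitted and store the total record count in *total.
 * Anything without EBPF_OUT set is treated as received.
 */
static size_t
ebpf_count_out(const unsigned char *buf, size_t len, size_t *total)
{
	size_t off = 0, out = 0, n = 0;
	ebpf_rec_t hdr;

	while (len - off >= sizeof(hdr)) {
		/* Copy the header out; the buffer may not be aligned. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		/* Stop on a record that does not fit or is malformed. */
		if (hdr.ebr_rlen < sizeof(hdr) || hdr.ebr_rlen > len - off)
			break;
		if (hdr.ebr_pktoff > hdr.ebr_rlen ||
		    hdr.ebr_clen > hdr.ebr_rlen - hdr.ebr_pktoff)
			break;
		/* Packet data is at buf + off + hdr.ebr_pktoff. */
		n++;
		if (hdr.ebr_flags & EBPF_OUT)
			out++;
		off += hdr.ebr_rlen;	/* step to the next record */
	}
	*total = n;
	return out;
}
```

The consumer never needs sizeof(ebpf_rec_t) to find the packet or the next record, so the header can grow later without breaking old readers.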

Darren

