Subject: Making mbufs deal better with various buffer sizes
To: None <tech-net@netbsd.org>
From: Darren Reed <avalon@caligula.anu.edu.au>
List: tech-net
Date: 05/02/2004 13:38:36
Currently, the mbuf infrastructure is not well equipped to provide
good performance or ease of use with packets larger than MHLEN and
especially not those greater than MCLBYTES.  Well, maybe I lie there
a little, as NIC drivers will choose the best type of mbuf to use
when reading in packets.  How big are these two magic numbers?  On
i386, MHLEN is ~200 and MCLBYTES is 2^11 (2048).  If your application
regularly sends back and forth packets that are at least MHLEN bytes
in size, NetBSD is required to use at least two standard mbufs.
If you are using NFS over UDP with a read/write size of 32k, then
you need 17 buffers (16 2k clusters for the data, plus one mbuf for
the headers) chained together in order to store all the data.  To
put it politely, this is not an efficient way to use or access data
held in said buffers - or at least that's my opinion.

So, do I have a magic wand for this?  No.  But I would like people
to give some thought to how it could be fixed so that doing this -
ifconfig wm0 mtu 8000 - didn't suck.

One solution that comes to mind is to scrap m_dat and m_pktdat altogether
and force all data to be put in clusters.  Then change the clustering
so it supports buffers in a set of integral sizes, such as 2^n for
n=[7,16].  Whilst it is tempting to throw in a 1536 (not much wastage
from a full-frame ethernet packet), I don't know if adding a special
size for this one case is a good idea.

To compare with Solaris 2.6, netstat -k says they have data block
sizes of 72, 136, 328, 616, 1096, 1576, 1992, 2664, 4040, 8136, 12232
for STREAMS buffers.  In Solaris 10, empirical evidence from kstat shows
a much larger variety of block sizes, from 16 bytes up to 72k, including
strange numbers like 69456 (yes, I also see a 65536.)

Which suggests that, for a real performance benefit, more than 30 seconds
of thought should be given to the actual buffer sizes supported, whether
that list should be fixed or dynamic, etc.

What are the implications of this kind of change?
Well, any part of the kernel that is well behaved in its use of
mbufs, i.e. always uses mtod() and never touches m_dat/m_pktdat
directly, should not require any change.  uipc_mbuf.c would need a
large overhaul, as would parts of uipc_mbuf2.c and mbuf.h.  Well, that
is the naive list :)  The fuller answer appears to include changing
all the NIC drivers to alter how they interact with the mbuf code,
dealing with all those checks for MHLEN/MLEN/MCLBYTES and friends.
Although maybe the shortcut there is to make MHLEN & MLEN be '0' and
MCLBYTES bigger than the biggest single cluster we'd allocate.  Whilst
the MI part (kern/uipc_mbuf{,2}.c and sys/mbuf.h) could easily be
done by one person, all the MD parts would require a lot of effort,
not to mention the NIC driver changes, to the point of making this
infeasible.

It is also important that networking should continue to work on small
machines (4MB to 32MB of RAM).

Finally, this kind of change must be benchmarked before adoption to
prove that there is both a performance and a programming benefit, so
there is a potential that it could all be thrown away at the end if
it is slower or just doesn't work as well as expected.  Ideally the
end result should be less code in NIC drivers as well as better
buffering.

Comments?

Darren