tech-perform archive


Re: [SOLVED] Re: Only 8 MB/sec write throughput with NetBSD 5.1 AMD64



On Fri, Oct 14, 2011 at 03:08:51PM -0700, Tony Bourke wrote:
> > 
> > Why assume?  And why trust benchmarketing (are there really any industry
> > publications that don't primarily engage in precisely that, any more?)
> 
> One of the benchmarks I referenced wasn't from any company, just a curious 
> individual in the virtualization field. The other was a test done by VMWare 
> and NetApp, and they weren't comparing themselves to competitors, just 
> comparing protocols that were already supported. I don't see how that's 
> benchmarketing. 

This is going to look very different on virtual "hardware" and real
hardware.  The costs of manipulating the VM system are dramatically
different, and, often enough, the virtual system already has to (or
chooses to) move data around by page flipping.

VMware is going to recommend whatever looks best for their virtualized
environment -- which includes a virtual "switch", which will want more
resources (and possibly have to copy more data) to handle large frames.
Would they deliberately cook a benchmark to favor whatever looks good for
them, but indifferent or bad on real hardware?  Probably not.  But do they
care if they accidentally report such results and claim they're general?
Again, probably not.

> > It is very, very easy to measure for yourself and see.  

You repeat this with no comment.  I assume you didn't try the experiment,
however.

> > Note that 9K
> > MTU is strictly a lose unless you have a >9K hardware page size as the
> > SGI systems where the 9K MTU was originally used did.
> 
> Do you have any references to back that up? I'm genuinely curious. 

There was a beautiful PowerPoint presentation on this on one of the
FreeBSD developers' home pages for many years.  I can't find it any more,
which is frustrating.

But, for historical perspective, here is where that weird 9K frame size
comes from, and why it's a poor choice of size for most newer systems:
In the era of SGI Challenge hardware (very large early R4xxx systems)
NFS was almost always used over UDP transport rather than TCP, and most
NFSv2 implementations couldn't use any RPC size other than 8192 bytes.

With Ethernet sized MTUs, the result was that every single NFS RPC for
a bulk data transfer could be guaranteed to generate 6 IP fragments.  Not
good.
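
To put rough numbers on that (this ignores the RPC and NFS header
overhead, which only pushes the count higher -- it's just a
back-of-the-envelope sketch):

    #include <stdio.h>

    /*
     * Fragments generated by one 8192-byte NFSv2 RPC over UDP with a
     * 1500-byte Ethernet MTU.  Each fragment carries at most
     * (MTU - IP header) bytes of the original datagram.
     */
    int
    main(void)
    {
        int mtu = 1500, ip_hdr = 20, udp_hdr = 8;
        int datagram = 8192 + udp_hdr;
        int per_frag = mtu - ip_hdr;
        int frags = (datagram + per_frag - 1) / per_frag;

        printf("%d-byte UDP datagram -> %d IP fragments\n",
            datagram, frags);
        return 0;
    }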

So SGI used a HIPPI interface (which could run with a very large MTU)
to evaluate the effect of different MTU settings on UDP NFS performance.
They found a sweet spot at 8K + RPC headers + UDP headers + IP headers +
HIPPI headers -- which works out to just about 9K -- and published this
result.  At around the same time, Cray did similar testing and found 30K
to be optimal but didn't publicize this as widely (see, for example,
http://docs.cray.com/books/S-2366-12/html-S-2366-12/zfixedhv8xfxfn.html).

This was taken up by the people working on the various Internet2 projects,
and there was a resulting call to make the whole Internet backbone "9K
clean".

Unfortunately, it appears nobody did much testing on anything with a
page size smaller than 16K.  Because that's the other interesting thing
about SGI's result that seemingly didn't occur to them at the time:

It turns out that 9K has _two_ important properties that made it appear
optimal in their benchmark: it is *large* enough to hold an encapsulated
8K NFS transaction but it is also *small* enough to fit in one hardware
page, thus minimizing memory allocation overhead (and presumably thrashing
of pages between CPUs in their large multiprocessor system, too).  But
that is only true because they had a 16K page size.

Anyway, there were other large frame sizes "in play" at the time: 3Com
and others were pushing the 4K FDDI frame size for use with Ethernet,
for example, even at 100Mbit.  And early Gigabit Ethernet equipment
supported a whole range of weird frame sizes all approximately 9K -- for
example at one point I evaluated switches for an HPC cluster and found
they had maximum MTUs of 8998, 9000, 9024, and 9216 bytes respectively.

Unsurprisingly consumers and system designers started to do their own
testing to see which of these frame sizes might be optimal.  This would
be around 2001-2002.  CPUs were slower and network adapters didn't do
coalescing optimizations like segmentation offload or large receive, so
it was a lot easier to see the effect of reducing packet count.  I know
Jason Thorpe and I had a long discussion of this in public email -- I
thought it was on one of the NetBSD lists, but searching around, it
looks like it may not have been.  Right around the same time we were
converging on an optimal MTU of 8k - headers, one of the FreeBSD developers
profiled their kernel while benchmarking with different frame sizes and
made this beautiful graph that explained why this size was a win: because
it minimized the allocation overhead per unit of data transferred.

As I recall, the difference between (4k - headers) and (8k - headers) on
a system with 4K pages is not actually that large, but 8k wins slightly.
What is definitely a lose is 9K, where you allocate a whole extra page
but can't really put any useful amount of data in it.
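
Counting pages makes the difference plain; the frame sizes below are
just representative, not anyone's exact configuration:

    #include <stdio.h>

    /* Pages consumed by one receive buffer of the given frame size. */
    static int
    pages_for(int frame, int pagesize)
    {
        return (frame + pagesize - 1) / pagesize;
    }

    int
    main(void)
    {
        int jumbo = 9000;              /* "9K" jumbo frame */
        int sub8k = 8192 - 40 - 20;    /* an MTU a bit under 8K */

        printf("4K pages:  9000 -> %d, sub-8K -> %d\n",
            pages_for(jumbo, 4096), pages_for(sub8k, 4096));
        printf("16K pages: 9000 -> %d, sub-8K -> %d\n",
            pages_for(jumbo, 16384), pages_for(sub8k, 16384));
        return 0;
    }

On 4K pages the 9000-byte buffer costs three pages with only about 800
useful bytes in the third; on a 16K page, both fit in one.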

Of course none of this benchmarking is even worth doing if you're not
careful to eliminate other bottlenecks -- particularly, small socket
buffer sizes, which artificially constrain the TCP window and are
probably the single most common cause of "poor throughput" complaints
on our mailing lists -- though of course that's only for a single-stream
test; for a multi-stream application, you often _want_ small socket buffer
sizes, for fairness.  Measuring this stuff and then applying the results
to the real world is not simple.
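
For a single stream, the ceiling a small socket buffer imposes is
trivial arithmetic -- the numbers below are only an example, not a
claim about anyone's actual setup:

    #include <stdio.h>

    /*
     * A single TCP stream moves at most one window per round trip,
     * and the window can't exceed the socket buffer.
     */
    int
    main(void)
    {
        double sockbuf = 32 * 1024;    /* bytes */
        double rtt = 0.001;            /* seconds (1 ms) */

        printf("ceiling: %.1f MB/s\n", sockbuf / rtt / (1024 * 1024));
        return 0;
    }

A 32K buffer over a 1 ms round trip can't do better than about 31 MB/s,
no matter how fast the link is.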

But I firmly believe you have to start with solid measurements of the
basic numbers (such as single-stream TCP throughput) before you try to
draw general conclusions from benchmarks of specific very complex
applications where there are many confounding factors -- like, say,
iSCSI between virtual machines.  And if it weren't 9AM on a Sunday and
my very cute 4 year old weren't clamoring for attention, I'd fire up
a couple of spare systems and generate some results for you -- but it
is, so you'll have to run the test yourself, or perhaps pester me to
do it later.
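
If you do run it yourself, something as crude as the untested sketch
below will get you the basic single-stream number; the address, port,
transfer size and buffer size are placeholders, and the far end can be
anything that sinks data (ttcp -r, netcat writing to /dev/null, the
discard service).  Vary the socket buffer size and the interface MTU
between runs and compare.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <err.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Crude single-stream TCP throughput test: time a bulk write to a
     * remote sink.  192.0.2.1:5001 and the 256K buffer are placeholders.
     */
    int
    main(void)
    {
        static char buf[65536];
        struct sockaddr_in sin;
        struct timeval t0, t1;
        long long sent = 0, target = 1LL << 30;    /* 1 GB */
        int s, sockbuf = 256 * 1024;
        double secs;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);
        sin.sin_addr.s_addr = inet_addr("192.0.2.1");

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sockbuf,
            sizeof(sockbuf)) == -1)
            err(1, "setsockopt");
        if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
            err(1, "connect");

        gettimeofday(&t0, NULL);
        while (sent < target) {
            ssize_t n = write(s, buf, sizeof(buf));
            if (n == -1)
                err(1, "write");
            sent += n;
        }
        gettimeofday(&t1, NULL);
        close(s);

        secs = (t1.tv_sec - t0.tv_sec) +
            (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s\n", sent / secs / (1024 * 1024));
        return 0;
    }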

One thing to be _very_ aware of, which changes the balance point
between large and small maximum frame size considerably, though, is
logical frame coalescing -- on receive and on send.  This lets the
kernel (and the adapter's descriptor mechanism) treat small frames
like large ones and eliminate most of the software overhead involved
with small frame sizes.  This is "segmentation offload" or "large send"
on the transmit side, and "receive side coalescing" or "large receive"
on the receive side.
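
Rough numbers again, purely illustrative, to show the scale of the
difference in per-packet work the stack does:

    #include <stdio.h>

    /*
     * Segments the stack handles per megabyte of bulk TCP data, for a
     * standard Ethernet MSS, an ~8K jumbo MSS, and a TSO/LRO path
     * where the stack sees ~64K logical frames.
     */
    int
    main(void)
    {
        int megabyte = 1024 * 1024;

        printf("1460-byte MSS:      %d segments\n", megabyte / 1460);
        printf("8132-byte MSS:      %d segments\n", megabyte / 8132);
        printf("64K logical frames: %d\n", megabyte / 65536);
        return 0;
    }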

NetBSD supports segmentation offload but *not* receive side coalescing.
So results with NetBSD, particularly for receive side CPU consumption,
at small frame sizes, may be different (worse) than what you see with
some other operating systems -- and the beneficial effect of large
frames correspondingly larger.

Also, don't forget there are a lot of cheap network adapters out there
with poor (or poorly documented, thus poorly supported) receive
interrupt moderation, and that's another case where reducing frame
count for the same amount of data transferred really helps.

> > I really can't agree with you about path MTU discovery, either.  With
> > proper blackhole detection (and if you don't have that, then you
> > should not be using path MTU discovery at all) it's plenty reliable;
> > and in any event, using a large local MTU won't cause a sudden magic
> > change of default to use the link layer MTU as the initial MTU for
> > remote peers anyway; only local ones.
> 
> If you're talking about MSS clamping, I agree.

I don't know what MSS clamping has to do with this, so I can't comment.
My basic point is that running path MTU discovery without blackhole
detection is insane (even on IPv6 networks) but that with it, it works
fine; also that using path MTU discovery does *not* cause large packets
to be sent to remote peers without probing, so even in networks with
broken routers that don't generate or pass needs-frag ICMP messages,
path MTU does work -- which means mismatched local and remote MTU size
across internets is harmless.

> But at the same time, is the original poster upping his MTU going to
> help?

Quite possibly: he may have a stupid network card without large send
optimization or decent receive interrupt moderation, and he may have
his socket buffers set wrong (which makes TCP very sensitive to
latency even when there's plenty of bandwidth available).  In these
cases, using a larger MTU might immediately show him better performance;
and it is very easy for him to test, rather than simply being persuaded
not to because of random references to magazine articles thrown out
onto a mailing list.

The best suggestion in this thread, however, was David's, namely that
he ensure his socket buffer sizes are appropriately large.
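
For reference, the defaults he'd be changing are net.inet.tcp.sendspace
and net.inet.tcp.recvspace.  sysctl(8) shows them directly; if you want
them from a program, a minimal sketch (probing the width of the value
rather than assuming it) looks roughly like this:

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    /* Print one integer-valued sysctl node, whatever its width. */
    static void
    show(const char *name)
    {
        union { int i; long l; } val;
        size_t len = sizeof(val);

        if (sysctlbyname(name, &val, &len, NULL, 0) == -1) {
            perror(name);
            return;
        }
        printf("%s = %ld\n", name,
            len == sizeof(val.i) ? (long)val.i : val.l);
    }

    int
    main(void)
    {
        show("net.inet.tcp.sendspace");
        show("net.inet.tcp.recvspace");
        return 0;
    }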

> Unlikely, and it would likely complicate his network needlessly. 

That's your opinion.  Unfortunately, you backed it up with a lot of
vague details which were, as far as I can tell, just plain wrong (like
your claim that path MTU doesn't work, and your wrong math about PCI
bandwidth).

When misinformation like that flows out onto our lists and isn't
contradicted, it's a problem for everyone.  So I apologize if you feel
I'm being unduly adversarial, but I think it really is important that
when a user asks for help with _X_, and someone responds with related
misinformation about _Y_, both topics receive equal attention and we
try to end up with the best possible answer to each.

Thor

