Subject: Re: SMP re-entrancy in kernel drivers/"bottom half?"
To: Kentaro A. Kurahone <kurahone@sigusr1.org>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 02/24/2005 15:01:24
>Hmmm.  Aren't there other issues with TCP/IP in general that cause
>scalability issues on the high end?  Namely that unrealistically good
>link quality is necessary to grow the window sufficiently?  (Though
>there are a number of proposed changes to address this.  RFC3649 and 3742
>for instance).

Sure.  But in the SIGCOMM or TCP community, that's very old news. Take
a gigabit flow, give it three back-to-back drop episodes, then ask the
students to compute how long linear growth will take to get back to a
gigabit. (It's a *long* time.) Hence the 10GbE estimates in RFC-3649,
with 1500-byte packets, 100ms cross-continent RTT, congestion windows
of 83,000-odd packets and drop rates no greater than 2e-10.
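
A rough sketch of that classroom exercise, in Python. The assumptions
here are not from the RFC text quoted above and are only for
illustration: the gigabit flow also sees a 100ms RTT, each drop episode
halves the window, recovery adds one packet per RTT, and slow start,
delayed ACKs and timeouts are ignored. The drop-rate bound uses the
common approximation w = 1.2/sqrt(p) for standard TCP. The function
names are made up for this example.

```python
def window_packets(rate_bps, rtt_s, packet_bytes=1500):
    """Congestion window (in packets) needed to keep the pipe full."""
    return rate_bps * rtt_s / (packet_bytes * 8)


def recovery_seconds(rate_bps, rtt_s, halvings, packet_bytes=1500):
    """Time for linear growth (1 packet per RTT) to regain the full window
    after `halvings` back-to-back multiplicative decreases."""
    full = window_packets(rate_bps, rtt_s, packet_bytes)
    reduced = full / (2 ** halvings)
    rtts_needed = full - reduced  # one packet of window regained per RTT
    return rtts_needed * rtt_s


def max_drop_rate(window, c=1.2):
    """Largest loss rate p that sustains `window`, assuming w = c / sqrt(p)."""
    return (c / window) ** 2


if __name__ == "__main__":
    secs = recovery_seconds(1e9, 0.100, halvings=3)
    print(f"1 Gb/s, 3 halvings: about {secs / 60:.1f} minutes to recover")

    w10g = window_packets(10e9, 0.100)
    print(f"10 Gb/s window: about {w10g:.0f} packets")
    print(f"10 Gb/s drop-rate bound: about {max_drop_rate(w10g):.2e}")
```

Under those assumptions the gigabit flow needs roughly twelve minutes
of loss-free linear growth to get back to line rate, and the 10GbE
figures land on the 83,000-odd packets and 2e-10 quoted above.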

Hence the pointed discussion of ``so, then: what does your TOE do when
the flow becomes congested and incurs a drop''?  In retrospect, it seems
like the iSCSI WG spent *all* their time going round and round that one.

Classic VJ TCP uses an additive-increase, multiplicative-decrease
(AIMD) response. Classic control theory tells us that AIMD will be
stable (i.e., it will avoid the "congestion collapse" scenarios which
hit the Internet in the late 80s). I've seen Jon Crowcroft allude to
new results in congestion control (possibly based on better analysis
of circles in the complex plane?)  which allow for a *much* gentler
response than multiplicative decrease. But as far as I know, that's
still considered experimental, with a *much* higher risk of triggering
congestion-collapse than Sally Floyd's HighSpeed TCP (which is also
considered experimental).  But I'm not up with the latest work in that area.
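
To illustrate why AIMD is well-behaved, here is a toy Python model.
It is only a sketch, not VJ's actual implementation: two flows share
one bottleneck, both see every congestion signal in the same round,
windows are real numbers, and the capacity and starting windows are
arbitrary values picked for the example. Additive increase leaves the
gap between the two windows unchanged, and each multiplicative decrease
halves it, so the flows drift toward an equal share while the total
load stays close to capacity.

```python
def aimd_two_flows(a, b, capacity, rounds, increase=1.0, decrease=0.5):
    """Simulate two AIMD flows with synchronized congestion feedback.

    Each round stands in for one RTT. If the combined window exceeds
    `capacity`, both flows back off multiplicatively; otherwise both
    grow additively. Returns the per-round history of (a, b).
    """
    history = []
    for _ in range(rounds):
        if a + b > capacity:
            # Congestion signalled: multiplicative decrease for both flows.
            a *= decrease
            b *= decrease
        else:
            # No congestion: additive increase for both flows.
            a += increase
            b += increase
        history.append((a, b))
    return history


if __name__ == "__main__":
    # Start far from a fair split: 10 packets versus 80 packets.
    hist = aimd_two_flows(10.0, 80.0, capacity=100.0, rounds=500)
    for rnd in (0, 50, 150, 499):
        a, b = hist[rnd]
        print(f"round {rnd:3d}: a={a:6.2f} b={b:6.2f} gap={abs(a - b):.4f}")
```

A gentler decrease factor (say 0.9 instead of 0.5) shrinks the gap
more slowly in this model, which gives some feel for the trade-off
between a softer response and slower convergence.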


>> So designing a TCP stack that can only ever get high throughput from
>> TOE NICs strikes me as a losing proposition.
>> 
>> [*] With, I suppose, the possible exception of an Itanic with monster
>> 9MB caches. But NetBSD doesn't run on such CPUs yet anyway.
>
>IPoIB might be able to do it too.  I've seen IB saturate PCI-X before,
>but since that was using RDMA, I guess it's kind of hard to compare.
>(Plus, "You need to buy an $827 interface, and ridiculously expensive
>cables/switches to make TCP go fast" seems subpar to me.)

I'd have guessed well over $10k per port, including switches, but I've
never actually looked.


>I was pondering picking up an Adaptec TOE NIC, but for $800, I'm not sure
>if I can justify the cost for the curiosity factor.

Me neither, and I sure wouldn't spend that on the Intel iSCSI NIC.  An
82543 (PCI-2.2, not PCI-X) chip, with TCP offload firmware running on
a StrongARM.  (Or maybe an XScale, but I think not.)