tech-net archive


Re: NetBSD 5.1 TCP performance issue (lots of ACK)

On 23 Nov, 2011, at 03:12, Manuel Bouyer wrote:

> On Tue, Nov 22, 2011 at 03:10:52PM -0800, Dennis Ferguson wrote:
>> [...]
>> You are assuming the above somehow applied to Intel CPUs which existed
>> in 2004, but that assumption is incorrect.  There were no Intel (or AMD)
>> CPUs which worked like that in 2004, since post-2007 manuals document the
>> ordering behavior of all x86 models from the 386 forward, and explicitly
>> say that none of them have reordered reads, so the above could only be a
>> statement of what they expected future CPUs might do and not what
>> they actually did.
> This is clearly not my experience. I can say for sure that without lfence
> instructions, the xen front/back drivers are not working properly
> (and I'm not the only one saying this).

I am very sure that adding lfence() calls to that code fixed it.  What I
suspect is that you don't understand why it fixed it, since I'm pretty positive
the original problem couldn't have been an Intel CPU reordering reads from cached
memory.  For example if the thing you did to generate the instruction was
either a function call or an `asm volatile ("lfence":::"memory")' it will
have effects beyond just adding the instruction and those effects, rather
than the instruction, might be what mattered.
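To make the distinction concrete, here is a minimal sketch of the two
kinds of barrier, assuming GCC/Clang inline-asm syntax.  The ring layout
and the names compiler_barrier(), lfence_barrier() and ring_peek() are
made up for illustration and are not taken from the xen drivers.  The
"memory" clobber is what stops the compiler from caching or reordering
memory accesses across the statement; the first macro has that effect
without emitting any instruction at all.

```c
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax): emits no instruction, but the
 * "memory" clobber tells the compiler that memory may have changed, so it
 * must not keep memory values in registers or move accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* The same compiler constraint plus an actual lfence instruction.
 * Only meaningful on x86; elsewhere this sketch falls back to the
 * compiler-only form. */
#if defined(__x86_64__) || defined(__i386__)
#define lfence_barrier() __asm__ __volatile__("lfence" ::: "memory")
#else
#define lfence_barrier() compiler_barrier()
#endif

#define RING_SLOTS 4u

/* Hypothetical shared ring: the producer fills a slot, then advances prod. */
struct ring {
    uint32_t prod;
    uint32_t slot[RING_SLOTS];
};

/* Read the producer index first and the slot second.  The barrier keeps
 * the compiler from hoisting the slot load above the index check.
 * Returns 1 and stores the slot value in *out, or 0 if the ring is empty. */
int ring_peek(struct ring *r, uint32_t cons, uint32_t *out)
{
    if (r->prod == cons)
        return 0;
    compiler_barrier();
    *out = r->slot[cons % RING_SLOTS];
    return 1;
}
```

Swapping compiler_barrier() for lfence_barrier() in ring_peek() gives the
compiler exactly the same constraints and adds one instruction, so a bug
that goes away with either form says nothing about CPU read reordering.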

>> This is clear in the post-2007 revision I have, where the section you quote
>> above now says:
> It also says that we should not rely on this behavior and, for compatibility
> with future processors programmers should use memory barrier instructions
> where appropriate.

If you are talking about the last paragraph in 7.2 it doesn't say you should
add memory barrier instructions where they serve no purpose.  It says you should
use a memory synchronization API that can be made to do the right thing if
ordering constraints become weaker in future.  With current hardware an
'lfence' instruction, while being costly to execute, is very nearly useless
(I've heard it is useful only for write-combining memory), so it makes no
sense for the API to generate it until there are CPUs which need it.
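As a rough illustration of what such an API could look like, the sketch
below wraps the C11 atomic_thread_fence() in a hypothetical
read_barrier() helper.  The helper name and the mailbox structure are
assumptions made for this example.  On x86, compilers typically emit no
instruction for an acquire fence and treat it only as a compiler
barrier, while on weakly ordered machines the same source produces a
real barrier instruction.  Note that this covers ordering between CPUs
only, not device DMA.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical wrapper: callers state the ordering they need and the
 * toolchain decides per target whether an instruction is required. */
static inline void read_barrier(void)
{
    atomic_thread_fence(memory_order_acquire);
}

/* Hypothetical mailbox: the writer fills payload, then sets ready
 * (with a release store or fence on its side). */
struct mailbox {
    _Atomic uint32_t ready;
    uint32_t payload;
};

/* Returns 1 and stores the payload in *out if the mailbox is ready,
 * otherwise returns 0 and leaves *out untouched. */
int mailbox_read(struct mailbox *m, uint32_t *out)
{
    if (atomic_load_explicit(&m->ready, memory_order_relaxed) == 0)
        return 0;
    read_barrier();          /* payload must be read after ready */
    *out = m->payload;
    return 1;
}
```

If a future x86 did weaken read ordering, only the expansion of the
fence would need to change, not the callers.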

> Anyway, what prompted this discussion is the added bus_dmamap_sync()
> in the wm driver. It's needed because:
> - we may be using bounce buffering, and we don't know in which order
>  the copy to bounce buffer is done
> - all the world is not x86.

Same thing.  I'm sure the bus_dmamap_sync() (or some bit of API which generates
a barrier instruction on machines that would need it) is required there for some
machines other than the x86, but the fact is that the problem occurred on an x86
and a read barrier instruction by itself isn't fixing any problem there
(though apparently the compiler barrier that comes along with that might do
the trick in this case).
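For the bounce-buffer case raised above, here is a small userspace model
of why a driver syncs before trusting a descriptor.  It is not the wm
driver's code and does not use the real bus_dma(9) interface: struct
bounce_map, sim_sync_postread(), rx_poll() and DESC_DONE are all
hypothetical stand-ins.  The sync is modelled as a copy from the
device-written buffer into the driver's view, which is roughly what a
POSTREAD sync amounts to when bouncing is in use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical receive descriptor and "done" bit. */
struct rx_desc {
    uint32_t len;
    uint32_t status;
};
#define DESC_DONE 0x1u

/* Model of a bounced mapping: the device writes one copy and the driver
 * reads another, which is refreshed only by an explicit sync. */
struct bounce_map {
    struct rx_desc device_copy;
    struct rx_desc driver_copy;
};

/* Stand-in for a POSTREAD sync over [off, off + len). */
void sim_sync_postread(struct bounce_map *m, size_t off, size_t len)
{
    memcpy((unsigned char *)&m->driver_copy + off,
           (const unsigned char *)&m->device_copy + off, len);
}

/* Returns the packet length, or -1 if the descriptor is not done yet. */
int rx_poll(struct bounce_map *m)
{
    /* First sync: fetch only the status word. */
    sim_sync_postread(m, offsetof(struct rx_desc, status), sizeof(uint32_t));
    if ((m->driver_copy.status & DESC_DONE) == 0)
        return -1;
    /* Second sync: status says done, so the rest is now safe to fetch. */
    sim_sync_postread(m, 0, sizeof(struct rx_desc));
    return (int)m->driver_copy.len;
}
```

Without the second sync the driver could act on a length that was
copied before the device finished writing the descriptor, whatever
the CPU's own ordering rules are.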

Dennis Ferguson
