Re: NetBSD 5.1 TCP performance issue (lots of ACK)



On Wed, Nov 23, 2011 at 12:12:05PM +0100, Manuel Bouyer wrote:
> On Tue, Nov 22, 2011 at 03:10:52PM -0800, Dennis Ferguson wrote:
> > [...]
> > You are assuming the above somehow applied to Intel CPUs which existed
> > in 2004, but that assumption is incorrect.  There were no Intel (or AMD)
> > CPUs which worked like that in 2004, since post-2007 manuals document the
> > ordering behavior of all x86 models from the 386 forward, and explicitly
> > say that none of them have reordered reads, so the above could only be a
> > statement of what they expected future CPUs might do and not what
> > they actually did.
> 
> This is clearly not my experience. I can say for sure that without lfence
> instructions, the xen front/back drivers are not working properly
> (and I'm not the only one saying this).

Are the xen front-/back-end drivers otherwise correct?  I.e., do they
use volatile where they ought to?  wm(4) definitely does *not* use
volatile everywhere it ought to, and I've just found out that this is
what explains the bug.

I've just tried the same experiment on the netbsd-5 branch.  The
compiler generates different assembly for wm_rxintr() before and after
adding volatile.  The before-assembly definitely loads wrx_len before
wrx_status, which is wrong; the after-assembly loads wrx_status first.
So we can explain the wm(4) bug by re-ordering of reads by the
compiler, not the CPU.
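
To make the hazard concrete, here is a sketch (not the actual wm(4)
descriptor layout, and RXD_DONE is a made-up bit):

	#include <stdint.h>

	/* The NIC DMAs wrx_len first, then sets the "done" bit in
	 * wrx_status to publish the descriptor. */
	struct rxdesc {
		uint32_t wrx_len;
		uint32_t wrx_status;
	};

	#define RXD_DONE	0x01	/* hypothetical "done" bit */

	uint32_t
	rx_poll(struct rxdesc *d)
	{
		/* Nothing here tells the compiler the loads are
		 * ordered, so it is free to load wrx_len before
		 * wrx_status -- exactly the before-assembly above. */
		if ((d->wrx_status & RXD_DONE) == 0)
			return 0;
		return d->wrx_len;	/* may be a stale, pre-DMA value */
	}

Declaring the members volatile forbids that particular reordering by
the compiler; it says nothing about the CPU or a bounce buffer, which
is what the bus_dmamap_sync() calls are for.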

(BTW, in -current, when I added volatile to the rx descriptor
members and recompiled, the compiler generated the same assembly for
wm_rxintr().  Makes me wonder: does the newer GCC in -current mask a
lot of latent bugs?)

> > This is clear in the post-2007 revision I have, where the section you quote
> > above now says:
> 
> It also says that we should not rely on this behavior and that, for
> compatibility with future processors, programmers should use memory
> barrier instructions where appropriate.

Agreed.
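
For the record, the pairing I understand the manual to be asking for
looks something like this, using membar_ops(3) (a sketch with made-up
names, not the actual xen ring code):

	#include <sys/types.h>
	#include <sys/atomic.h>

	struct slot {			/* hypothetical shared slot */
		volatile uint32_t data;
		volatile uint32_t ready;
	};

	void
	produce(struct slot *s, uint32_t v)
	{
		s->data = v;
		membar_producer();	/* data visible before ready */
		s->ready = 1;
	}

	uint32_t
	consume(struct slot *s)
	{
		while (s->ready == 0)
			continue;
		membar_consumer();	/* ready read before data */
		return s->data;
	}

On x86 membar_consumer() may compile down to almost nothing, but
writing the pair keeps the code correct on machines where loads
really are reordered.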

> Anyway, what prompted this discussion is the added bus_dmamap_sync()
> in the wm driver. It's needed because:
> - we may be using bounce buffering, and we don't know in which order
>   the copy to the bounce buffer is done
> - all the world is not x86.

I agree strongly with your bullet points, and I think that by the same
rationale, we need one more bus_dmamap_sync(). :-)
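
Concretely, the shape I have in mind for the rx path is (a sketch
under my reading of bus_dma(9), with made-up names, not a patch):

	/* The device may have written the descriptor; sync before
	 * the driver reads it. */
	bus_dmamap_sync(sc->sc_dmat, sc->sc_rxmap, off,
	    sizeof(struct rxdesc), BUS_DMASYNC_POSTREAD);

	status = rxd->wrx_status;
	if (status & RXD_DONE) {
		len = rxd->wrx_len;
		/* ... pass the packet up ... */

		/* Give the descriptor back to the device. */
		rxd->wrx_status = 0;
		bus_dmamap_sync(sc->sc_dmat, sc->sc_rxmap, off,
		    sizeof(struct rxdesc),
		    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
	}

With bounce buffering, the POSTREAD sync is where the copy out of the
bounce buffer happens, so it has to precede the wrx_status load no
matter what the CPU does.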

Maybe I do not remember correctly, but I thought that the previous
discussion of how many _sync()s to use, where they should go, and why,
left off with me asking, "what do you think?"  I really do want to know!

Dave

-- 
David Young
dyoung@pobox.com    Urbana, IL    (217) 721-9981

