Subject: SSE optimized memset
To: None <port-amd64@netbsd.org>
From: Kimura Fuyuki <fuyuki@hadaly.org>
List: port-amd64
Date: 01/18/2007 12:49:37
Hi folks,

I'm now looking at the SSE instruction set and thinking of ways to use it to
speed up some string functions, but it seems quite difficult, at least for
generic use...

Anyway, here is my first try: an SSE-optimized memset.

http://www.hadaly.org/fuyuki/memset.S.patch

The patch above adds a second booster to the current (well-tuned) memset
implementation.  It avoids cache pollution by adding "non-temporal" hints to
the MOV operations. With the normal memset, a calloc() of just a few megabytes
completely trashes the cache contents. That's too harsh for such a limited
resource.
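
To illustrate the idea in C rather than assembly, here is a minimal sketch
using SSE2 intrinsics. It is not the code from the patch: the name nt_memset
and the head/body/tail split are assumptions made for this example, and a real
implementation would likely use the streaming path only above some size
threshold.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not taken from memset.S.patch. */
void *nt_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    const unsigned char byte = (unsigned char)c;

    /* Head: ordinary byte stores until p is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = byte;
        n--;
    }

    if (n >= 16) {
        const __m128i v = _mm_set1_epi8((char)byte);

        /* The streaming store (MOVNTDQ) requires a 16-byte aligned address. */
        assert(((uintptr_t)p & 15) == 0);

        /* Body: non-temporal 16-byte stores that bypass the cache. */
        for (; n >= 16; n -= 16, p += 16)
            _mm_stream_si128((__m128i *)p, v);

        /* Non-temporal stores are weakly ordered; fence before returning. */
        _mm_sfence();
    }

    /* Tail: the remaining 0..15 bytes with ordinary stores. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```

The fence matters because streaming stores are not ordered with respect to
later ordinary stores, so without it another CPU could observe them out of
order.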

Here's a regression test as well. (Actually, it could be put in the regress/
tree on its own.)

http://www.hadaly.org/fuyuki/memset.tar.bz2

To tell the truth, I don't know the exact calling convention in NetBSD/amd64.
Is it the same as in Linux? If the kernel can freely clobber the xmm
registers, the #ifdef could be removed.

Any reports or suggestions are appreciated.  Does the patch work on
MP machines? (I'm testing it on a cheap Celeron...)  Does it speed things up?
(I feel some improvement!)  Any ideas for tuning?