Subject: Re: Kernel copyin/out optimizations for ARM... : revisited
To: None <port-arm@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-arm
Date: 03/23/2002 21:07:14
On Sat, Mar 23, 2002 at 09:30:01PM +0100, Reinoud Zandijk wrote:
> Hiya folks,
> 
> i've followed this thread for some time now and there's something i don't
> completely grasp... isn't the whole idea of optimising the copying of (big)
> stuff inside the kernel and between kernel and userland one of the design
> principles of the UVM memory system?
> 
> Shouldn't we rather look at ways to shift/replace memory pages between
> userland and kernel? Or otherwise change the API to make that possible? I
> think that's more the way to go than trying to keep on copying stuff... that
> was the way of the old memory system.
> 
> OK... maybe i am a bit too optimistic, but maybe it's also a clue :-D

There are an awful lot of small pieces of data that have to be
shifted to and fro.  Getting these right can give a reasonable
performance increase across a large set of test cases.

Clearly there are some where 'page stealing' can be used.  But it
too has hidden costs.  You also need to get the copy loop
properly optimised before considering other cases.
(In the past I've improved the performance of network drivers
by ADDING a data copy!)
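As a minimal sketch of the kind of copy-loop optimisation being discussed
(this is illustrative only - the real NetBSD copyin/copyout routines are
hand-written assembler, and `copy_words` is a made-up name): copy whole
32-bit words when source and destination are mutually aligned, and fall
back to bytes otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy loop: word-at-a-time when src and dst share the
 * same alignment within a word, byte-at-a-time otherwise. */
static void
copy_words(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Word copies are only safe when src and dst are mutually aligned. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
		/* Advance byte by byte up to a word boundary. */
		while (len > 0 && ((uintptr_t)d & 3) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Bulk of the copy: one 32-bit word per iteration. */
		while (len >= 4) {
			*(uint32_t *)(void *)d = *(const uint32_t *)(const void *)s;
			d += 4;
			s += 4;
			len -= 4;
		}
	}
	/* Trailing (or mutually misaligned) bytes. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

A real routine would also unroll the word loop and use multi-register
load/store instructions, which is where most of the ARM-specific tuning
lives.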

The test this all started with is copying a VERY large buffer between two
processes (via a pipe).  The cost of the process switches will be dominant
if the pipe writes aren't atomic (I've not looked at the NetBSD pipe
code).  IIRC the sparc system doesn't need to flush its caches on process
switches...
I've noticed some new pipe (write?) code that does do page stealing - this
may not be in the ARM build.  Also, it may not be a gain unless you have
very careful process scheduling - the data has to be read out of the
pipe before the writer updates the buffer (forcing a copy on write).
Page stealing on read is somewhat harder - it requires the data to
have the same alignment in the kernel as in the target user buffer.
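The alignment constraint above can be made concrete: pages can only be
remapped rather than copied when source and destination sit at the same
offset within a page, so that whole pages line up.  A sketch of the
precondition check (the function name, the fixed `PAGE_SIZE`, and the
minimum-length rule are all assumptions for illustration, not NetBSD code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u		/* illustrative; the real value is per-machine */

/* Page stealing is only a candidate when the kernel and user addresses
 * have identical offsets within a page, and the transfer spans at least
 * one full page that could be remapped. */
static bool
can_steal_pages(uintptr_t kva, uintptr_t uva, size_t len)
{
	return (kva & (PAGE_SIZE - 1)) == (uva & (PAGE_SIZE - 1)) &&
	    len >= PAGE_SIZE;
}
```

Even when this holds, any partial pages at the head and tail of the
buffer still have to be copied the ordinary way.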

I think my 'optimised' copy was slower for smallish transfers - but I
don't know where the boundary is.  I need to sit down and do a whole load
of tests.  Unfortunately, writing a good copy routine is hard and, to
some extent, dependent on whether the target data is needed in the
data cache (the StrongARM chips have an odd cache architecture!).

My guess is that short transfers (certainly for copyin) are wanted in
the cache.  Very long transfers will flush the data cache on the read
side, so it doesn't matter.
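That guess amounts to a size-based dispatch: short transfers take the
ordinary, cache-filling path; long ones would take a cache-bypassing path,
since they are going to displace the cache contents anyway.  A sketch of
the shape of such a dispatcher (the threshold value, the counters, and the
use of `memcpy` as a stand-in for both paths are assumptions - the real
crossover would have to be measured per CPU):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical crossover point; only measurement can place it. */
#define COPY_CACHE_THRESHOLD 1024

/* Instrumentation so the sketch's dispatch decision is observable. */
static size_t cached_calls, uncached_calls;

static void
copy_dispatch(void *dst, const void *src, size_t len)
{
	if (len < COPY_CACHE_THRESHOLD)
		cached_calls++;	/* short: data likely wanted in the cache */
	else
		uncached_calls++; /* long: would use a cache-bypassing loop */
	/* Both paths copy the data; only the copy strategy would differ. */
	memcpy(dst, src, len);
}
```

The point is only that the decision is cheap (one compare on the length),
so a split strategy costs almost nothing on the fast path.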

If I do any further tests, I might have to check what happens in
user mode - but my SA1100 system doesn't run NetBSD and always runs
in system mode - making some of the cache/MMU interactions difficult
to test...


	David

-- 
David Laight: david@l8s.co.uk