Subject: Support for zero'ing pages in idle loop
To: None <current-users@netbsd.org>
From: Jason R Thorpe <thorpej@zembu.com>
List: current-users
Date: 04/24/2000 10:27:54
Hi folks...

I've just committed code that implements pre-zero'ing of pages in
the idle loop.  This helps zero-fill page faults a fair bit, and
also speeds up e.g. page table allocation.  Some lmbench results
on a 300MHz Celeron w/ 32M of RAM (in each table, the first row is
before the change and the second row is after):

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host                 OS  Mhz null null      open selct sig  sig  fork exec sh  
                             call  I/O stat clos       inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
i386-netb   NetBSD 1.4X  298  1.2  5.1   15   24 0.05K  2.7    4 1.1K   4K   8K
i386-netb   NetBSD 1.4X  298  1.2  5.2   16   23 0.05K  2.7    4 0.8K   4K   7K

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host                 OS   0K File      10K File      Mmap    Prot    Page       
                        Create Delete Create Delete  Latency Fault   Fault 
--------- ------------- ------ ------ ------ ------  ------- -----   ----- 
i386-netb   NetBSD 1.4X                                 7298          9.1K
i386-netb   NetBSD 1.4X                                 7098          7.7K

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host                 OS   Mhz  L1 $   L2 $    Main mem    Guesses
--------- -------------   ---  ----   ----    --------    -------
i386-netb   NetBSD 1.4X   298    10     36         202
i386-netb   NetBSD 1.4X   298    10     36         200

Note the L2 latency didn't change -- the i386 implementation zeroes
the pages with uncached accesses.  Compare this with doing it cached:

i386-netb   NetBSD 1.4X   298    10     87         200

"ouch."  For this reason, portmasters who glue this into their port's
idle loops should provide an uncached method like the i386 port does.
If that is not possible, it's probably best to just not do it at all
until we can come up with a way of doing this cached that keeps the
cache footprint small.

In other words, consider this a work-in-progress, with some incremental
benefit along the way :-)

Note that lmbench really really really stresses out memory allocators,
and doesn't leave the system with much idle time.  For this reason, the
numbers might not look that impressive (there's a large miss rate for
zero-allocations, even though there is some improvement in overall
performance).  The hit rate is much better for "normal" system activity
(e.g. running X, netscape, compiling stuff, etc.) and I've observed the
netscape-startup-time benchmark improve by 50-60% on the same system I
ran the lmbench on.
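
For anyone who wants a picture of the moving parts, here is a rough
user-space sketch of the idea.  It is NOT the committed kernel code:
the names (struct page, free_dirty, free_zeroed, work_pending,
idle_zero_pages, page_alloc_zeroed) are all made up for illustration,
and the plain memset() stands in for whatever uncached zeroing method
a port would actually provide.  It only shows the shape of the thing:
free pages sit on one of two lists, the idle loop moves pages from the
"unknown contents" list to the "zeroed" list until there is real work
to do, and a request for a zeroed page counts as a hit if a pre-zeroed
page is available and a miss otherwise.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096
#define DEMO_NPAGES    8

/* Hypothetical page descriptor; not the kernel's real page structure. */
struct page {
	struct page *next;
	unsigned char data[DEMO_PAGE_SIZE];
};

static struct page pool[DEMO_NPAGES];
static struct page *free_dirty;		/* free pages, contents unknown */
static struct page *free_zeroed;	/* free pages, already zeroed */
static unsigned zero_hits, zero_misses;

/* Stand-in for "is anything runnable?"; a real idle loop would ask
 * the scheduler instead of reading a flag. */
static bool work_pending;

static void
push(struct page **list, struct page *pg)
{
	pg->next = *list;
	*list = pg;
}

static struct page *
pop(struct page **list)
{
	struct page *pg = *list;

	if (pg != NULL) {
		*list = pg->next;
		pg->next = NULL;
	}
	return pg;
}

/* Freed pages have unknown contents, so they go on the dirty list. */
static void
page_free(struct page *pg)
{
	push(&free_dirty, pg);
}

/* Called from the idle loop: zero pages until we run out of dirty
 * pages or something else needs the CPU. */
static void
idle_zero_pages(void)
{
	struct page *pg;

	while (!work_pending && (pg = pop(&free_dirty)) != NULL) {
		/* Assumption: a real port would zero uncached here. */
		memset(pg->data, 0, DEMO_PAGE_SIZE);
		push(&free_zeroed, pg);
	}
}

/* Allocate a zeroed page, preferring one the idle loop already did. */
static struct page *
page_alloc_zeroed(void)
{
	struct page *pg = pop(&free_zeroed);

	if (pg != NULL) {
		zero_hits++;		/* no zeroing cost at fault time */
		return pg;
	}
	pg = pop(&free_dirty);
	if (pg != NULL) {
		zero_misses++;		/* have to zero it right now */
		memset(pg->data, 0, DEMO_PAGE_SIZE);
	}
	return pg;
}
```

In this toy model, a workload that allocates back-to-back never lets
idle_zero_pages() run, so every request is a miss; a workload with gaps
between allocations finds the zeroed list populated and gets hits.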

Enjoy!

-- 
        -- Jason R. Thorpe <thorpej@zembu.com>