Subject: bug in wired accounting [was Re: Can't lock even 4GB on system with 8GB RAM?]
To: None <tech-kern@netbsd.org, tls@rek.tjls.com>
From: Jonathan Stone <jonathan@Pescadero.dsg.stanford.edu>
List: tech-kern
Date: 01/15/2006 13:24:29
In private email (message <20060114232542.GA21291@panix.com>),
Thor Lancelot Simon writes:
>On Sat, Jan 14, 2006 at 01:42:55PM -0800, Jonathan Stone wrote:
>> 
>> Hi Thor,
>> 
>> Did you get my equally (no, *more*) frustrated reply and patch?  That
>> patch is, as noted, gross, [...]
>
>It does seem to work.  It does seem gross. :-)

It's worse than that; it works for only *one* process.

Suppose that one has an application that needs to feed reams of data
through a 4GiB lookup table, perhaps partitioning the input into
distinct outputs based on the lookup.

One really *has* to mlock() the lookup data into memory, because our
VM page-scanner is *pitifully* weak for applications like this.  (I
measured 7+ hrs elapsed time on NetBSD-3.0/amd64, non-mmap()ed, vs. 2
hrs on Linux, without any special mlock() in the Linux code.)

Now, suppose one has a dual-processor system (e.g., two socket-940,
single-core amd64 CPUs; extension to 4 or 8 CPUs is an exercise for
the reader's wallet).  Each copy of the application will attempt to
mmap() and lock the lookup table.

But uvm_mmap() tries to apply the check

	(mapped size + uvmexp.wired) <= uvmexp.wiredmax

without checking whether the region is *already* locked!

So if I set wiredmax to 4GB on an 8GB machine and want to run multiple
copies of my app, each of which issues mmap() with MAP_WIRED, then
only the *first* copy of the app will succeed: the kernel is too dense
to check whether the region being mapped is already mapped and locked,
and thus already accounted for:


	if ((flags & MAP_WIRED) != 0 || (map->flags & VM_MAP_WIREFUTURE) != 0) {
		if (atop(size) + uvmexp.wired > uvmexp.wiredmax ||
		    (locklimit != 0 &&
		     size + ptoa(pmap_wired_count(vm_map_pmap(map))) >
		     locklimit)) {
			vm_map_unlock(map);
			uvm_unmap(map, *addr, *addr + size);
			return ENOMEM;
		}


wiredmax is system-wide.  So for a shared region being mapped and
locked multiple times, the code should instead compute something like:

	 lockdelta = atop(size) - pmap_wired_count(vm_map_pmap(map))

and use lockdelta as the appropriate size limit for wiredmax.

I can see arguments either way for the per-process limit, but since
wiredmax is the documented "system-wide" limit, wiredmax checks should
apply to the actual, system-wide locked memory, i.e., the delta.

It seems to go without saying that mmap() with MAP_WIRED should behave
the same way, w.r.t. pagelocking, as mmap() without MAP_WIRED followed
by an mlock() of the mapped region; ditto mlockall(), if and when part
of the address space in question is already locked.


Comments? Particularly from our uvm experts?