tech-kern archive


Re: ATF t_mlock() babylon5 kernel panics



> On Mar 12, 2019, at 9:09 PM, Robert Elz <kre%munnari.OZ.AU@localhost> wrote:
> 
> The first issue I noticed is that t_mlock() apparently believes
> the malloc(3) man page, which states:
> 
>     The malloc() function allocates size bytes of uninitialized memory.  The
>     allocated space is suitably aligned (after possible pointer coercion) for
>     storage of any type of object.
> 
> and in particular, those last few words.   The "any type of object" that
> t_mlock wants to store is a "page" - that is a hardware page.

The test employs a bogus understanding of how malloc() is specified.  On x86, malloc() should return memory that is 16-byte aligned because that is the maximum alignment requirement of the fundamental types used by the compiler.
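
For illustration, here is a throwaway user-space check of what the C implementation actually guarantees (the exact number varies by platform, but on x86 it is 16, nowhere near a page):

<snip>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
        void *p = malloc(1);

        /*
         * malloc() only promises alignment suitable for any fundamental
         * type, i.e. alignof(max_align_t) -- not a hardware page.
         */
        printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));
        printf("malloc(1) returned %p\n", p);
        free(p);
        return 0;
}
</snip>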

>   It obtains
> the size of that using:
> 
> 	page = sysconf(_SC_PAGESIZE);
> 
> and then does
> 
>        buf = malloc(page);
> 
> and if buf is not NULL (which it does check) assumes that it now
> has a correctly page aligned page sized block of memory, in which
> it can run mlock() related tests.
> 
> Something tells me that the "any type of object" does not include this
> one, and that t_mlock should be using posix_memalign() instead to allocate
> its page, so it can specify that it needs a page aligned page.

Correct.  Or mmap() (which always returns page-aligned pointers).
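
Something like this (just a sketch of a helper the test could use; error handling abbreviated):

<snip>
#include <sys/mman.h>
#include <stdlib.h>

/*
 * Sketch only: two ways for t_mlock to obtain a page-aligned,
 * page-sized buffer instead of trusting malloc()'s alignment.
 */
static void *
alloc_page(size_t page)
{
        void *buf;

        /* Option 1: posix_memalign() lets the caller name the alignment. */
        if (posix_memalign(&buf, page, page) == 0)
                return buf;

        /* Option 2: anonymous mmap() is always page-aligned. */
        buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        return buf == MAP_FAILED ? NULL : buf;
}
</snip>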

> Again, I am not proposing fixing the test until the kernel issues
> are corrected, but it would be good for someone who knows what alignment
> malloc() really promises to return (across all NetBSD architectures)
> to rewrite the man page to say something more specific than "any type of
> object" !

I've also seen the term "fundamental object" used.  One has to remember that malloc() is specified by the C standard, and C has no notion of "pages" or any other such silliness that we Unix people assume are fundamental :-)

> NetBSD's mlock() rounds down, so regardless of the alignment of the
> space allocated, the mlock() tests should be working (the locking might
> not be exactly what the test is expecting, but all it is doing is keeping
> pages locked in memory - exactly which pages, this test does not really
> care about).

POSIX specifically states that mlock() *may* require that the address be page-aligned ... Our implementation does not require this:

<snip>
        /*
         * align the address to a page boundary and adjust the size accordingly
         */

        pageoff = (addr & PAGE_MASK);
        addr -= pageoff;
        size += pageoff;
        size = (vsize_t)round_page(size);
</snip>

That is to say, the intent of our implementation is to wire the page where the range begins through the page where the range ends.  Note that internally, UVM represents all ranges as [start, start+size) (assuming start and size are page aligned / rounded).
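
To make the rounding concrete, here is a small user-space sketch (it relies on NetBSD's round-down behavior shown above; POSIX would let another system fail the same call with EINVAL):

<snip>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        /*
         * Misaligned start, odd length: the kernel rounds the address
         * down and the size up, so the single underlying page is wired.
         */
        if (mlock(buf + page / 3, page / 5) == 0)
                (void)munlock(buf + page / 3, page / 5);

        (void)munmap(buf, page);
        return 0;
}
</snip>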


> On my test setup, the kernel did not panic.   It does however experience
> some other weirdness, some of which is also apparent in the babylon5
> tests, and others which might be.
> 
> My test system is an amd64 XEN DomU - configured with no swap space, and
> just 1GB RAM.   It typically has a tmpfs mounted limited to 1/2 GB
> (actually slightly more than that - not sure where I got the number from,
> there may have been a typo... the -s param is -s=537255936 in fstab).
> That oddity should be irrelevant.
> 
> The first thing I noticed was that when I run the t_mlock test in this
> environment, it ends up failing when /tmp has run out of space.   And I
> mean really run out of space, in that it is gone forever, and nothing I
> have thought of so far to try gets any of that space back again.

And there are no files there?  Even an open-unlinked file should disappear when the offending process exits.

> I assume that jemalloc() (aka malloc() in the test) is doing some kind
> of mmap() that is backed by space on /tmp and continually grabbing more
> until it eventually runs out, and that the kernel never releases that
> space (even after the program that mapped it has exited).   That seems
> sub-optimal, and needs fixing in the kernel, anonymous mmap's (or whatever
> kind jemalloc() is doing) need to be released when there are no more
> processes that can possibly use them.

Well, note that tmpfs also uses anonymous memory.  Is it that "df" on the tmpfs is really showing a bunch of space allocated to the tmpfs?

> I did not try umount -f (easier to just reboot...) but a regular umount
> failed (EBUSY) even though there was nothing visibly using anything on
> /tmp (and I killed every possible program, leaving only init - and yes,
> that did include the console shell I use to test things).
> 
> Umounting the tmpfs before running the t_mlock test worked fine (which also
> illustrates that none of the very few daemon processes, nor the shell, etc,
> from my login, are just happening to be using /tmp - and that it is the
> results of the malloc() calls from t_mlock that must be the culprit.)
> (While ATF is running, it would be using /tmp as both its working
> directory, and for output file storage, but after the test fails, all of
> those processes are gone).
> 
> With an unmounted /tmp that issue (overflowed /tmp) was gone - but that
> might just be because the (single) filesystem that everything lives on
> in this test environment is fairly large, and very little of it is used
> by the fairly bare-bones NetBSD that is installed in it (it is plenty
> big enough to store all of the temporaries and results of a release build
> when I use it to run that kind of test.   That hasn't been needed (by me)
> for a while, and I clean it out and do whole new fresh builds from time
> to time, so right now it is only about 3% occupied)
> 
> Next issue: the kernel prints lots of
> 
> [  41.2801543] pmap_unwire: wiring for pmap 0xffffc98002f6b5c0 va 0x7f7ff7ef1000did not change!
> 
> (note I do not yet have gson's recent change to insert the missing space...)
> 
> That happens during the mlock_clip subtest, which passes.
> 
> After allocating a buffer (as above), this test is wholly
> 
>        for (size_t i = page; i >= 1; i = i - 1024) {
>                (void)mlock(buf, page - i);
>                (void)munlock(buf, page - i);
>        }
> 
> which isn't really much of a test, as it doesn't even attempt to
> check the results of the mlock/munlock.   "page" is the page size.
> (4KB, 8KB or 16KB or whatever it is on an x86_64 (Intel branded processor)).

We use the basic 4K page size.

> From the number of kernel messages, I'm assuming one from each of
> those mlock/munlock calls (possibly excepting the first, but because
> a similar buffer was malloc()/free()'d in the previous sub-test, it
> is conceivable that the page started out locked - that is, if malloc()
> returned the same page as last time).
> 
> The buffer is free()'d after that loop ends (of course).
> 
> First question is why the kernel feels a need to printf in this
> situation, and what that printf actually means.    Anyone ?

In a nutshell, that printf says that UVM thought the page was wired, but the pmap layer did not.

Longer version:

When UVM is wiring a range into memory (as one would wire a license plate to the bumper of one's farm work truck), it informs the pmap layer that it considers this page "wired down" and that the pmap layer should take care not to evict the mapping on its own if the resources used to map that page are needed to satisfy another mapping at a later time (the Sun3 MMU comes to mind).  It does this by calling pmap_enter() with the flag PMAP_WIRED.  The pmap makes note of this in an implementation-specific way; the x86 pmap uses a bit in the PTE that the hardware ignores.  Various statistics used to enforce resource limits are also updated.

When UVM no longer considers the page "wired down", it informs the pmap layer of this by calling pmap_unwire().  It's the pmap's job to update its own internal accounting, statistics, etc.

What this message is telling you is that there's an inconsistency between what UVM thinks about the page and what the pmap layer thinks about the page.  There are a number of possibilities, including, but not limited to:

	-- pmap_enter(..., PMAP_WIRED) was called on the wrong page.

	-- pmap_unwire() was called on the wrong page.

> 
> Next issue: the mlock_err() test fails - this is a test of mlock()
> and munlock() calls that are supposed to fail.   All the simple ones do
> (attempting to mlock(0) or mlock(-1), for example, or to munlock() those).
> 
> Then the test tries something that POSIX says may fail, but which we
> do not cause to happen
> 
> 	mlock((void *)(((uintptr_t)buf) + page/3), page/5);
> 	munlock((void *)(((uintptr_t)buf) + page/3), page/5);
> 
> The test includes code to verify that both of those return 0.  That
> has recently started failing (ie: this test used to work).
> 
> My guess (and that is all it is for now) is that "buf + page/3" might,
> with the new malloc(), no longer be within the same hardware page as "buf"
> and so when rounded down to a page boundary, is giving an address in
> the next page, rather than the "one true page" that the test is assuming
> it has allocated, and that that page is outside the space that has been
> mmap()'d by jemalloc(), and consequently does not exist in the process
> address space, resulting in an error.   That error should be ENOMEM but
> the test doesn't bother to reveal that info.   Anyway, for this one I
> am assuming just the test bug (bad assumption about malloc()) which will
> go away when the test is fixed eventually.
> 
> Next test is the mlock_limits() test - as a guess (and again, this is
> just a guess) it looks as if this might be where the loss of space in
> the tmpfs tests happens.
> 
> This test does the standard buf = malloc(page), then forks (so both
> processes should have that "page" allocated - via non-shared mmap()'d
> space, I am presuming).    The parent does nothing but wait for the
> child, which uses its exit status to communicate whether the test
> passed or failed (and the parent gets back "passed" (ie: exit(0) from
> the child).   The parent then does free(buf) and is finished.
> 
> The child manipulates its RLIMIT_MEMLOCK constantly setting it to 1
> byte less than the amount it is going to mlock() in a loop like
> 
>                for (ssize_t i = page; i >= 2; i -= 100) {
> 
>                        res.rlim_cur = i - 1;
>                        res.rlim_max = i - 1;
> 
>                        if (setrlimit(RLIMIT_MEMLOCK, &res) != 0)
>                                _exit(EXIT_FAILURE);
> 
>                        if (mlock(buf, i) != -1 || errno != EAGAIN) {
>                                (void)munlock(buf, i);
>                                _exit(EXIT_FAILURE);
>                        }
>                }
> 
> (in which for brevity here, I have eliminated a call to fprintf()
> and setting errno to 0 before the mlock(), though that should not
> be necessary - errno is only tested if mlock() fails, in which case
> it should have been set to the reason why it failed.   EAGAIN (according
> to mlock(2)) means the locking "would exceed either the system or
> per-process limit for locked memory" - where here the latter is intended
> to happen.)
> 
> Since this test passes, one can assume that every call of mlock() does
> return -1, and in every case, errno gets set to EAGAIN.   That's good.
> 
> Note that the test does no munlock() calls, nor does it free the page
> buffer ("buf") - it simply exits when all has succeeded (or failed,
> though if that happened, it would munlock() - or attempt to).

mlock() starts by splitting vm_map entries if necessary, and then walking the list of entries that cover the range.  As it walks them, it increases the "wire count" for each map entry.  On the 0->1 edge, it then wires all of the pages backing that map entry.  In the case of anonymous memory, this will be an amap, which references a set of anons that actually own the pages.  All of this is handled by uvm_map_pageable().

The wiring case is done in two passes (because of complicated locking considerations):

	-- Pass 1 creates any amaps that might be needed and bumps the wire count for each entry it visits.

	-- Pass 2 calls uvm_fault_wire() for each entry it visits whose wire count is 1 (i.e. it went from 0->1 in Pass 1), to bring the page in from its backing store (file or zfod) and stuff it into the MMU with the "wired" bit set.
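
As a toy illustration of that structure (emphatically not UVM code, just the shape of the two passes and the 0->1 edge):

<snip>
#include <stdio.h>

/* "Entries" here stand in for vm_map entries covering the range. */
struct toy_entry {
        int wired_count;        /* how many wirings currently cover this entry */
};

static void
toy_wire_range(struct toy_entry *e, int n)
{
        int i;

        /*
         * Pass 1: bump the wire count on every entry in the range
         * (in UVM this is also where missing amaps would be created).
         */
        for (i = 0; i < n; i++)
                e[i].wired_count++;

        /*
         * Pass 2: only entries whose count went 0 -> 1 need their pages
         * faulted in and entered into the MMU with the "wired" bit set.
         */
        for (i = 0; i < n; i++)
                if (e[i].wired_count == 1)
                        printf("wire pages backing entry %d\n", i);
}

int
main(void)
{
        struct toy_entry entries[3] = { { 0 }, { 1 }, { 0 } };

        toy_wire_range(entries, 3);     /* entry 1 was already wired */
        return 0;
}
</snip>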

> 
> Also note that nowhere does the test ever write to, or for that matter,
> read from, the "buf" that was allocated in the parent, then inherited
> over the fork() (there is just one fork() - no loops around that).
> That is, there would be no reason for the kernel to have made a copy
> of the pages via CoW.    I'm assuming (guessing, in fact, a very very wild
> guess) that this is significant in the "full tmpfs" issue - as it would
> be quite rare, I presume, for a process with mmap'd anonymous pages
> to fork, and then never touch the pages at all (just attempt to manipulate
> the kernel state for them via mlock()).   Whether the mlock() actually does
> anything at all (changes any kernel state) is unknown - it shouldn't.

POSIX's language only requires that an mlock()'d region remain "memory-resident".  Our man page includes the language:

<snip>
After an mlock call, the indicated pages will cause neither a non-
resident page nor address-translation fault until they are unlocked.
They may still cause protection-violation faults or TLB-miss faults on
architectures with software-managed TLBs.
</snip>

This implies that CoW is not resolved when the page is wired, but it's been an awfully long time since I dabbled in UVM, so I honestly don't remember if we pre-CoW those or not.

> 
> I am presuming that this is where the tmpfs() is all consumed (though I
> have no idea why it would be all - and it is not impossible that happened
> in one of the earlier tests) as the next test mlock_mmap doesn't use the
> normal paradigm for allocating memory, instead it 
> does
> 
>        buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
>            MAP_ANON | MAP_PRIVATE | MAP_WIRED, -1, 0);
> 
> (written slightly differently, but that is what happens).
> 
> After the jemalloc() changes, this mmap() call fails (the test
> goes on to check buf != MAP_FAILED, and that check fails).
> 
> This should indicate that there is no more available anon memory.

**OR** it indicates that the limit on wired memory has been exceeded.
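
A quick user-space check to tell the two apart (a sketch; I would expect EAGAIN for the wired-memory limit and ENOMEM for exhausted anonymous memory, but check mmap(2) rather than taking my word for it):

<snip>
#include <sys/mman.h>
#include <sys/resource.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        struct rlimit rl;
        void *buf;

        /* How much wired memory is this process allowed? */
        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
                printf("RLIMIT_MEMLOCK: cur=%ju max=%ju\n",
                    (uintmax_t)rl.rlim_cur, (uintmax_t)rl.rlim_max);

        buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE | MAP_WIRED, -1, 0);
        if (buf == MAP_FAILED) {
                printf("mmap(MAP_WIRED) failed: errno=%d\n", errno);
                return 1;
        }
        (void)munmap(buf, page);
        return 0;
}
</snip>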

> What the rest of this test would do is irrelevant, as it doesn't do it.
> 
> The final test case is mlock_nested which does the normal buf = malloc(page)
> which succeeds (presumably returning the same page as every other time)
> 
> The test then runs a loop (100 times) doing mlock(buf, page)
> (without changing the parameters, or doing munlock()) each time.
> This verifies that a locked page is allowed to be locked again
> (after the first iteration, which initially locks it).
> 
> If one of those mlock() calls fails, the test fails, and does not
> munlock() the page presumably locked the first time around, nor
> does it free the buffer.   But since this is the end, the whole
> test will exit at this point.

...which should unwire all of the pages because the entire vm_map is torn down.

> This test (after jemalloc()) also fails, though again the test does
> not tell us what error it received, just that mlock() did not return 0.
> It also does not reveal which iteration of the loop it failed upon.
> 
> While I will not fix the bugs in the test that might alter the way
> it interacts with the kernel, I will update the test to get more
> information from the tests that fail (and check results that are
> currently not checked and make them available - but so as not to
> alter what happens, merely by output to stderr).
> 
> I'd appreciate it if someone who understands how the mappings work
> could make the worst possible assumptions about what jemalloc() is
> actually doing, or not doing, and see if the kernel bugs can be
> explained, including the crash that happens during the mlock_clip()
> test.

My guess is that jemalloc is simply returning non-page-aligned regions (I have not read the code).  Was our previous allocator an older jemalloc, or was it phkmalloc?
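
That guess is easy to check without reading the allocator (a throwaway sketch):

<snip>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        void *buf = malloc(page);

        /*
         * A non-zero offset means malloc(page) is not page-aligned,
         * and the test's "one true page" assumption is broken.
         */
        printf("malloc(%zu) = %p, offset within page = %zu\n",
            page, buf, (size_t)((uintptr_t)buf % page));
        free(buf);
        return 0;
}
</snip>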

> There, in the i386 tests, the kernel writes just one of the
> printf messages about unwiring, then panics...
> 
> [ 6821.7909726] pmap_unwire: wiring for pmap 0xc1b9a000 va 0xac284000 did not change!
> [ 6821.7909726] panic: kernel diagnostic assertion "pg->wire_count > 0" failed: file "/tmp/bracket/build/2019.03.12.18.46.20-i386/src/sys/uvm/uvm_page.c", line 1594

THAT is particularly special, because the code in question is:

<snip>
void
uvm_pagewire(struct vm_page *pg)
{
        KASSERT(mutex_owned(&uvm_pageqlock));
#if defined(READAHEAD_STATS)
        if ((pg->pqflags & PQ_READAHEAD) != 0) {
                uvm_ra_hit.ev_count++;
                pg->pqflags &= ~PQ_READAHEAD;
        }
#endif /* defined(READAHEAD_STATS) */
        if (pg->wire_count == 0) {
                uvm_pagedequeue(pg);
                uvmexp.wired++;
        }
        pg->wire_count++;
        KASSERT(pg->wire_count > 0);    /* detect wraparound */
}
</snip>

What I would glean from that combination of messages:

From a previous iteration through the test loop, a non-wired region was unwired, the accounting was messed up, and uvm_pageunwire() was called on a page that was not, in fact, wired.  This causes the assert to fire the next time the page is wired -- it hasn't really wrapped around, but it underflowed!  Alas, uvm_pageunwire() does not have underflow detection:

<snip>
void
uvm_pageunwire(struct vm_page *pg)
{
        KASSERT(mutex_owned(&uvm_pageqlock));
        pg->wire_count--;
        if (pg->wire_count == 0) {
                uvm_pageactivate(pg);
                uvmexp.wired--;
        }
}
</snip>

I would suggest adding:

	KASSERT(pg->wire_count != 0);

Immediately before:

	pg->wire_count--;

...and running the test again.  Then hopefully we can catch the bogus unwire in the act!
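
For what it's worth, the failure mode is easy to model in user space.  (This sketch assumes wire_count behaves like an unsigned 16-bit counter; the real field width in struct vm_page is an assumption on my part.)

<snip>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        /* Stand-in for pg->wire_count; assumed 16-bit unsigned here. */
        uint16_t wire_count = 0;

        wire_count--;   /* bogus unwire of a non-wired page: 0 -> 65535 */
        printf("after bogus unwire: %u\n", wire_count);

        wire_count++;   /* next legitimate wire: 65535 -> 0 */
        printf("after next wire:    %u\n", wire_count);

        /*
         * KASSERT(pg->wire_count > 0) fires at this point even though
         * the count never legitimately climbed anywhere near the limit;
         * the underflow happened first.
         */
        return 0;
}
</snip>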

> [ 6821.7909726] cpu0: Begin traceback...
> [ 6821.7909726] vpanic(c107f760,c9867d30,c9867d44,c08e21ce,c107f760,c107f69f,c11314a1,c11341b8,63a,c9867de4) at netbsd:vpanic+0x12d
> [ 6821.8109974] kern_assert(c107f760,c107f69f,c11314a1,c11341b8,63a,c9867de4,c9867d80,c08cfb14,c183ee98,ac284000) at netbsd:kern_assert+0x23
> [ 6821.8310116] uvm_pagewire(c183ee98,ac284000,6d44000,3,33,3,ac284000,0,c1b9a000,c9867e01) at netbsd:uvm_pagewire+0x80
> [ 6821.8511062] uvm_fault_upper_enter.isra.5(c2479490,c183ee98,c2479490,1,c08e5405,c183ee98,c9867e40,cfa787f1,c9867e28,c9867e28) at netbsd:uvm_fault_upper_enter.isra.5+0x1dd
> [ 6821.8511062] uvm_fault_internal(c221a9f4,ac284000,3,3,c221aa04,c22949ac,c221aa24,ac285000,c9867f0c,c08d8ab6) at netbsd:uvm_fault_internal+0x1500
> [ 6821.8710371] uvm_fault_wire(c221a9f4,ac284000,ac285000,3,1,0,58,c221aa24,c22949ac,1000) at netbsd:uvm_fault_wire+0x4f
> [ 6821.8710371] uvm_map_pageable(c221a9f4,ac284000,ac285000,0,0,ac285000,c2391d00,c9867fa8,c138899c,c9867f9c) at netbsd:uvm_map_pageable+0x1bc
> [ 6821.8911479] sys_mlock(c2391d00,c9867f68,c9867f60,c21985e4,c9865000,c9867f60,cb,c9867f68,0,0) at netbsd:sys_mlock+0xb9
> [ 6821.9110634] syscall() at netbsd:syscall+0x143
> [ 6821.9110634] --- syscall (number 203) ---
> [ 6821.9310749] ac2d0257:
> [ 6821.9310749] cpu0: End traceback...
> 
> 
> (Aside: note this is a recent test run and includes gson's missing
> space replacement fix!)
> 
> Also, an explanation of what might be hanging onto all that tmpfs memory
> after everything should have exited.
> 
> Thanks all,
> 
> kre
> 

-- thorpej


