Re: ATF t_mlock() babylon5 kernel panics

To: Jason Thorpe <thorpej%me.com@localhost>
Subject: Re: ATF t_mlock() babylon5 kernel panics
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Thu, 14 Mar 2019 19:44:32 +0700
    Date:        Wed, 13 Mar 2019 11:44:51 -0700
    From:        Jason Thorpe <thorpej%me.com@localhost>
    Message-ID:  <9A2A4A34-35B0-490E-9A92-AAB44174FD66%me.com@localhost>


  | I would suggest instrumenting-with-printf the "new_pageable"
  | case of uvm_map_pageable()

That didn't show much we didn't already know.

Turns out the problem is in the other side, the mlock() case,
that's where things go bad, then the munlock() falls foul of
the corrupted setup.   I haven't tested it yet, but I will,
but I am fairly sure that simply doing munlock(addr, 0) when
there has been no mlock() will cause no problems at all.

I am less sure about the case
	mlock(addr, 100);  munlock(addr+50, 0);
but I will investigate that one too.

What happens in the mlock(addr, 0) case is this:

uvm_map_pageable() starts by validating the start and end addresses
(ed6000 for both, since size == 0 ... that is the real address from
the test, but with all the upper bits dropped for the e-mail, they're
irrelevant).

uvm_map_lookup_entry() finds the entry for the start address, which
in this case is ecd000..eff000 (start..end) with wired_count == 0.

Since this is the !new_pageable case we drop to pass 1 of the "do wiring"
code.  There entry != &map->header (the end ... that's impossible in
the first iteration I think) and entry->start (ecd000) < end (ed6000)
so we enter the loop.

The map isn't wired (wired_count == 0) so we do the test to see
if an amap_copy() is needed.   It isn't, fortunately, or I'd have
to go learn that that's all about...


Then we get to the first interesting bit;

In general, if we have a map entry covering addresses a..b and
we want to lock a range x..y where a<x<y<b what happens is that
the two calls

                UVM_MAP_CLIP_START(map, entry, start);
                UVM_MAP_CLIP_END(map, entry, end);

split the map entry into three.   The first line makes a..x and x..b
with "entry" becoming x..b, and then the second divides that one,
into x..y and y..b with entry becoming x..y

  [Aside here, we know a<=x that's what the initial uvm_map_lookup_entry()
   did for us, and we know a<b (a property of the map entries) and x<=y (I
   think we know that, I did not yet look to see if something checks for a
   negative length - or a very big one which would cause wrap around of the
   end=start+length addition.   Nothing else is known.]
  [2nd aside added later:  I did check for a wraparound, check,
   VM_MAP_RANGE_CHECK() forces start = end in that case.   I think that's
   wrong, it should be an error, not a "wire nothing" which currently breaks
   things, but with the suggested changes way way down below, would result
   in a success return.   If we were to keep what VM_MAP_RANGE_CHECK() does,
   it should be "end = start" rather than the opposite which it does now.
   It is "start" that the sys call requested, end was computed from the size.]

In all of these ignore my fudging of the border addresses - nothing
is really in two maps (or to be precise, that's how it is represented,
but the start..end means addresses start .. end-1 -- none of this is
relevant.)

That's what is intended to happen - if it happened that we had a==x or y==b
we'd just get two map entries, and if both a==x and y==b nothing would be
split and the entry we started with is the one that needs wiring or
unwiring.

But nothing here considered the possibility that x==y which is what we have.

In this case, the UVM_MAP_CLIP_START divides the entry into two, a..x and x..b
just as above, as x != a, and that's fine.

But the UVM_MAP_CLIP_END (just like UVM_MAP_CLIP_START) does

        if ((VA) > (ENTRY)->start && (VA) < (ENTRY)->end) { \
                uvm_map_clip_end(MAP,ENTRY,VA); \
        } \

(that's a macro definition, hence the \'s).   Here ENTRY->start == x,
ENTRY->end == b, and VA == y.   But x == y, so VA == x too.

Then since VA == ENTRY->start the condition fails, and no splitting
(clipping) is done.

So, now we have two entries  [ecd000..ed6000] and [ed6000..eff000]
with "entry" (the one we're going to work on) being the second of
those (it is the one that contains the addresses we want to lock)
whereas the first is now irrelevant, untouched, and we can (and will)
simply forget about.   The second one gets its wired_count incremented
(which in our case moves it from 0 to 1.)

That's where things really start to go wrong, as that's supposed to
mean that the addresses [ed6000..eff000] are wired.   That might be
OK if we actually wired them (yet to come) but it is not what was
requested.

There is then a check for holes, not really sure what that's about,
but if it happens we get an error (I'm guessing it indicates that we
have the situation a<x<b<y and there is no later map entry covering
the b..y address range).   Never mind, that doesn't happen here, so
we can ignore all of that as well.

From this we go onto the next map entry, which in our case isn't
interesting (cannot possibly be: either there are no more, or the
next has a value for its ->start which is >= eff000, which is > end
(ed6000) so one way or the other, the loop that is pass1 terminates.

On to pass 2.

The entry we start at remains the [ed6000..eff000] which now has
wired_count == 1.

The pass2 loop is controlled by

        while (entry != &map->header && entry->start < end) {

The first part of that is not an issue, and like in pass1, cannot
ever happen the first time around.   But:  entry->start == ed6000,
and end == ed6000, entry->start is not < end, and so we never
enter the loop.

Oops.

That loop is what actually makes the pages wired, and we're doing
none of that.

Pass2 loop "done", rv remains 0 (it only gets set inside the loop,
where we never go) so we simply return.

mlock() "succeeded".

And now, for the lead in to the panic (new KASSERT failed) when
we come to munlock()...  (but of course, it is not simple!)

We enter uvm_map_pageable() again, with start==end==ed6000 since
we are doing an munlock(same addr as locked, 0).

As nothing else has affected this process, uvm_map_lookup_entry() finds
the entry we were just working on [ed6000..eff000] - the entry that
contains the start addr (just the same as last time).  But now that
entry has wired_count == 1 (which happened just a bit earlier in this
message.)

The new_pagable case starts with

	    UVM_MAP_CLIP_START(map, entry, start);

to strip off any leading section of the map entry, that we are
not unlocking, but here start == entry->start so the unwire begins
at the beginning of the map entry.  That call changes nothing.

As with the wiring case, there are 2 loops, loop 1

          while ((entry != &map->header) && (entry->start < end)) {  

Just like before the first condition is impossible for the first
iteration, but for the second, entry->start == end.

Loop done before it started.   Doesn't matter, all that loop does
is validity checking on the range of addresses to be unlocked (just
like the "hole" check above I think).   Our address range is valid
(well, kind of).

On to the second loop.

          while ((entry != &map->header) && (entry->start < end)) {

Well, that's boring.   Another loop that does nothing.

munlock() is finished, successfully.   Note that nothing changed the
entry's wired_count - it remains 1, even though it has nominally been
unlocked.

Oops.

The wired count would have been set to 0 in uvm_map_entry_unwire()
which is called in the second unwire loop, if the entry indicated it
was wired (wired_count > 0).   But we never went there.

Note no panic yet...

Next thing the test does is  mlock(buf, 1024); munlock(buf, 1024).

I did not get diagnostics from these (all my debug output was
conditioned on start==end so I wouldn't be swamped by noise, I
considered that an unlikely enough case that I'd get to see what
I was looking for and nothing else .. which worked.)

But even without seeing what actually happened, it is fairly
obvious from reading the code and knowing what numbers are being
used...

mlock() calls uvm_map_pageable() again, this time start==ed6000
and end==ed7000 (not ed6400 as the size is rounded up to a page).

uvm_map_lookup_entry() finds the entry containing the start addr,
our good friend [ed6000..eff000] which has retained its wired_count==1

Pass 1 of the wired case.

        while ((entry != &map->header) && (entry->start < end)) {

entry->start < end, so we enter the loop.   This time the entry
is marked as already wired (though it is not) so we skip that
amap() required test completely.

                UVM_MAP_CLIP_START(map, entry, start);
                UVM_MAP_CLIP_END(map, entry, end);

We do that again.   This time we have a == x < y < b
so the UVM_MAP_CLIP_START() does nothing, but the UVM_MAP_CLIP_END()
does divide the entry into two  [ed6000..ed7000] and [ed7000..eff000]

Both have wired_count == 1 (that isn't changed by the map entry split.)
"entry" is the first of them, that's the one that contains (in this case,
*is*) the range of addresses we're wiring.   Wiring those increases the
wired_count wich becomes 2 for that entry.  (Incidentally, I'm not sure
wy we bother counting the number of wirings, it just goes 1 2 3 ... as
we mlock() the region over and over again - and then drops to 0 as soon
as we munlock() - one munlock() undoes any number of mlock() calls on
the same address range.   Given that I'd expect that all we'd need is
a simple "is wired" flag, rather than a count (and avoid the issue of
wired_count wraparound that was raised as a possibility - would only
take 4 billion mlock() calls for that to happen (2 billion to become
negative, but nothing ever checks for that, so we can consider that it
is effectively unsigned) - that's probably only a few minutes of a
	 for(;;) mlock(addr, size);
loop...

But I haven't examined all the code to see if there is anything anywhere
which actually ever cares about wired_count being either 0 or 1.  [See below].
(nb: this is different than the lower level page wiring counts, those
are shared between processes I believe, and one process can munlock()
a shared page while another keeps the same page wired down ... but
to overflow one of those counters you'd need 64 thousand simultaneous
maps referring to the same locked page - which seems like a scenario
that would be difficult to fabricate.)

Anyway, back to pass 1...  After splitting the entry, and incrementing
wired_count of the first one (the one we know we are wiring down - the
other will be handled next iteration of the loop) we check for holes
again, still don't find one, and go onto the next loop iteration.

This time we know there's another map entry (we just made it) so we're
not at the end of the list, but entry->start == ed7000 and end == ed7000
so !(entry->start < end) and the loop terminates.   That's all correct.

Pass 2.

We revert to our start_entry (the [ed6000..ed7000] entry now), and

        while (entry != &map->header && entry->start < end) {

end remains ed7000 so entry->start < end (and as always, on the first
iteration, the initial test cannot be true) so we enter the loop.

                if (entry->wired_count == 1) {

it's not, it is 2 now.   OK, here is a case where wired_count > 1
is relevant.   So there are 3 values of interest, 0 (unwired), 1
(just made wired) and 2 (was already wired down, but we have
a request to do it again).   Values > 2 don't seem interesting however.

Anyway we don't enter that block of code -- that's the block that
actually arranges for the pmap's to be marked as wired, so we have
still done none of that.

Oops.

That's the whole of the pass2 loop, so we go onto the next entry.
This one is [ed7000..eff000] (wired_count == 1).   But here
entry->start (ed7000) == end (ed7000) so is not < end, so the
loop is done.

That's it, the mlock() has succeeded again, and still has not
actually wired down a single page (just made more of a mess of
the uvm map entries).

Next the test's call of munlock(addr, 1024);

Again, the size is rounded up to a page, so we enter uvm_map_pageable()
with start==ed6000 and end==ed7000

Again uvm_map_lookup_entry() finds the [ed6000..ed7000] entry, with
write_count == 2 now.   That's "correct".

The new-pagable case (unwiring) starts with

                UVM_MAP_CLIP_START(map, entry, start);

which has nothing to do this time, entry->start == start.

Then we enter the 1st unwire loop

                while ((entry != &map->header) && (entry->start < end)) {  

and here entry->start < end so we actually get inside this loop
this time.   But just like last time, all this loop does is validate
the args against the map, and they're fine.

Next iteration, the [ed7000..eff000] entry, entry->start == end, so
the loop is done.

Loop 2.

Back to the start_entry [ed6000..ed7000]

                while ((entry != &map->header) && (entry->start < end)) {

same loop condition, so just like last time, we either run both loops,
or neither, and this time it is both.

                        UVM_MAP_CLIP_END(map, entry, end);

nothing to clip (split) entry->end == end

                        if (VM_MAPENT_ISWIRED(entry))
                                uvm_map_entry_unwire(map, entry);

The entry is wired (wired_count == 2) so we call uvm_map_entry_unwire()
this time.

That function is

        entry->wired_count = 0;
        uvm_fault_unwire_locked(map, entry->start, entry->end);

(that's all of it).   So we set the entry's wired_count back to 0
and call uvm_fault_unwire_locked()

uvm_fault_unwire_locked() does a lookup on the start addr
(uvm_map_lookup_entry() same as earlier) whch will find our
familiar [ed6000..ed7000] entry, and then loops ...

        for (va = start; va < end; va += PAGE_SIZE) {

(that will loop exactly once, as our entry is 1 page long)

                if (pmap_extract(pmap, va, &pa) == false)
                        continue;

Find the physical addr.  That should work OK, the page exists.

Then we have some validity checking (KASSERT() - none of which
fire, so they're all OK) and fiddly locking stuff (which for now
we can assume "just works" - and is irrelevant to this essay.)

After that

                if (VM_MAPENT_ISWIRED(entry) == 0)
                        pmap_unwire(pmap, va);

Since uvm_map_entry_unwire() set wired_count for our entry to 0,
the condition is true and we call pmap_unwire().

That is, I believe, the source of the

pmap_unwire: wiring for pmap 0xffffaa80028e4980 va 0x7f7ff7ed6000 did not change!

message (note the ed6000) in the message.   pmap_unwire() is very confused
about being asked to make no changes.   But of itself, that seems to be
relatively harmless (noisy but harmless).

Next:
                pg = PHYS_TO_VM_PAGE(pa);
                if (pg)
                        uvm_pageunwire(pg);

pg != 0 (which I know, only because I know that uvm_pageunwire() gets
called) as that's where that new KASSERT() was added...

In it:

        KASSERT(pg->wire_count != 0);
        pg->wire_count--;

pg->wire_count == 0 - as in all of this, we have still never actually
wired a single page.    Before the KASSERT (or when running without
DIAGNOSTIC defined) the wire_count goes negative (really, wraps to 65535
since it is a uint16_t) and we get an even bigger mess than before,
which is what was leading to the consumption of all of the available VM
in a !DIAGNOSTIC kernel, so no KASSERT()  (after several more iterations
of mlock/munlook loops continually working on bogus data).

Anyway, that KASSERT is the panic().

I believe all of these problems go away if we simply do nothing in the
size == 0 case.   All the problems started with that first mlock(addr,0)
call which made a uvm map entry marked as wired, but wired no pages.

However, I think we can do better than nothing.   We need 2 changes I
believe.   First in sys_mlock() and sys_munlock() we change:

        size += pageoff;
        size = (vsize_t)round_page(size);

into

	if (size != 0) {
              size += pageoff;
              size = (vsize_t)round_page(size);
	}

So, and attempt to mlock 0 bytes remains an attempt to
mlock 0 bytes.   (o is guaranteed to be aligned).  If
we're locking nothing in the page, it doesn't really matter
which nothing it is.

This change isn't really required to fix the problem but does
avoid locking a page, when the program asked to lock nothing,
in cases where the addr is not page aligned.

I'd prefer doing it this way than

	if (size == 0)
		return XXX:

where XXX is either EINVAL or 0, depending on your state of
mind at the time, as (particularly if we decide size==0 is
harmless, and just return 0) we would lose all the checks
on the validity of addr.

We should probably add (after that block)

	if (end < start)
		return EINVAL;

(or some error code that this function is permitted to return).
That will handle the
	mlock(addr, 0xffffff000);
type problem (I didn't count the f's, but enough so that this is
just 1 (or 2) pages < the maximum possible size_t) and is better than
allowing VM_MAP_RANGE_CHECK() to simply reduce the range size to 0.

Then in uvm_map_pageable() after the:

        if (uvm_map_lookup_entry(map, start, &start_entry) == false) {
                if ((lockflags & UVM_LK_EXIT) == 0)
                        vm_map_unlock(map);

                UVMHIST_LOG(maphist,"<- done (fault)",0,0,0,0);
                return EFAULT;
        }

block, add

	if (start == end) {
		if ((lockflags & UVM_LK_EXIT) == 0)
                        vm_map_unlock(map);

                UVMHIST_LOG(maphist,"<- done (nothing)",0,0,0,0);
		return 0;
	}

I am guessing at the UVMHIST_LOG() stuff (copy/paste/edit...)

Does all of this sound plausible, and reasonable to do ?
Please do remember, that before this, I have never been
anywhere near uvm or anything pmap related, so have mercy!

kre
Follow-Ups:
- Re: ATF t_mlock() babylon5 kernel panics
  - From: Robert Elz
- Re: ATF t_mlock() babylon5 kernel panics
  - From: Jason Thorpe
References:
- Re: ATF t_mlock() babylon5 kernel panics
  - From: Jason Thorpe
- Re: ATF t_mlock() babylon5 kernel panics
  - From: Robert Elz
- Re: ATF t_mlock() babylon5 kernel panics
  - From: Jason Thorpe
- ATF t_mlock() babylon5 kernel panics
  - From: Robert Elz
- Re: ATF t_mlock() babylon5 kernel panics
  - From: Robert Elz
Prev by Date: Re: zsh crash in recent -current
Next by Date: strange libc symbol renaming issue
Previous by Thread: Re: ATF t_mlock() babylon5 kernel panics
Next by Thread: Re: ATF t_mlock() babylon5 kernel panics
Indexes:
Home | Main Index | Thread Index | Old Index