NetBSD-Bugs archive
Re: kern/52252: uvm_km_check_empty panic when loading any module
The following reply was made to PR kern/52252; it has been noted by GNATS.
From: Anthony Mallet <tho%netbsd.org@localhost>
To: gnats-bugs%NetBSD.org@localhost, kern-bug-people%netbsd.org@localhost
Cc:
Subject: Re: kern/52252: uvm_km_check_empty panic when loading any module
Date: Mon, 29 May 2017 13:26:51 +0200
On Wednesday 24 May 2017, at 23:52, Anthony Mallet wrote:
> It appears that backing out this change
[...]
> cvs rdiff -u -r1.21 -r1.22 src/sys/arch/amd64/conf/kern.ldscript
I made some progress on this.
To make a long story short, this patch fixes the issue for me.
If it looks OK, can somebody commit it? (See the discussion below for
a detailed explanation and other possible fixes.)
Index: sys/arch/amd64/amd64/locore.S
===================================================================
RCS file: /cvsroot/src/sys/arch/amd64/amd64/locore.S,v
retrieving revision 1.123
diff -u -r1.123 locore.S
--- sys/arch/amd64/amd64/locore.S 25 Mar 2017 15:07:21 -0000 1.123
+++ sys/arch/amd64/amd64/locore.S 29 May 2017 10:38:50 -0000
@@ -478,6 +478,9 @@
*/
cmpl $BTINFO_MODULELIST,4(%esi) /* btinfo_common::type */
jne bootinfo_copy
+ movl 8(%esi),%eax /* btinfo_modulelist::num */
+ testl %eax,%eax /* movl does not set the flags */
+ jz bootinfo_copy
/* Skip the modules if we won't have enough VA to map them */
movl 12(%esi),%eax /* btinfo_modulelist::endpa */
--- The long story ------------------------------------------------
I noticed that different combinations of kernel and hardware triggered
the issue in a seemingly random way, so I started to investigate why
changing kern.ldscript (as mentioned in my previous e-mail) sometimes
fixed it.
On both failing and OK machines, I noticed an inconsistency between
the following variables:
* __kernel_end: set by kern.ldscript, the end of the loaded image,
aligned on 2MB since the most recent commit to it.
* kern_end: set in sys/arch/amd64/amd64/machdep.c:1565:
/* End of the virtual space we have created so far. */
kern_end = (vaddr_t)atdevbase + IOM_SIZE;
'atdevbase' is computed in sys/arch/amd64/amd64/locore.S:871:
/* Relocate atdevbase. */
movq $(TABLESIZE+KERNBASE),%rdx
addq %rsi,%rdx
movq %rdx,_C_LABEL(atdevbase)(%rip)
When boot loads neither symbols nor modules, atdevbase is supposed to
be __kernel_end + tablesize (%rsi points to the beginning of the
BOOTSTRAP TABLES at this point); see e.g. the comment in
locore.S:592.
But in gdb, on failing kernels, I had for instance this:
(gdb) p/x &__kernel_end
$1 = 0xffffffff80c00000
(gdb) p/x kern_end
$3 = 0xffffffff80c13000
(gdb) p/x atdevbase
$2 = 0xffffffff80bb3000
(gdb) p/x atdevbase - tablesize
$6 = 0xffffffff80b8b000
^ before __kernel_end!
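As a cross-check, the gdb values from the failing kernel can simply be
subtracted from each other. The C sketch below only restates the
numbers printed above and relies on the kern_end assignment quoted
earlier (kern_end = atdevbase + IOM_SIZE). The macro and function
names are made up for illustration; they are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the gdb session on the failing kernel. */
#define KERNEL_END   UINT64_C(0xffffffff80c00000) /* &__kernel_end */
#define KERN_END     UINT64_C(0xffffffff80c13000) /* kern_end */
#define ATDEVBASE    UINT64_C(0xffffffff80bb3000) /* atdevbase */
#define TABLES_START UINT64_C(0xffffffff80b8b000) /* atdevbase - tablesize */

/* Size of the bootstrap tables implied by the two atdevbase values. */
uint64_t implied_tablesize(void)
{
    assert(ATDEVBASE >= TABLES_START);
    return ATDEVBASE - TABLES_START;
}

/* IOM_SIZE implied by kern_end = atdevbase + IOM_SIZE. */
uint64_t implied_iom_size(void)
{
    assert(KERN_END >= ATDEVBASE);
    return KERN_END - ATDEVBASE;
}

/* Signed distance from __kernel_end to the start of the tables. */
int64_t tables_start_minus_kernel_end(void)
{
    return (int64_t)(TABLES_START - KERNEL_END);
}
```

This gives a tablesize of 0x28000, an IOM_SIZE of 0x60000, and a
table start that lies 0x75000 bytes below __kernel_end.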
On an OK kernel, it was still inconsistent, although safe:
(gdb) p/x &__kernel_end
$1 = 0xffffffff80aa6000
(gdb) p/x atdevbase - tablesize
$2 = 0xffffffff80b8b000
^ after __kernel_end, but inconsistent
Looking further in sys/arch/amd64/amd64/locore.S, one can see that
lines 617-637 compute the end of the bootstrap image by first
initializing it to __kernel_end, then taking the value of either "esym"
or "eblob" passed in bootinfo.
It seems that eblob is never zero, because boot always passes a
struct btinfo_modulelist, with a "num" member equal to 0 when no
modules are loaded by boot:
(gdb) p/x esym
$3 = 0xffffffff00000000 (actually == NULL, only the 32 LSBs are tested)
(gdb) p/x eblob
$4 = 0x80b8b000
(gdb) p/x *(struct btinfo_modulelist *)&bootinfo->bi_data[0x14+0x58+0x18+0x20+0x3c+0x28+0x174]
$21 = {common = {len = 0x10, type = 0xb}, num = 0x0, endpa = 0xb8b000}
(gdb) p/x &__kernel_end
$22 = 0xffffffff80aa6000
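To make the two observations above concrete, here is a small C sketch
using only the values from this gdb session. It shows why esym counts
as unset under a 32-bit test while eblob does not, and what offset
separates eblob from endpa. The helper and macro names are
hypothetical, chosen for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the gdb session above. */
#define ESYM_SEEN  UINT64_C(0xffffffff00000000)
#define EBLOB_SEEN UINT64_C(0x80b8b000)
#define ENDPA_SEEN 0x00b8b000u

/* The low 32 bits, which is all that a 32-bit testl looks at. */
uint32_t low32(uint64_t v)
{
    return (uint32_t)(v & 0xffffffffu);
}

/* Nonzero when a 32-bit test would treat the variable as set. */
int looks_set(uint64_t v)
{
    return low32(v) != 0;
}

/* Offset between the recorded eblob and the endpa passed by boot. */
uint32_t eblob_minus_endpa(void)
{
    assert(low32(EBLOB_SEEN) >= ENDPA_SEEN);
    return low32(EBLOB_SEEN) - ENDPA_SEEN;
}
```

The offset comes out as 0x80000000, which I take to be the value of
KERNBASE_LO (an inference from these numbers, not something I checked
in the headers).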
So eblob is set from the bootinfo data (this happens in
locore.S:491), and its value then replaces __kernel_end in
locore.S:634:
617 /* Find end of kernel image; brings us on (1). */
618 movl $RELOC(__kernel_end),%edi
[...]
629 /* Skip over any modules/blobs; brings us on (3). */
630 movl RELOC(eblob),%eax
631 testl %eax,%eax
632 jz 1f
633 subl $KERNBASE_LO,%eax /* XXX */
634 movl %eax,%edi
635 1:
At best, this disregards the 2MB alignment set up in
amd64/conf/kern.ldscript; at worst, module_start will point inside the
ISA hole or the bootstrap tables
(module_start < __kernel_end + tablesize + IOM_SIZE).
My patch simply avoids setting eblob when btinfo_modulelist::num is 0,
which matches the purpose of the eblob variable as documented in
locore.S:476:
/*
* If any modules were loaded, record where they end. 'eblob' is used
* later to compute the initial bootstrap tables.
*/
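In C terms, the intended behaviour would look roughly like the sketch
below. This is not the locore.S code: the struct layout is only
mirrored from the gdb dump and the 4/8/12 byte offsets used in the
assembly, BTINFO_MODULELIST is taken from the "type = 0xb" seen in gdb,
KERNBASE_LO is assumed to be 0x80000000 (inferred from eblob and
endpa), record_eblob() is a hypothetical name, and the "not enough VA"
check from locore.S is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BTINFO_MODULELIST 0xb         /* 'type' value seen in the gdb dump */
#define KERNBASE_LO       0x80000000u /* assumed, inferred from eblob - endpa */

/* Layout mirrored from the gdb dump; illustrative only. */
struct btinfo_common {
    uint32_t len;
    uint32_t type;
};

struct btinfo_modulelist {
    struct btinfo_common common;
    uint32_t num;   /* number of modules loaded by boot */
    uint32_t endpa; /* physical address where the modules end */
};

/*
 * Return the new value of eblob. The previous value is kept (0 means
 * "unset") unless the entry is a module list with at least one module.
 */
uint32_t record_eblob(const struct btinfo_modulelist *ml, uint32_t eblob)
{
    assert(ml != NULL);
    if (ml->common.type != BTINFO_MODULELIST)
        return eblob;
    if (ml->num == 0) /* the case the patch adds */
        return eblob;
    return ml->endpa + KERNBASE_LO;
}
```

With the bootinfo entry shown earlier (num = 0x0, endpa = 0xb8b000),
eblob stays 0, so the code at lines 630-634 leaves %edi at
__kernel_end.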
This sounds like the right thing when no modules are loaded by boot,
but I'm not sure what happens if modules are actually loaded.
I still do not understand where the value "endpa = 0xb8b000" in
the bootinfo data comes from. It does not correspond to anything in my
netbsd.map. It is computed in sys/arch/i386/stand/lib/exec.c:350 by
the function common_load_kernel(), but I have not analyzed this part.
There is for instance an #ifdef XMS, and I don't know yet whether it is
used or not, but afaik this is the only place where image_end is
initialized to something...
Also, I don't like what is done in locore.S in the lines 617-635
quoted above. Blindly overwriting __kernel_end seems a bit risky, no?
At least, there should be some consistency tests, e.g. in pseudo code:
%edi = __kernel_end
if (esym > %edi) %edi = esym
if (eblob > %edi) %edi = eblob
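A slightly more explicit C sketch of the same idea follows. It assumes
that both esym and eblob hold the low 32 bits of a kernel virtual
address, so that KERNBASE_LO (assumed to be 0x80000000) has to be
subtracted before comparing against the physical __kernel_end; the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE_LO 0x80000000u /* assumed, inferred from eblob - endpa */

/* Convert a low-32-bit kernel VA to a physical address; 0 means unset. */
static uint32_t va_lo_to_pa(uint32_t va_lo)
{
    if (va_lo == 0)
        return 0;
    assert(va_lo >= KERNBASE_LO);
    return va_lo - KERNBASE_LO;
}

/*
 * End of the loaded image: the highest of __kernel_end, esym and eblob,
 * so that neither esym nor eblob can move it below __kernel_end.
 */
uint32_t image_end_pa(uint32_t kernel_end_pa, uint32_t esym, uint32_t eblob)
{
    uint32_t end = kernel_end_pa;
    uint32_t esym_pa = va_lo_to_pa(esym);
    uint32_t eblob_pa = va_lo_to_pa(eblob);

    if (esym_pa > end)
        end = esym_pa;
    if (eblob_pa > end)
        end = eblob_pa;
    return end;
}
```

With the failing kernel's values (__kernel_end at physical 0xc00000,
eblob = 0x80b8b000) this keeps 0xc00000, whereas on the OK kernel
(__kernel_end at 0xaa6000) it yields 0xb8b000 as before.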
What do you think?