NetBSD-Bugs archive


Re: kern/52252: uvm_km_check_empty panic when loading any module



The following reply was made to PR kern/52252; it has been noted by GNATS.

From: Anthony Mallet <tho%netbsd.org@localhost>
To: gnats-bugs%NetBSD.org@localhost, kern-bug-people%netbsd.org@localhost
Cc: 
Subject: Re: kern/52252: uvm_km_check_empty panic when loading any module
Date: Mon, 29 May 2017 13:26:51 +0200

 On Wednesday 24 May 2017, at 23:52, Anthony Mallet wrote:
 > It appear that backing out this change
 [...]
 > cvs rdiff -u -r1.21 -r1.22 src/sys/arch/amd64/conf/kern.ldscript
 
 I made some progress on this.
 To make a long story short, this patch fixes the issue for me.
 If it looks OK, can somebody commit it? (see the discussion below for
 a detailed explanation, and possible other fixes).
 
 Index: sys/arch/amd64/amd64/locore.S
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/amd64/amd64/locore.S,v
 retrieving revision 1.123
 diff -u -r1.123 locore.S
 --- sys/arch/amd64/amd64/locore.S	25 Mar 2017 15:07:21 -0000	1.123
 +++ sys/arch/amd64/amd64/locore.S	29 May 2017 10:38:50 -0000
 @@ -478,6 +478,8 @@
  	 */
  	cmpl	$BTINFO_MODULELIST,4(%esi) /* btinfo_common::type */
  	jne	bootinfo_copy
 +	cmpl	$0,8(%esi)		/* btinfo_modulelist::num */
 +	jz	bootinfo_copy
 
  	/* Skip the modules if we won't have enough VA to map them */
  	movl	12(%esi),%eax		/* btinfo_modulelist::endpa */
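 
 For readers who prefer C, here is a minimal sketch of what the patched
 code is meant to do for one bootinfo entry. It is not the kernel's code:
 eblob_for_entry() is a hypothetical helper, the struct layout is taken
 from the offsets used in the patch (4, 8, 12), and BTINFO_MODULELIST,
 KERNBASE_LO and the relation eblob = endpa + KERNBASE_LO are assumptions
 inferred from the gdb values quoted later in this message. The check on
 available VA is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, inferred from the gdb output in this message. */
#define BTINFO_MODULELIST 0xb          /* common.type of the module list entry */
#define KERNBASE_LO 0x80000000u        /* low 32 bits of the kernel base */

/* Field order follows the offsets used in the patch: 4, 8 and 12. */
struct btinfo_common {
	uint32_t len;
	uint32_t type;
};

struct btinfo_modulelist {
	struct btinfo_common common;
	uint32_t num;                  /* number of modules loaded by boot */
	uint32_t endpa;                /* physical address where loaded data ends */
};

/*
 * Hypothetical helper: return the value 'eblob' should take for one
 * bootinfo entry, or 0 to leave it unset.
 */
uint32_t
eblob_for_entry(const struct btinfo_modulelist *ml)
{
	assert(ml != NULL);
	if (ml->common.type != BTINFO_MODULELIST)
		return 0;              /* not a module list entry */
	if (ml->num == 0)
		return 0;              /* the check added by the patch */
	return ml->endpa + KERNBASE_LO;
}
```

 With the entry shown in the gdb session below (num = 0, endpa =
 0xb8b000), this leaves eblob unset, so __kernel_end is kept later on.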
 
 
 --- The long story ------------------------------------------------
 
 I noticed that different combinations of kernel and hardware triggered
 the issue in a seemingly random way, so I started to investigate why
 changing kern.ldscript (as mentioned in my previous e-mail) sometimes
 fixed it.
 
 On both failing and OK machines, I noticed an inconsistency between
 the following variables:
 
 * __kernel_end:  set by kern.ldscript, the end of the loaded image,
    aligned on 2MB since the most recent commit in it.
 
 * kern_end: set in sys/arch/amd64/amd64/machdep.c:1565:
 	/* End of the virtual space we have created so far. */
 	kern_end = (vaddr_t)atdevbase + IOM_SIZE;
 
 'atdevbase' is computed in sys/arch/amd64/amd64/locore.S:871:
 	/* Relocate atdevbase. */
 	movq	$(TABLESIZE+KERNBASE),%rdx
 	addq	%rsi,%rdx
 	movq	%rdx,_C_LABEL(atdevbase)(%rip)
 
 In the absence of symbols and modules loaded by boot, atdevbase is
 supposed to be __kernel_end + tablesize (%rsi points to the beginning
 of the BOOTSTRAP TABLES at this point); see e.g. the comment at
 locore.S:592.
 
 But in gdb, on failing kernels, I had for instance this:
 (gdb) p/x &__kernel_end
 $1 = 0xffffffff80c00000
 (gdb) p/x kern_end
 $3 = 0xffffffff80c13000
 (gdb) p/x atdevbase
 $2 = 0xffffffff80bb3000
 (gdb) p/x atdevbase - tablesize
 $6 = 0xffffffff80b8b000
                ^ before __kernel_end!
 
 On an OK kernel, it was still inconsistent, although safe:
 (gdb) p/x &__kernel_end
 $1 = 0xffffffff80aa6000
 (gdb) p/x atdevbase - tablesize
 $2 = 0xffffffff80b8b000
                ^ after __kernel_end, but inconsistent
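 
 The comparison made in the two gdb sessions above can be written as a
 small check. This is only a sketch: tables_start() and overlap_bytes()
 are hypothetical helpers, and the tablesize of 0x28000 used in the note
 below is derived from the failing session ($2 - $6), not read from the
 kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: where the bootstrap tables begin. */
uint64_t
tables_start(uint64_t atdevbase, uint64_t tablesize)
{
	assert(atdevbase >= tablesize);
	return atdevbase - tablesize;
}

/*
 * Hypothetical helper: number of bytes by which the bootstrap tables
 * start below the end of the kernel image, or 0 if they start at or
 * after it.
 */
uint64_t
overlap_bytes(uint64_t kernel_end, uint64_t start)
{
	return kernel_end > start ? kernel_end - start : 0;
}
```

 With the failing kernel's values (__kernel_end = 0xffffffff80c00000,
 tables starting at 0xffffffff80b8b000) the overlap is 0x75000 bytes;
 with the OK kernel's __kernel_end = 0xffffffff80aa6000 it is 0.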
 
 Looking further in sys/arch/amd64/amd64/locore.S, one can see that
 lines 617-637 compute the end of the bootstrap image by first
 initializing it to __kernel_end, then taking the value of either "esym"
 or "eblob" passed in bootinfo.
 
 It seems that eblob is never zero, because boot always passes a
 struct btinfo_modulelist, with a "num" member equal to 0 when no
 modules are loaded by boot:
 
 (gdb) p/x esym
 $3 = 0xffffffff00000000 (actually == NULL; only the low 32 bits are tested)
 (gdb) p/x eblob
 $4 = 0x80b8b000
 (gdb) p/x *(struct btinfo_modulelist *)&bootinfo->bi_data[0x14+0x58+0x18+0x20+0x3c+0x28+0x174]
 $21 = {common = {len = 0x10, type = 0xb}, num = 0x0, endpa = 0xb8b000}
 (gdb) p/x &__kernel_end
 $22 = 0xffffffff80aa6000
 
 So eblob is set as indicated in the bootinfo data (this happens in
 locore.S:491), and its value then replaces __kernel_end in %edi, at
 locore.S:634:
 
 617	/* Find end of kernel image; brings us on (1). */
 618	movl	$RELOC(__kernel_end),%edi
 [...]
 629	/* Skip over any modules/blobs; brings us on (3). */
 630	movl	RELOC(eblob),%eax
 631	testl	%eax,%eax
 632	jz	1f
 633	subl	$KERNBASE_LO,%eax	/* XXX */
 634	movl	%eax,%edi
 635 1:
 
 At best, this disregards the 2MB alignment set up in
 amd64/conf/kern.ldscript; at worst, module_start will point inside the
 ISA hole or the bootstrap tables
 (module_start < __kernel_end + tablesize + IOM_SIZE).
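 
 To make the effect concrete, here is a rough C model of the quoted
 lines 617-635. It is a sketch, not the kernel's code: image_end_current()
 is a hypothetical name, the esym step hidden behind "[...]" is left out,
 and it assumes that KERNBASE_LO is 0x80000000 and that RELOC() subtracts
 the kernel base, which is consistent with the gdb values above.

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE_LO 0x80000000u        /* low 32 bits of the kernel base */

/*
 * Start from the physical end of the kernel image, then let a nonzero
 * eblob replace it unconditionally, as the quoted assembly does.
 */
uint32_t
image_end_current(uint32_t kernel_end_pa, uint32_t eblob)
{
	uint32_t edi = kernel_end_pa;      /* movl $RELOC(__kernel_end),%edi */

	if (eblob != 0) {                  /* testl %eax,%eax ; jz 1f */
		assert(eblob >= KERNBASE_LO);
		edi = eblob - KERNBASE_LO; /* subl $KERNBASE_LO,%eax ; movl %eax,%edi */
	}
	return edi;
}
```

 Feeding it the eblob value shown above (0x80b8b000) gives 0xb8b000
 whatever __kernel_end was: above the OK kernel's 0xaa6000, but below the
 failing kernel's 0xc00000.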
 
 My patch simply does not set eblob when btinfo_modulelist::num is 0,
 which matches the purpose of the eblob variable as documented in
 locore.S:476:
 	/*
 	 * If any modules were loaded, record where they end. 'eblob' is used
 	 * later to compute the initial bootstrap tables.
 	 */
 
 This sounds like the right thing when no modules are loaded by boot,
 but I'm not sure what happens if modules are actually loaded.
 
 I still do not understand where the value "endpa = 0xb8b000" in
 the bootinfo data comes from. It does not correspond to anything in my
 netbsd.map. It is computed in sys/arch/i386/stand/lib/exec.c:350 by
 the function common_load_kernel(), but I have not analyzed this part.
 There is for instance an #ifdef XMS there, and I don't know yet whether
 it is used, but afaik this is the only place where image_end is
 initialized to something...
 
 Also, I don't like what is done in locore.S in the lines 617-635
 quoted above. Blindly overwriting __kernel_end seems a bit risky, no?
 At least, there should be some consistency tests, e.g. in pseudo code:
 
 	%edi = __kernel_end
 	if (esym > %edi) %edi = esym
 	if (eblob > %edi) %edi = eblob
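 
 The same idea as a small C function, for illustration only. The name
 image_end_guarded() is hypothetical, and the sketch assumes all three
 values have already been converted to physical addresses, with 0 meaning
 "not set"; the real code would have to do that conversion first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Take the highest of the kernel image end, the symbol table end and the
 * module/blob end, so that the result can never fall below __kernel_end.
 */
uint32_t
image_end_guarded(uint32_t kernel_end_pa, uint32_t esym_pa, uint32_t eblob_pa)
{
	uint32_t edi = kernel_end_pa;

	assert(kernel_end_pa != 0);
	if (esym_pa > edi)
		edi = esym_pa;
	if (eblob_pa > edi)
		edi = eblob_pa;
	return edi;
}
```

 With the failing kernel's values (kernel end 0xc00000, endpa 0xb8b000)
 this keeps 0xc00000; with the OK kernel's 0xaa6000 it still moves up to
 0xb8b000.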
 
 What do you think?
 


