Subject: Re: CVS commit: src/sys/kern
To: Darren Reed <darrenr@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-i386
Date: 01/28/2006 15:49:57
--M9NhX3UHpAaciwkO
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat, Jan 28, 2006 at 01:56:28PM +0000, Darren Reed wrote:
> > So DDB_COMMANDONENTER="trace" would probably have worked as well.
> > Can you see where 0xc0100b7f points in your kernel ?
> > I suspect cpu_switch() or trap().
> 
> cpu_switch()
> 
> > If you use DDB_COMMANDONENTER="show registers; trace" (in this order as
> > trace doesn't completely work) we would also have the esp (stack pointer)
> > value.
> > 
> 
> This doesn't exist in 3.0

Sure, it's only in current. I assumed you were working on current.

> 
> > You could also try disas 0xc0100b7f
> 
> (gdb) disas cpu_switch+0x9f
> Dump of assembler code for function idle_loop:
> 0xc0100b73 <idle_loop>: mov    0xc0367668,%ecx
> 0xc0100b79 <idle_loop+6>:       test   %ecx,%ecx
> 0xc0100b7b <idle_loop+8>:       jne    0xc0100b63 <idle_zero>
> 0xc0100b7d <idle_loop+10>:      sti
> 0xc0100b7e <idle_loop+11>:      hlt
> 0xc0100b7f <idle_loop+12>:      nop
> End of assembler dump.

OK, same as in my kernel then.
I don't completely understand why this instruction would cause a fault.
Maybe it's the hlt that faults while trying to save some context on the stack,
or maybe the CPU got an interrupt and faults while trying to save the
current context on the stack, before the instruction pointer has been updated
to point to the interrupt handler.

> 
> > My current theory is that cpu_switch(), while restoring a context, loaded
> > %esp with a bogus value.
> 
> Nasty.

Sure. And it's why debugging it is so hard: a stack trace can't work because
we have the wrong stack.
AFAIK idle_loop is only called when cpu_switch() didn't find any new task
to run, so we're using the idle PCB. This is the one that probably got
corrupted.

This PCB address is stored in cpu_info_primary (it's
cpu_info_primary.ci_idle_pcb). Either this address is
corrupted, or the memory this address points to is corrupted.

You can look at assym.h to get the offset of ci_idle_pcb in cpu_info_primary
(search for IDLE_PCB) and from there work out the address of this pointer.

Then, if you can enter ddb before you get the panic (boot -d should do it),
you can read the value of this pointer. assym.h will also give you the
value of PCB_ESP, so you'll know the address where the idle stack
pointer is stored.
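
As a rough sketch of the arithmetic above (this is not part of the attached
patch): the two addresses are just base + offset. The offset values and the
EX_ names below are made up for illustration; take the real IDLE_PCB and
PCB_ESP values from your own assym.h, and the real addresses from your
kernel's symbol table and from ddb.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical offsets; the real ones are IDLE_PCB and PCB_ESP in assym.h. */
#define EX_IDLE_PCB	0x10
#define EX_PCB_ESP	0x1c

/* Address of the ci_idle_pcb pointer inside cpu_info_primary. */
static uint32_t
idle_pcb_ptr_addr(uint32_t cpu_info_primary_addr)
{
	return cpu_info_primary_addr + EX_IDLE_PCB;
}

/*
 * Address of the saved stack pointer inside the idle PCB.
 * idle_pcb_value is what you read at idle_pcb_ptr_addr() from ddb.
 */
static uint32_t
idle_esp_addr(uint32_t idle_pcb_value)
{
	/* A null idle PCB pointer would already mean corruption. */
	assert(idle_pcb_value != 0);
	return idle_pcb_value + EX_PCB_ESP;
}

/* Print the two addresses in the form used by ddb's watch command. */
static void
print_watch_targets(uint32_t cpu_info_primary_addr, uint32_t idle_pcb_value)
{
	printf("watch 0x%" PRIx32 "\n",
	    idle_pcb_ptr_addr(cpu_info_primary_addr));
	printf("watch 0x%" PRIx32 "\n", idle_esp_addr(idle_pcb_value));
}
```

You would call print_watch_targets() with the address of cpu_info_primary
and the pointer value read in ddb; both arguments are values you have to
look up yourself, nothing here reads kernel memory.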

Once you know these 2 addresses, you can try to set a watchpoint on each
of them (watch <addr>), but this may not work :(
If it doesn't work, you can try the attached patch (untested, but it
compiles). It saves these 2 values and checks, each time cpu_switch() is
called, that they haven't changed. This only works for non-MULTIPROCESSOR
kernels.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--

--M9NhX3UHpAaciwkO
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff

Index: i386/locore.S
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/locore.S,v
retrieving revision 1.36
diff -u -r1.36 locore.S
--- i386/locore.S	11 Dec 2005 12:17:41 -0000	1.36
+++ i386/locore.S	28 Jan 2006 14:49:14 -0000
@@ -817,6 +817,8 @@
 1:
 #endif /* DEBUG */
 
+	call _C_LABEL(cpu_check_idelepcb);
+
 	movl	16(%esp),%esi		# current
 
 	/*
Index: i386/machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/machdep.c,v
retrieving revision 1.569
diff -u -r1.569 machdep.c
--- i386/machdep.c	30 Dec 2005 13:37:57 -0000	1.569
+++ i386/machdep.c	28 Jan 2006 14:49:14 -0000
@@ -2341,3 +2341,32 @@
 	return (MAXGDTSIZ - NGDT);
 #endif
 }
+
+void cpu_check_idelepcb(void);
+
+void
+cpu_check_idelepcb()
+{
+	static void *idle_pcb = NULL;
+	static int esp = 0;
+
+	if (idle_pcb == NULL) {
+		idle_pcb = cpu_info_primary.ci_idle_pcb;
+		esp = cpu_info_primary.ci_idle_pcb->pcb_esp;
+	} else {
+		if (idle_pcb != cpu_info_primary.ci_idle_pcb) {
+			printf("idle_pcb changed %p -> %p\n",
+			    idle_pcb, cpu_info_primary.ci_idle_pcb);
+#ifdef DDB
+			Debugger();
+#endif
+		}
+		if (esp != cpu_info_primary.ci_idle_pcb->pcb_esp) {
+			printf("esp changed %x -> %x\n", esp,
+			    cpu_info_primary.ci_idle_pcb->pcb_esp);
+#ifdef DDB
+			Debugger();
+#endif
+		}
+	}
+}

--M9NhX3UHpAaciwkO--