Re: Ultrasparc III+ kernel panic

To: Eduardo Horvath <eeh%NetBSD.org@localhost>
Subject: Re: Ultrasparc III+ kernel panic
From: BERTRAND Joël <joel.bertrand%systella.fr@localhost>
Date: Mon, 13 Apr 2015 18:01:48 +0200

Eduardo Horvath wrote:

On Wed, 1 Apr 2015, BERTRAND Joël wrote:

	Hello,

	New panic last night...

1 tt=30 tstate=4411001505 tpc=0x1001488 tnpc=0x100148c
2 tt=30 tstate=4482000603 tpc=0x12e1da0 tnpc=0x12e1da4

Debug information:
(gdb) list *(0x1001488)
(gdb) x/i 0x1001488
    0x1001488 <uspillk4+8>:      sta  %l0, [ %sp ] %asi
(gdb) list *(0x12e1da0)
0x12e1da0 is in mutex_vector_enter (/usr/src/sys/kern/kern_mutex.c:440).
435      *      fast-path stubs are available.  If an mutex_spin_enter() stub is
436      *      not available, then it is also aliased directly here.
437      */
438     void
439     mutex_vector_enter(kmutex_t *mtx)
440     {
441             uintptr_t owner, curthread;
442             turnstile_t *ts;
443     #ifdef MULTIPROCESSOR
444             u_int count;
(gdb) x/i 0x12e1da0
    0x12e1da0 <mutex_vector_enter>:      save  %sp, -176, %sp

mach stack does not return any usable information, only:
db{0} > mach stack
Window 0 frame64 0xe004ff50 locals, ins:
10426baa0 0 15a068000 1044914d0 fffffffffefa2000 0 102cfafd0 180f680
0 0 0 0 0 0 ffffffffffffa011=sp fffffffffed6d200=pc:fffffffffed6d200
Window 1 frame64 0xffffffffffffa810 locals, ins:

	You can see that this panic is exactly the same as the last panic.


I looked at the archives and it doesn't look like I commented on this
previously.

I'm assuming the trap stack is semi-accurate.  The save instruction should
not be able to generate a data access fault, but then the low level bits
of locore.s do some interesting gymnastics with the trap stack to prevent
loss of data, so it may have moved things around.
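
For reference, the tstate values in the trap stack above can be split into their fields. This is only a minimal sketch: it assumes the values are printed in hex and that the layout is the SPARC V9 TSTATE one as implemented on UltraSPARC (CCR in bits 39:32, ASI in bits 31:24, PSTATE in bits 19:8, CWP in bits 4:0). decode_tstate and the two tables are illustrative names, not kernel code.

```python
# PSTATE flag bits as implemented on UltraSPARC (the MM field, bits 7:6, is skipped).
PSTATE_BITS = {
    0x001: "AG", 0x002: "IE", 0x004: "PRIV", 0x008: "AM", 0x010: "PEF",
    0x020: "RED", 0x100: "TLE", 0x200: "CLE", 0x400: "MG", 0x800: "IG",
}

# Only the two ASIs that show up in this trap stack.
ASI_NAMES = {0x11: "ASI_AS_IF_USER_SECONDARY", 0x82: "ASI_PRIMARY_NO_FAULT"}


def decode_tstate(tstate):
    """Split a TSTATE value into its fields (assumed layout, see above)."""
    pstate = (tstate >> 8) & 0xFFF
    asi = (tstate >> 24) & 0xFF
    return {
        "cwp": tstate & 0x1F,
        "pstate": pstate,
        "pstate_flags": [name for bit, name in PSTATE_BITS.items() if pstate & bit],
        "asi": asi,
        "asi_name": ASI_NAMES.get(asi, "unknown"),
        "ccr": (tstate >> 32) & 0xFF,
    }


if __name__ == "__main__":
    # The two values from the trap stack, read as hex (an assumption).
    for level, value in ((1, 0x4411001505), (2, 0x4482000603)):
        print(level, decode_tstate(value))
```

Under those assumptions, entry 1 decodes to ASI 0x11 (an as-if-user ASI) with PRIV set, which would fit a store to the user stack from uspillk4, and entry 2 decodes to ASI 0x82 with PRIV and IE set.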

uspillk4 is used to save alternate space register windows to the stack.
The order of operations is:

1) The CPU is running userland code and traps into the kernel.

2) The kernel switches to the kernel stack and moves the contents of
%canrestore to %otherwin to indicate those register windows are not of the
current address space.

3) The kernel does some stuff and eventually calls mutex_vector_enter().

4) mutex_vector_enter() needs a new register window, so it does a save.

5) The register windows are full, so the CPU takes a store window trap.
Since %otherwin is not zero, it goes to uspillk4 to save other address
space windows instead of kspill4.

6) The trap handler tries to save the window and takes a data fault.

7) The data fault handler punts.
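
Steps 2 to 5 can be followed with a toy model of the window counters. This is a sketch, not kernel code: it assumes 8 register windows and the SPARC V9 rule that CANSAVE + CANRESTORE + OTHERWIN always equals NWINDOWS - 2. WindowState, demo and the starting counter values are made up for illustration.

```python
NWINDOWS = 8  # assumption: UltraSPARC implements 8 register windows


class WindowState:
    """Toy model of the SPARC V9 CANSAVE/CANRESTORE/OTHERWIN counters."""

    def __init__(self, cansave, canrestore, otherwin=0):
        self.cansave = cansave
        self.canrestore = canrestore
        self.otherwin = otherwin
        self.check()

    def check(self):
        # V9 invariant: the three counters always add up to NWINDOWS - 2.
        assert self.cansave + self.canrestore + self.otherwin == NWINDOWS - 2

    def enter_kernel(self):
        # Step 2: the user windows stop being restorable and become "other".
        self.otherwin = self.canrestore
        self.canrestore = 0
        self.check()

    def save(self):
        """Return None if the save succeeds, else the spill handler class."""
        if self.cansave == 0:
            # Step 5: OTHERWIN picks the handler (uspillk4 vs. kspill4 above).
            return "other" if self.otherwin else "normal"
        self.cansave -= 1
        self.canrestore += 1
        self.check()
        return None

    def saved(self):
        # Effect of the `saved` instruction once a window has been written out.
        self.cansave += 1
        if self.otherwin:
            self.otherwin -= 1
        else:
            self.canrestore -= 1
        self.check()


def demo():
    w = WindowState(cansave=2, canrestore=4)  # made-up userland state
    w.enter_kernel()
    results = [w.save(), w.save(), w.save()]  # third save finds no free window
    return w, results


if __name__ == "__main__":
    state, results = demo()
    print(results, state.cansave, state.canrestore, state.otherwin)
```

In this model the third kernel save finds CANSAVE at zero while OTHERWIN is still non-zero, so the "other" spill handler is chosen. Step 6 is where that handler's store to the user stack faults.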

What should happen is:

The CPU takes a save fault at trap level 1.

It takes a data fault at trap level 2.

The data fault handler jumps to winfault.  winfault will look at the
current trap level.  Since it's not 1, it executes some fancy code to
fiddle with the trap stack and figure out what's really happening.  It
should detect a fault during a spill and go to winfixspill.

winfixspill code should save all the otherwin windows to slots in the PCB,
and then continue executing kernel code.

Eventually, when returning to userland, the trap return code will restore
all the userland windows from the PCB and return to userland code.
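
The intended recovery path can be written down as a small model. It is a sketch of the description above only: Pcb, route_data_fault and return_to_userland are hypothetical names, the window contents are placeholder strings, and none of this mirrors the actual locore.s code.

```python
from dataclasses import dataclass, field


@dataclass
class Pcb:
    # Hypothetical stand-in for the PCB slots that hold parked user windows.
    saved_windows: list = field(default_factory=list)


def route_data_fault(trap_level, in_spill_handler):
    """Pick the path described above for a data fault."""
    if trap_level == 1:
        return "normal data fault"   # fault taken directly from ordinary code
    if in_spill_handler:
        return "winfixspill"         # nested fault while spilling a window
    return "other nested fault"      # anything else is outside this sketch


def winfixspill(other_windows, pcb):
    """Park all other-address-space windows in the PCB and leave none behind."""
    pcb.saved_windows.extend(other_windows)
    return []                        # OTHERWIN is empty afterwards


def return_to_userland(pcb):
    """Hand the parked windows back and empty the PCB slots."""
    restored, pcb.saved_windows = pcb.saved_windows, []
    return restored


if __name__ == "__main__":
    pcb = Pcb()
    print(route_data_fault(2, True))
    print(winfixspill(["u0", "u1"], pcb), pcb.saved_windows)
    print(return_to_userland(pcb), pcb.saved_windows)
```

The point of the model is the ordering: the fault at trap level 2 is resolved without touching the user stack, and the user windows only leave the PCB on the way back to userland.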

winfix has a bunch of diagnostic code still enabled.  You do not seem to
be hitting any of the sir instructions sprinkled in the code that would
reset the box.

There's still a lot of debug and diagnostic code in there.  You might want
to try turning some of the NOT_DEBUG or NOTDEF_DEBUG code on.

OK. I understand, but I don't know locore.s well enough to make modifications. I think I would introduce more bugs than I would fix :-P

Also, look for calls to panic.  At line 2149 there's a ta 1, which will cause
a trap, before the call to panic.  That made sense when the kernel still
had traptrace, since that would generate a traptrace entry before all hell
broke loose.  Now it probably just makes things worse.  Try removing it
to really call panic there, or changing it to an sir instruction to
generate a reset.

There's another ta 1 on line 2306 to trap to the debugger.  Since trapping
to ddb is not reliable in this situation, change it to an sir instruction.

Anyway, you probably need to instrument that code path to see where it's
getting confused.

And keep in mind that code is semi-recursive in that you can take a
data fault trying to clean up state to take a data fault.


	I see. And I have seen another panic:

panic: cpu1: ipi_send: couldn't send ipi to UPAID 0 (tried 10000 times)
cpu1: Begin traceback...
cpu1: End traceback...
Frame pointer is at 0x2004e41
Call traceback:

netbsd:cpu_reboot+0x208(182f828, 1, ffff, 77bb78, 1cce380, 1c97000) fp = 2004f01
netbsd:vpanic+0x178(104, 0, 1852638, 1cb6800, f, 1c70740) fp = 2004fb1
netbsd:panic+0x24(1852638, 20059a8, 1cdc800, 1cddaf8, 1cddc00, 104) fp = 2005061
netbsd:sparc64_send_ipi_sun4u+0x1ac(1852638, 1, 0, 2710, fffffffffffffffe, 0) fp = 2005121
netbsd:cpu_need_resched+0x54(f4240, 1018a80, 0, 0, 70, 0) fp = 20051d1
netbsd:sched_changepri+0x64(2014000, 2, 2014000, 101db1d08, 101db1040, 2a) fp = 2005281
netbsd:resetpriority+0x90(1043816c0, 2a, 0, 1, 101daec40, 101daedc0) fp = 2005331
netbsd:sched_pstats+0x118(1043816c0, 0, 1c70868, 0, 10caf5510, 2a) fp = 20053e1
netbsd:uvm_scheduler+0x60(64, 1c71000, 0, 101daedc0, 10caf5510, 1043816c0) fp = 2005491
netbsd:main+0x83c(101d89f00, 1c70740, 1c70740, 101da2c80, 1c0a1fc, 18a0598) fp = 2005541
netbsd:cpu_initialize+0x154(184d500, 10624dd3, 1c97800, 0, 101daee00, 1) fp = 2005621
netbsd:100030+0(f0059840, 113800, 113c00, 111880, 111ce8, 1117f8) fp = fff33651


dumping to dev 25,1 offset 12291071

But I don't understand. With the same kernel, this Blade2000 used to reboot one or more times _per day_, and now its uptime is greater than 8 days. I have saved the kernel image and core if you want them.


	Regards,

	JKB
