Port-sparc64 archive


Re: Ultrasparc III+ kernel panic



Eduardo Horvath wrote:
On Wed, 1 Apr 2015, BERTRAND Joël wrote:

	Hello,

	New panic last night...

1 tt=30 tstate=4411001505 tpc=0x1001488 tnpc=0x100148c
2 tt=30 tstate=4482000603 tpc=0x12e1da0 tnpc=0x12e1da4

Debug information :
(gdb) list *(0x1001488)
(gdb) x/i 0x1001488
    0x1001488 <uspillk4+8>:      sta  %l0, [ %sp ] %asi
(gdb) list *(0x12e1da0)
0x12e1da0 is in mutex_vector_enter (/usr/src/sys/kern/kern_mutex.c:440).
435      *      fast-path stubs are available.  If an mutex_spin_enter() stub is
436      *      not available, then it is also aliased directly here.
437      */
438     void
439     mutex_vector_enter(kmutex_t *mtx)
440     {
441             uintptr_t owner, curthread;
442             turnstile_t *ts;
443     #ifdef MULTIPROCESSOR
444             u_int count;
(gdb) x/i 0x12e1da0
    0x12e1da0 <mutex_vector_enter>:      save  %sp, -176, %sp

mach stack does not return usable information. Only this:
db{0} > mach stack
Window 0 frame64 0xe004ff50 locals, ins:
10426baa0 0 15a068000 1044914d0 fffffffffefa2000 0 102cfafd0 180f680
0 0 0 0 0 0 ffffffffffffa011=sp fffffffffed6d200=pc:fffffffffed6d200
Window 1 frame64 0xffffffffffffa810 locals, ins:

	You can see that this panic is exactly the same as the last one.

I looked at the archives and it doesn't look like I commented on this
previously.

I'm assuming the trap stack is semi-accurate.  The save instruction should
not be able to generate a data access fault, but then the low level bits
of locore.s do some interesting gymnastics with the trap stack to prevent
loss of data, so it may have moved things around.

uspillk4 is used to save alternate space register windows to the stack.
The order of operations is:

1) The CPU is running userland code and traps into the kernel.

2) The kernel switches to the kernel stack and moves the contents of
%canrestore to %otherwin to indicate those register windows are not of the
current address space.

3) The kernel does some stuff and eventually calls mutex_vector_enter().

4) mutex_vector_enter() needs a new register window, so it does a save.

5) The register windows are full, so the CPU takes a store window trap.
Since %otherwin is not zero, it goes to uspillk4 to save other address
space windows instead of kspill4.

6) The trap handler tries to save the window and takes a data fault.

7) The data fault handler punts.
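The window bookkeeping behind steps 2, 4 and 5 can be sketched as a small C model. This is only an illustration, not code from locore.s: the struct, the enum and the function names are hypothetical, and NWINDOWS = 8 is an assumption about this CPU. It follows the SPARC V9 rule that a save traps when %cansave is zero, and that a non-zero %otherwin selects the "other address space" spill handler (uspillk4 in the description above) rather than the kernel one.

```c
#include <assert.h>

/* Assumption: 8 register windows, so the three counters sum to NWINDOWS - 2. */
#define NWINDOWS 8

struct winstate {
    int cansave;     /* windows a save can take without trapping */
    int canrestore;  /* windows belonging to the current address space */
    int otherwin;    /* windows belonging to another address space */
};

/* Hypothetical labels for the three outcomes of a save instruction. */
enum save_result { SAVE_OK, SPILL_KERNEL, SPILL_OTHER };

/* Step 2: on kernel entry the user's windows become "other" windows. */
static void kernel_entry(struct winstate *w)
{
    w->otherwin = w->canrestore;
    w->canrestore = 0;
}

/* Steps 4 and 5: classify what a save instruction does in this state. */
static enum save_result do_save(struct winstate *w)
{
    assert(w->cansave + w->canrestore + w->otherwin == NWINDOWS - 2);
    if (w->cansave == 0)
        return w->otherwin != 0 ? SPILL_OTHER : SPILL_KERNEL;
    w->cansave--;
    w->canrestore++;
    return SAVE_OK;
}

/* A spill handler that succeeds ends with "saved": one window is freed. */
static void spill_done(struct winstate *w)
{
    w->cansave++;
    if (w->otherwin != 0)
        w->otherwin--;
    else
        w->canrestore--;
}
```

In this model, a trap from userland with all windows in use leaves %otherwin non-zero, so the first save in the kernel reports SPILL_OTHER. Step 6 is the case where the handler faults before spill_done() is ever reached.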

What should happen is:

The CPU takes a save fault at trap level 1.

It takes a data fault at trap level 2.

The data fault handler jumps to winfault.  winfault will look at the
current trap level.  Since it's not 1, it executes some fancy code to
fiddle with the trap stack and figure out what's really happening.  It
should detect a fault during a spill and go to winfixspill.

winfixspill code should save all the otherwin windows to slots in the PCB,
and then continue executing kernel code.

Eventually, when returning to userland, the trap return code will restore
all the userland windows from the PCB and return to userland code.
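The intended recovery path can be sketched in C as well. Again this is a rough model under stated assumptions, not the real winfault or winfixspill code: every name below is hypothetical, the PCB is reduced to an array of window slots, and the only hardware fact used is that SPARC V9 reserves trap types 0x080 to 0x0bf for window spill traps.

```c
#include <assert.h>

#define PCB_MAXWIN 8                      /* assumption: PCB slot count */

struct window { unsigned long regs[16]; };   /* 8 locals + 8 ins */

struct pcb_model {                        /* hypothetical, simplified PCB */
    int nsaved;
    struct window slots[PCB_MAXWIN];
};

enum fault_action { NORMAL_DATA_FAULT, WINFIX_SPILL, WINFIX_OTHER };

/* SPARC V9 assigns trap types 0x080..0x0bf to window spill traps. */
static int is_spill_trap(unsigned tt)
{
    return tt >= 0x080 && tt <= 0x0bf;
}

/*
 * Decision described for winfault: at trap level 1 this is an ordinary
 * data fault; at a deeper level, look at the trap that was interrupted.
 */
static enum fault_action winfault_dispatch(int tl, unsigned prev_tt)
{
    if (tl == 1)
        return NORMAL_DATA_FAULT;
    return is_spill_trap(prev_tt) ? WINFIX_SPILL : WINFIX_OTHER;
}

/*
 * What winfixspill is described as doing: park the other-address-space
 * windows in PCB slots so kernel code can continue.  Returns the number
 * of windows actually stored.
 */
static int spill_to_pcb(struct pcb_model *pcb,
                        const struct window *wins, int otherwin)
{
    int stored = 0;

    while (stored < otherwin && pcb->nsaved < PCB_MAXWIN)
        pcb->slots[pcb->nsaved++] = wins[stored++];
    return stored;
}
```

The return-to-userland path would then walk the same slots in order and copy each window back out to the user stack before resuming userland code.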

winfix has a bunch of diagnostic code still enabled.  You do not seem to
be hitting any of the sir instructions sprinkled in the code that would
reset the box.

There's still a lot of debug and diagnostic code in there.  You might want
to try turning some of the NOT_DEBUG or NOTDEF_DEBUG code on.

OK. I understand, but I don't know locore.s well enough to modify it. I think I would introduce more bugs than I want to fix :-P

Also, look for calls to panic.  At line 2149 there's a ta 1, which will cause
a trap, before the call to panic.  That made sense when the kernel still
had traptrace, since that would generate a traptrace entry before all hell
broke loose.  Now it probably just makes things worse.  Try removing it
to really call panic there, or changing it to an sir instruction to
generate a reset.

There's another ta 1 on line 2306 to trap to the debugger.  Since trapping
to ddb is not reliable in this situation, change it to an sir instruction.

Anyway, you probably need to instrument that code path to see where it's
getting confused.

And keep in mind that code is semi-recursive in that you can take a
data fault while trying to clean up state in order to take a data fault.

	I see. And I have seen another panic:

panic: cpu1: ipi_send: couldn't send ipi to UPAID 0 (tried 10000 times)
cpu1: Begin traceback...
cpu1: End traceback...
Frame pointer is at 0x2004e41
Call traceback:
netbsd:cpu_reboot+0x208(182f828, 1, ffff, 77bb78, 1cce380, 1c97000) fp = 2004f01
netbsd:vpanic+0x178(104, 0, 1852638, 1cb6800, f, 1c70740) fp = 2004fb1
netbsd:panic+0x24(1852638, 20059a8, 1cdc800, 1cddaf8, 1cddc00, 104) fp = 2005061
netbsd:sparc64_send_ipi_sun4u+0x1ac(1852638, 1, 0, 2710, fffffffffffffffe, 0) fp = 2005121
netbsd:cpu_need_resched+0x54(f4240, 1018a80, 0, 0, 70, 0) fp = 20051d1
netbsd:sched_changepri+0x64(2014000, 2, 2014000, 101db1d08, 101db1040, 2a) fp = 2005281
netbsd:resetpriority+0x90(1043816c0, 2a, 0, 1, 101daec40, 101daedc0) fp = 2005331
netbsd:sched_pstats+0x118(1043816c0, 0, 1c70868, 0, 10caf5510, 2a) fp = 20053e1
netbsd:uvm_scheduler+0x60(64, 1c71000, 0, 101daedc0, 10caf5510, 1043816c0) fp = 2005491
netbsd:main+0x83c(101d89f00, 1c70740, 1c70740, 101da2c80, 1c0a1fc, 18a0598) fp = 2005541
netbsd:cpu_initialize+0x154(184d500, 10624dd3, 1c97800, 0, 101daee00, 1) fp = 2005621
netbsd:100030+0(f0059840, 113800, 113c00, 111880, 111ce8, 1117f8) fp = fff33651

dumping to dev 25,1 offset 12291071

But I don't understand. With the same kernel, this Blade2000 used to reboot one or more times _per day_, and now its uptime is greater than 8 days. I have saved the kernel image and core if you want them.

	Regards,

	JKB

