Port-sparc64 archive


Re: Ultrasparc III+ kernel panic




Eduardo Horvath wrote:
On Tue, 24 Feb 2015, BERTRAND Joël wrote:

Eduardo Horvath wrote:
On Tue, 24 Feb 2015, BERTRAND Joël wrote:

matthew green wrote:
Hm.  From what I remember, f000xxxx is inside OBP.

that's correct :-)

Instead of randomly swapping out hardware you really should try to
diagnose the problem.  I'd turn on ddb and traptrace in the kernel and
examine the contents of the traptrace buffer after the panic.  That should
tell us the sequence of traps that caused the panic.

FWIW, traptrace never was updated for SMP.


	Is there any hope of a quick fix that would get traptrace output into
syslog? I'm trying to reproduce this bug on a Blade 2000 I have at home,
without any success.

Putting traptrace back in is not trivial.  It basically involves taking
all of the traptrace code that was removed in locore.s version 1.214,
enhancing it for SMP, and reinserting it into locore.s.  How good are your
SPARC assembly language skills?

	I haven't written SPARC assembly in a very long time (and only on
sparc32...) :-(

	I can try to do something, but I'm not sure I have the required
knowledge to do it without help.

I can give you some advice, but I don't have the time or easy access to
the hardware to re-implement traptrace.

Take a look at the diffs between locore.s versions 1.213 and 1.214.  Some
of that code needs to be added back.  The first thing to do is rewrite
this TRACEIT macro:

-#define	TRACEIT(tt,r3,r4,r2,r6,r7)					\
-	set	trap_trace, r2;						\
-	lduw	[r2+TRACEDIS], r4;					\
-	brnz,pn	r4, 1f;							\
-	 lduw	[r2+TRACEPTR], r3;					\
-	rdpr	%tl, r4;						\
-	cmp	r4, 1;							\
-	sllx	r4, 13, r4;						\
-	rdpr	%pil, r6;						\
-	or	r4, %g5, r4;						\
-	mov	%g0, %g5;						\
-	andncc	r3, (TRACESIZ-1), %g0;	/* At end of buffer? */		\
-	sllx	r6, 9, r6;						\
-	or	r6, r4, r4;						\
-	movnz	%icc, %g0, r3;		/* Wrap buffer if needed */	\
-	rdpr	%tstate, r6;						\
-	rdpr	%tpc, r7;						\
-	sth	r4, [r2+r3];						\
-	inc	2, r3;							\
-	sth	%g5, [r2+r3];						\
-	inc	2, r3;							\
-	stw	r6, [r2+r3];						\
-	inc	4, r3;							\
-	stw	%sp, [r2+r3];						\
-	inc	4, r3;							\
-	stw	r7, [r2+r3];						\
-	inc	4, r3;							\
-	mov	TLB_TAG_ACCESS, r7;					\
-	ldxa	[r7] ASI_DMMU, r7;					\
-	stw	r7, [r2+r3];						\
-	inc	4, r3;							\
-	stw	r3, [r2+TRACEPTR];					\
-1:

What the code does is check the contents of TRACEDIS.  If it's zero, it
loads TRACEPTR, writes a bunch of stuff to the buffer, and updates
TRACEPTR.

To simplify adding fields to the traptrace structure I wrote the code as a
series of stores and pointer increments.  Instead of that, it needs to be
written as a single pointer increment followed by the store operations.
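
For reference, the stores in the macro as quoted add up to a 20-byte record.
The sketch below is only a hypothetical C view of that record, not code from
locore.s: the struct, field names and TE_* constants are invented for
illustration, and it assumes %g5 holds the trap type on entry and that no
fields are added. It shows the fixed offsets that could replace the
store/increment pairs, and how the first halfword is packed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of one trace record, mirroring the stores in TRACEIT. */
struct trace_entry {
	uint16_t tl_pil_tt;  /* (%tl << 13) | (%pil << 9) | %g5 (assumed: trap type) */
	uint16_t zero;       /* second halfword: the macro stores a cleared %g5 */
	uint32_t tstate;     /* low 32 bits of %tstate */
	uint32_t sp;         /* low 32 bits of %sp */
	uint32_t tpc;        /* low 32 bits of %tpc */
	uint32_t tag_access; /* low 32 bits of the D-MMU TLB_TAG_ACCESS register */
};

/* Fixed offsets that would replace the "store, then inc" pairs. */
enum {
	TE_HDR    = offsetof(struct trace_entry, tl_pil_tt),
	TE_ZERO   = offsetof(struct trace_entry, zero),
	TE_TSTATE = offsetof(struct trace_entry, tstate),
	TE_SP     = offsetof(struct trace_entry, sp),
	TE_TPC    = offsetof(struct trace_entry, tpc),
	TE_TAG    = offsetof(struct trace_entry, tag_access),
	TE_SIZE   = sizeof(struct trace_entry)
};

/* 2 + 2 + 4 + 4 + 4 + 4 bytes, matching the sth/stw sequence in the macro. */
static_assert(sizeof(struct trace_entry) == 20, "unexpected padding");

/*
 * Pack the first halfword the way the macro does with sllx/or.
 * The masks are an assumption; the assembly relies on sth truncating
 * the value to 16 bits instead.
 */
static uint16_t pack_header(unsigned tl, unsigned pil, unsigned tt)
{
	return (uint16_t)(((tl & 0x7u) << 13) | ((pil & 0xfu) << 9) | (tt & 0x1ffu));
}
```

With this layout, a trap type 0x34 taken at %tl 1 and %pil 0 would be recorded
as pack_header(1, 0, 0x34), and the store of %tpc would go to the entry base
plus TE_TPC.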

Then get rid of the last instruction that updates TRACEPTR, instead
creating a spinloop at the beginning that looks something like this:

-	lduw	[r2+TRACEPTR], r3;					\
+0:									\
+	add	r2, TRACEPTR, r4;					\
+	lduw	[r4], r3;	/* Load the offset of the next slot */	\
+	add	r3, ENTRY_SIZE /* <- Needs to be calculated */, r6; /* Allocate */ \
+	cas	[r4], r3, r6;	/* Store r6 if TRACEPTR is still r3; r6 gets the old value */ \
+	cmp	r3, r6;							\
+	bne,pn	%icc, 0b;	/* Oops.. spin */			\
+	 add	r2, r3, r3;	/* r3 now points to the entry. */	\

All the register+register stores ([r2+r3]) need to be rewritten as r3+constant.
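
The allocation scheme described above can be modelled in C with C11 atomics.
This is a minimal sketch under stated assumptions, not a replacement for the
assembly: trace_buf, trace_ptr, trace_alloc, trace_record, TRACE_SIZE and
ENTRY_SIZE are all hypothetical names and values (ENTRY_SIZE of 20 is the sum
of the stores in the macro as quoted), and the wrap test copies the
andncc/movnz check from the original macro.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define TRACE_SIZE 4096u /* hypothetical buffer size in bytes, power of two */
#define ENTRY_SIZE 20u   /* hypothetical: 2 + 2 + 4 + 4 + 4 + 4 bytes */

/* Slack at the end so an entry that starts just below TRACE_SIZE still fits. */
static uint8_t trace_buf[TRACE_SIZE + ENTRY_SIZE];
/* Plays the role of the word at TRACEPTR: offset of the next free slot. */
static _Atomic uint32_t trace_ptr;

/* Reserve one entry and return its byte offset into trace_buf. */
static uint32_t trace_alloc(void)
{
	uint32_t old = atomic_load(&trace_ptr);
	uint32_t start, next;

	do {
		/* Wrap to the start once the offset has run past the buffer. */
		start = (old & ~(TRACE_SIZE - 1u)) ? 0u : old;
		next = start + ENTRY_SIZE;
		/* On failure, old is reloaded with the current value and we retry. */
	} while (!atomic_compare_exchange_weak(&trace_ptr, &old, next));

	return start;
}

/* Fill a reserved entry using constant offsets from its base address. */
static uint32_t trace_record(uint16_t hdr, uint32_t tstate, uint32_t sp,
    uint32_t tpc, uint32_t tag)
{
	uint32_t off = trace_alloc();
	uint8_t *e = trace_buf + off;
	uint16_t zero = 0;

	memcpy(e + 0, &hdr, sizeof(hdr));
	memcpy(e + 2, &zero, sizeof(zero));
	memcpy(e + 4, &tstate, sizeof(tstate));
	memcpy(e + 8, &sp, sizeof(sp));
	memcpy(e + 12, &tpc, sizeof(tpc));
	memcpy(e + 16, &tag, sizeof(tag));
	return off;
}
```

Because the pointer is advanced before the entry is filled, each CPU writes
only to the slot it reserved. A reader can still see a slot that has been
reserved but not yet written, so the buffer is best examined after the
machine has stopped, for example from ddb.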

After that, traceit and traceitwin should be able to use the TRACEIT
macro.

Hm.  There may be some reason why I implemented traceit and traceitwin
with inline code rather than the TRACEIT macro, but I don't recall right
now.

I tried to revert this patch, but there are too many differences between revision 1.214 and the current locore.s.

I have tried to get more information over the serial line, without any real success. When the kernel crashes, ddb is often dead too (or the system no longer sends data over the serial line) :-(

	The kernel crash dumps were not usable.

I don't know how to obtain more useful information. If you want, I can open access to a stable Blade 2000 (ssh + serial line) and to the faulty system (ssh only).

	My most recent panic messages:

Mar 6 17:28:27 legendre /netbsd: trap type 0x34: cpu 1, pc=f0008380 npc=f0008384 pstate=0xffffffff88820006<PRIV,IE>

Mar  4 16:33:19 legendre /netbsd: trap type 0x34: cpu 1, pc=f0009080
text_access_fault: pc=5abf1cd8 va=5abf0000
Mar 4 16:33:19 legendre /netbsd: npc=f0009084 pstate=0xffffffff88820006<PRIV,IE>

Mar 5 18:28:36 legendre /netbsd: cpu1: data fault: pc=f000b1e0 rpc=103b435e0 addr=0
Mar  5 18:28:36 legendre /netbsd: text_access_fault: pc=5a01bcd8 va=5a01a000

Mar  4 00:53:39 legendre /netbsd: trap type 0x34: cpu 0, pc=f000898c
Skipping crash dump on recursive panic
Mar 4 00:53:39 legendre /netbsd: npc=f0008990 pstate=0xffffffff88820006<PRIV,IE>

Feb 27 11:13:26 legendre /netbsd: text_access_fault: pc=59fedcd8 va=59fec000
Feb 27 11:13:26 legendre /netbsd: Skipping crash dump on recursive panic
Feb 27 11:13:26 legendre /netbsd: panic: kernel fault

Mar  9 06:33:04 legendre /netbsd: text_access_fault: pc=59f9dcd8 va=59f9c000

	Best regards,

	JKB



