Port-sparc64 archive
Re: RED State Exception on E3500
On Thu, 18 Aug 2011, Julian Coleman wrote:
> with `cpuctl offline cpu0` (UPA ID 6), the build stopped with a segmentation
> fault in gc, and the third failed with (very similar to the first):
>
> RED State Exception on CPU = 0000.0000.0000.0012
> TL=0000.0000.0000.0005 TT=0000.0000.0000.0010
> TPC=0000.0000.0100.4200
> TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
> TPC=0000.0000.0100.4200
> TL=0000.0000.0000.0003 TT=0000.0000.0000.0010
> TPC=0000.0000.0100.5900
> TL=0000.0000.0000.0002 TT=0000.0000.0000.00c8
> TPC=0000.0000.0100.91d8
> TL=0000.0000.0000.0001 TT=0000.0000.0000.00d8
> TPC=0000.0000.4095.0064
>
> The kernel code relating to those addresses is:
>
> (gdb) disass 0x010091d8
> Dump of assembler code for function rft_user:
> ...
> 0x00000000010091c4 <rft_user+272>: wrpr %g1, %tstate
> 0x00000000010091c8 <rft_user+276>: wrpr 1, %tl
> 0x00000000010091cc <rft_user+280>: wrpr %g2, 0, %tpc
> 0x00000000010091d0 <rft_user+284>: wrpr %g3, 0, %tnpc
> 0x00000000010091d4 <rft_user+288>: wrpr %g1, %tstate
> 0x00000000010091d8 <rft_user+292>: restore
> 0x00000000010091dc <rft_user+296>: rdpr %canrestore, %g5
>
> (gdb) disass 0x01005900
> Dump of assembler code for function nufill4:
> ...
> 0x00000000010058f8 <nufill4+120>: nop
> 0x00000000010058fc <nufill4+124>: nop
> 0x0000000001005900 <nufill4+128>: andcc %sp, 1, %i0
> 0x0000000001005904 <nufill4+132>: bne 0x1005804 <nufill8+4>
>
> (gdb) disass 0x01004200
> Dump of assembler code for function ktextfault:
> 0x00000000010041f8 <ktextfault+248>: nop
> 0x00000000010041fc <ktextfault+252>: nop
> 0x0000000001004200 <ktextfault+256>: b,a %icc, 0x10089bc <slowtrap>
> 0x0000000001004204 <ktextfault+260>: nop
>
> Any ideas? Hardware or software (POST on diag-level max shows no errors)?
Let's take a look at this failure.
TL1 is a 0xd8 fault, which is a window fill fault at address 0x40950064,
wherever that is. That doesn't really look like a valid kernel address.
TL2 is a 0xc8 fault, which is a window fill fault at address 0x010091d8,
which is the restore instruction in rft_user.
TL3 is a 0x10 fault, which is an illegal instruction fault at 0x01005900,
which is a nucleus fill fault handler.
TL4 is another illegal instruction fault at 0x01004200, which is in the
trap table itself.
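To make the decoding above easier to check, here is a small sketch in Python (not kernel code) that names the trap types in the dump and computes where the CPU vectors for each one. It uses the SPARC V9 layout, where the handler address is TBA | (TL > 0) << 14 | TT << 5. The TBA value of 0x01000000 is an assumption inferred from the addresses in the dump, and `decode_tt` and `vector` are illustrative names, not anything from the NetBSD sources.

```python
# Sketch: name SPARC V9 trap types and compute trap-table vector addresses.
# ASSUMPTION: TBA is 0x01000000, inferred from the addresses in the dump.
TBA = 0x01000000


def decode_tt(tt):
    """Return the SPARC V9 name for the trap types relevant to this dump."""
    if tt == 0x10:
        return "illegal_instruction"
    # Spill/fill traps come in groups of 8 handlers, 4 table entries each.
    if 0x80 <= tt <= 0x9F:
        return "spill_%d_normal" % ((tt - 0x80) >> 2)
    if 0xA0 <= tt <= 0xBF:
        return "spill_%d_other" % ((tt - 0xA0) >> 2)
    if 0xC0 <= tt <= 0xDF:
        return "fill_%d_normal" % ((tt - 0xC0) >> 2)
    if 0xE0 <= tt <= 0xFF:
        return "fill_%d_other" % ((tt - 0xE0) >> 2)
    return "other (0x%03x)" % tt


def vector(tt, tl_before_trap, tba=TBA):
    """Handler address for trap type tt taken while running at tl_before_trap."""
    upper_half = 1 if tl_before_trap > 0 else 0
    return tba | (upper_half << 14) | (tt << 5)


# (TL after the trap, TT, TPC) as printed in the RED state dump.
dump = [
    (1, 0xD8, 0x40950064),
    (2, 0xC8, 0x010091D8),
    (3, 0x10, 0x01005900),
    (4, 0x10, 0x01004200),
    (5, 0x10, 0x01004200),
]

for tl, tt, tpc in dump:
    print("TL%d: TT=0x%02x %-20s TPC=0x%08x -> handler at 0x%08x"
          % (tl, tt, decode_tt(tt), tpc, vector(tt, tl - 1)))
```

Under that TBA assumption, the handler address for the 0xc8 trap taken at TL1 comes out as 0x01005900, which is exactly the TPC of the illegal instruction recorded at TL3, and the handler address for 0x10 taken at TL>0 is 0x01004200, the TPC recorded at TL4 and TL5. So each trap in the chain faulted on the first instruction of the previous trap's handler.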
It sounds like the machine was running along happily in userland until it
took a window fill fault, which probably resulted in an MMU fault that is
no longer evident in the trap registers. Then it tried to return to
userland and took a fault to refill the userland register windows, but
executing instructions in the trap table resulted in illegal instructions.
I expect this could be due to either bad hardware or incorrect values
somehow getting into the instruction cache.
If you want to determine for sure whether it's a hardware problem or a
software problem, you can add a little loop like the one in blast_icache
to clear out the instruction cache just before the RESTORE instruction.
Register scheduling may be an issue there, but traps are already disabled
so the code should be simpler.
Let's see... %g1, %g2, and %g3 should be available. Something like this:
5:
/*
 * Set up our return trapframe so we can recover if we trap from here
 * on in.
 */
wrpr %g0, 1, %tl ! Set up the trap state
wrpr %g2, 0, %tpc
wrpr %g3, 0, %tnpc
wrpr %g1, %g0, %tstate
>
> sethi %hi(icache_size), %g1
> ld [%g1 + %lo(icache_size)], %g1
> sethi %hi(icache_line_size), %g2
> ld [%g2 + %lo(icache_line_size)], %g2
> sub %g1, %g2, %g1
>7:
> stxa %g0, [%g1] ASI_ICACHE_TAG
> brnz,pt %g1, 7b
> sub %g1, %g2, %g1
> sethi %hi(KERNBASE), %g1
> flush %g1
> membar #Sync
>
restore
6:
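As a sanity check on the loop bounds, the flush loop marked with ">" above can be modelled in Python. This is only a model of the control flow, not something to run on the machine. The 16 KB cache size and 32-byte line size are illustrative assumptions; the real values come from icache_size and icache_line_size at run time. `flush_offsets` is a made-up name for the sketch.

```python
def flush_offsets(icache_size, line_size):
    """Return the offsets the stxa in the loop would write, in order."""
    offsets = []
    g1 = icache_size - line_size      # sub %g1, %g2, %g1 (before label 7)
    while True:
        offsets.append(g1)            # stxa %g0, [%g1] ASI_ICACHE_TAG
        taken = g1 != 0               # brnz,pt %g1, 7b tests %g1 first
        g1 -= line_size               # delay slot always runs (not annulled)
        if not taken:
            break
    return offsets


# ASSUMPTION: illustrative sizes only, not read from the hardware.
offs = flush_offsets(16 * 1024, 32)
print(len(offs), hex(offs[0]), hex(offs[-1]))
```

The model shows the loop writing one tag per line, from the last line offset down to offset 0, with %g1 going negative only after the final store. That is why %g1 has to be reloaded with KERNBASE before the flush.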
Hm. blast_icache* could be optimized by restructuring the loop and using
SUBCC instead of BRNZ. Well, a couple of cycles here or there won't
matter.
Eduardo