Re: RED State Exception on E3500

To: Julian Coleman <jdc%coris.org.uk@localhost>
Subject: Re: RED State Exception on E3500
From: Eduardo Horvath <eeh%NetBSD.org@localhost>
Date: Thu, 18 Aug 2011 16:22:54 +0000 (UTC)

On Thu, 18 Aug 2011, Julian Coleman wrote:

> with `cpuctl offline cpu0` (UPA ID 6), the build stopped with a segmentation
> fault in gc, and the third failed with (very similar to the first):
> 
> RED State Exception on CPU = 0000.0000.0000.0012
> TL=0000.0000.0000.0005 TT=0000.0000.0000.0010
>    TPC=0000.0000.0100.4200
> TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
>    TPC=0000.0000.0100.4200
> TL=0000.0000.0000.0003 TT=0000.0000.0000.0010
>    TPC=0000.0000.0100.5900
> TL=0000.0000.0000.0002 TT=0000.0000.0000.00c8
>    TPC=0000.0000.0100.91d8
> TL=0000.0000.0000.0001 TT=0000.0000.0000.00d8
>    TPC=0000.0000.4095.0064
> 
> The kernel code relating to those addresses is:
> 
>   (gdb) disass 0x010091d8
>   Dump of assembler code for function rft_user:
>     ...
>   0x00000000010091c4 <rft_user+272>:      wrpr  %g1, %tstate
>   0x00000000010091c8 <rft_user+276>:      wrpr  1, %tl
>   0x00000000010091cc <rft_user+280>:      wrpr  %g2, 0, %tpc
>   0x00000000010091d0 <rft_user+284>:      wrpr  %g3, 0, %tnpc
>   0x00000000010091d4 <rft_user+288>:      wrpr  %g1, %tstate
>   0x00000000010091d8 <rft_user+292>:      restore 
>   0x00000000010091dc <rft_user+296>:      rdpr  %canrestore, %g5
> 
>   (gdb) disass 0x01005900
>   Dump of assembler code for function nufill4:
>     ...
>   0x00000000010058f8 <nufill4+120>:       nop 
>   0x00000000010058fc <nufill4+124>:       nop 
>   0x0000000001005900 <nufill4+128>:       andcc  %sp, 1, %i0
>   0x0000000001005904 <nufill4+132>:       bne  0x1005804 <nufill8+4>
> 
>   (gdb) disass 0x01004200
>   Dump of assembler code for function ktextfault:
>   0x00000000010041f8 <ktextfault+248>:    nop 
>   0x00000000010041fc <ktextfault+252>:    nop 
>   0x0000000001004200 <ktextfault+256>:    b,a   %icc, 0x10089bc <slowtrap>
>   0x0000000001004204 <ktextfault+260>:    nop 
> 
> Any ideas?  Hardware or software (POST on diag-level max shows no errors)?

Let's take a look at this failure.

TL1 is a 0xd8 fault which is a window fill fault at address 0x40950064, 
wherever that is.  That doesn't really look like a valid kernel address.

TL2 is a 0xc8 fault which is a window fill fault at address 0100.91d8, 
which is the restore instruction in rft_user.  

TL3 is a 0x10 fault which is an illegal instruction fault at 0100.5900, 
which is a nucleus fill fault handler.

TL4 is another illegal instruction at 0100.4200, which is in the trap 
table itself, and TL5 repeats it at the same address.
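
The trap types and TPCs in the dump can be cross-checked against the 
SPARC V9 trap vector layout, where the handler address is formed from 
TBA, a bit that is set when the trap is taken at TL > 0, and the trap 
type times 32.  A minimal Python sketch follows; the trap table base of 
0x01000000 is an assumption (the dump does not print it), and the helper 
names are made up for illustration.

```python
# Sketch: decode SPARC V9 trap types and compute trap vector addresses.
# ASSUMPTION: trap table base (TBA) is 0x01000000; the dump does not show it.
ASSUMED_TBA = 0x01000000

def trap_vector(tba, tt, from_tl_gt_0):
    """Vector address: TBA<63:15> | (TL>0)<<14 | TT<<5 (32 bytes per entry)."""
    return (tba & ~0x7FFF) | (int(from_tl_gt_0) << 14) | ((tt & 0x1FF) << 5)

def trap_name(tt):
    """Name the trap types seen in the dump (window traps, illegal insn)."""
    if tt == 0x10:
        return "illegal_instruction"
    for base, fmt in ((0x80, "spill_%d_normal"), (0xA0, "spill_%d_other"),
                      (0xC0, "fill_%d_normal"), (0xE0, "fill_%d_other")):
        if base <= tt < base + 0x20:
            return fmt % ((tt - base) >> 2)   # four vector slots per handler
    return "unknown"

if __name__ == "__main__":
    for tt in (0xD8, 0xC8, 0x10):
        print("TT=%#05x %-20s TL>0 vector=%#010x"
              % (tt, trap_name(tt), trap_vector(ASSUMED_TBA, tt, True)))
```

Under that assumption, 0xc8 taken at TL > 0 vectors to 0x01005900 and 
0x10 taken at TL > 0 vectors to 0x01004200, which matches the TPCs 
recorded at TL3 and TL4: each handler faulted on its first instruction.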

It sounds like the machine was running along happily in userland until it 
took a window fill fault, which probably resulted in an MMU fault that is 
no longer evident in the trap registers.  Then it tried to return to 
userland and took a fault to refill the userland register windows, but 
executing instructions in the trap table resulted in illegal instructions.

I expect this could be due to either bad hardware or incorrect values 
somehow getting into the instruction cache.

If you want to determine for sure whether it's a hardware problem or a 
software problem, you can add a little loop like the one in blast_icache 
to clear out the instruction cache just before the RESTORE instruction.  
Register scheduling may be an issue there, but traps are already disabled 
so the code should be simpler.

Let's see... %g1, %g2, and %g3 should be available.  Something like this:

5:
        /*
         * Set up our return trapframe so we can recover if we trap from
         * here on in.
         */
        wrpr    %g0, 1, %tl                     ! Set up the trap state
        wrpr    %g2, 0, %tpc
        wrpr    %g3, 0, %tnpc
        wrpr    %g1, %g0, %tstate
>
>       sethi   %hi(icache_size), %g1
>       ld      [%g1 + %lo(icache_size)], %g1
>       sethi   %hi(icache_line_size), %g2
>       ld      [%g2 + %lo(icache_line_size)], %g2
>       sub     %g1, %g2, %g1
>7:
>       stxa    %g0, [%g1] ASI_ICACHE_TAG
>       brnz,pt %g1, 7b
>        sub    %g1, %g2, %g1
>       sethi   %hi(KERNBASE), %g1
>       flush   %g1
>       membar  #Sync
>
        restore
6:
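
To sanity-check the loop bounds, here is a small Python model of the 
flush loop above, including the non-annulled delay slot after BRNZ.  The 
cache geometry (16 KB, 32-byte lines) is only an example value, not read 
from the machine, and the function name is hypothetical.

```python
# Model of the tag-clearing loop: store, test %g1, then the delay-slot SUB
# always runs (the branch is not annulled), whether or not it is taken.
def flushed_offsets(icache_size, line_size):
    offsets = []
    g1 = icache_size - line_size      # sub %g1, %g2, %g1
    while True:
        offsets.append(g1)            # stxa %g0, [%g1] ASI_ICACHE_TAG
        taken = (g1 != 0)             # brnz,pt %g1, 7b (tests the old %g1)
        g1 -= line_size               # delay slot: sub %g1, %g2, %g1
        if not taken:
            return offsets

if __name__ == "__main__":
    # ASSUMPTION: example geometry, 16 KB cache with 32-byte lines.
    offs = flushed_offsets(16 * 1024, 32)
    print(len(offs), hex(offs[0]), hex(offs[-1]))
```

In this model each line offset is visited once, from the last line down 
to offset 0, provided icache_size is a multiple of icache_line_size.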

Hm.  blast_icache* could be optimized by restructuring the loop and using 
SUBCC instead of BRNZ.  Well, a couple of cycles here or there won't 
matter.

Eduardo
