re: UltraSPARC III... Stability issue ?

To: matthew green <mrg%eterna.com.au@localhost>
Subject: re: UltraSPARC III... Stability issue ?
From: Eduardo Horvath <eeh%NetBSD.org@localhost>
Date: Mon, 7 Apr 2014 16:32:57 +0000 (UTC)

On Sat, 5 Apr 2014, matthew green wrote:

> > data error type 32 sfsr=808004 sfva=41818400 afsr=10100000000000 
> > afva=13900000060 tf=0x2607efed0
> > data fault: pc=1010234 addr=41818400 sfsr=0x808004<ASI=0x80,OW>
> 
> Dump of assembler code for function data_miss:
>    [ ... ]
>    0x0000000001010224 <+132>:   ldxa  [ %g5 ] #ASI_PHYS_USE_EC, %g4
>    0x0000000001010228 <+136>:   sll  %g6, 3, %g6
>    0x000000000101022c <+140>:   brz,pn   %g4, 0x1010278 <data_nfo>
>    0x0000000001010230 <+144>:   add  %g6, %g4, %g6
>    0x0000000001010234 <+148>:   ldxa  [ %g6 ] #ASI_PHYS_USE_EC, %g4
> 
> i'll need to consult the manual(s) and see why this is faulting.
> up to this point, crashing has occurred normally.

Most likely it's faulting because physical address 41818400 does not point 
to DRAM.

> 
> > kernel trap 32: data access error
> 
> but here, we're getting another fault trying to enter DDB.
> 
> > cpu1: data fault: pc=16068c8 rpc=10f5914 addr=ffffffffffff8000
> 
> Dump of assembler code for function memcpy:
>    [ ... ]
>    0x00000000016068bc <+348>:   brz  %l4, 0x1606a40 <memcpy+736>
>    0x00000000016068c0 <+352>:   sllx  %l4, 3, %l4
>    0x00000000016068c4 <+356>:   mov  0x40, %l3
>    0x00000000016068c8 <+360>:   ldx  [ %l0 ], %o0
> 
> and
> 
> Dump of assembler code for function fill_ddb_regs_from_tf:
>    [ ... ]
> 213             DDB_REGS->db_fr = *(struct frame64 *)(uintptr_t)tf->tf_out[6];
>    0x00000000010f5904 <+68>:    ldx  [ %i5 + 0x1b0 ], %g1
>    0x00000000010f5908 <+72>:    mov  0xb0, %o2
>    0x00000000010f590c <+76>:    ldx  [ %i0 + 0xa0 ], %o1
>    0x00000000010f5910 <+80>:    ldx  [ %g1 + 0x3e0 ], %o0
>    0x00000000010f5914 <+84>:    call  0x1606760 <memcpy>
>    0x00000000010f5918 <+88>:    add  %o0, 0x130, %o0
> 
> so this looks like we fault trying to read the faulting lwp's
> registers to save them for DDB to access.  oops!
> 
> > kernel trap 30: data access exception
> > Skipping crash dump on recursive panic
> > panic: cpu0: ipi_send: couldn't send ipi to UPAID 1 (tried 10000 times)
> 
> this most likely happens because cpu1 is busy writing to the slow
> console (serial or fb, either takes a Long time relatively.)  we
> might be able to extend the limit from 10000 to more to avoid that,
> or also have ipi_send notice when a remote CPU is panicking.
> 
> > cpu0: Begin traceback...
> > cpu0: End traceback...
> 
> i suspect this is because we can't copy the faulting registers..
> 
> this is all likely related to the basic failure that triggers
> the original fault in data_miss.
> 
> eeh, we've basically not touched data_miss etc since your original
> code... any ideas what would be causing this?

The page tables are corrupted.  

With all of the extraneous debug and accounting code removed, data_miss 
does something like this:

        // Get the page table root into %g4
        %g4 = ctxbusy[context(fault_address)];

        // Get the first level page directory page
        %g4 = %g4[seg_table_offset(fault_address)];

        // Get the second level page table page
        %g5 = %g4 + page_directory_offset(fault_address);
        %g4 = *%g5;

        // Get the PTE from the page table
        %g6 = %g4 + page_table_offset(fault_address);
        %g4 = *%g6;  <-- fault here.

It's faulting on the last operation, which means the second level page 
directory has at least one corrupted entry.
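
For anyone who wants to play with the walk outside the kernel, here is a 
rough C model of the steps above.  It is only a sketch: the shift values 
(13/23/33) and the 1024-entry tables are my assumption about the layout, 
and mem[], phys_load() and walk() are hypothetical stand-ins for physical 
memory and the ldxa loads, not the real trap handler.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 8 KB pages, 1024 eight-byte entries per level. */
#define PTSHIFT 13
#define PDSHIFT 23
#define STSHIFT 33
#define IDXMASK 0x3ffULL

/* Toy "physical memory": four 8 KB pages starting at physical address 0. */
#define NPAGES 4
static uint64_t mem[NPAGES * 1024];
#define MEMSIZE ((uint64_t)sizeof(mem))

/*
 * Hypothetical stand-in for a physical load.  Returns 0 and stores the
 * value on success, -1 if pa is misaligned or not backed by memory.
 */
static int phys_load(uint64_t pa, uint64_t *val)
{
    if (pa >= MEMSIZE || (pa & 7) != 0)
        return -1;
    *val = mem[pa / 8];
    return 0;
}

/*
 * Walk the three levels starting from the page table root.
 * Returns 0 if a PTE was loaded, 1 if a NULL pointer was found (no
 * mapping), -1 if a load faulted.  *level records which load (1, 2 or 3)
 * was being attempted last.
 */
static int walk(uint64_t root, uint64_t va, uint64_t *pte, int *level)
{
    uint64_t pdir, ptbl;

    assert(pte != 0 && level != 0);

    *level = 1;     /* segment table -> page directory */
    if (phys_load(root + ((va >> STSHIFT) & IDXMASK) * 8, &pdir) != 0)
        return -1;
    if (pdir == 0)
        return 1;

    *level = 2;     /* page directory -> page table */
    if (phys_load(pdir + ((va >> PDSHIFT) & IDXMASK) * 8, &ptbl) != 0)
        return -1;
    if (ptbl == 0)
        return 1;

    *level = 3;     /* page table -> PTE */
    if (phys_load(ptbl + ((va >> PTSHIFT) & IDXMASK) * 8, pte) != 0)
        return -1;
    return 0;
}
```

If you build a valid set of tables in mem[] and then overwrite one page 
directory entry with an address outside the array, walk() returns -1 with 
*level == 3, which is the same shape of failure as the trap above: the 
pointer is non-zero, so the NULL check passes, but the final load has 
nothing to read.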

Possibility #1: Something is broken in pseg_set().  
        (Oh, wonderful.  Someone wrapped the pseg() routines in a mutex 
        even though they use compare and swap instructions to be SMP 
        safe.)

Possibility #2: One of the pages given to pseg_set() still has an active 
        reference somewhere else and is still in use.

Possibility #3: There's an error in the calculations that figure out the 
        extent of physical memory and UVM has two vm_page structures 
        managing the same physical address.

Possibility #4: There's an error in the initialization code and one of the 
        pages that should have been reserved early on for boot time page 
        table entries has also been given to UVM to hand out.
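
On #1, the idea behind using compare and swap is roughly the following.  
This is a minimal sketch in C11 atomics, not the actual pseg_set() code; 
install_table() and its arguments are hypothetical names made up for 
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Try to install a spare page table page into an empty directory slot.
 * Returns the table address that ends up in the slot.  *used_spare is set
 * to 1 if our spare page was installed, 0 if another CPU filled the slot
 * first (in which case the caller still owns the spare and can reuse it).
 */
static uint64_t install_table(_Atomic uint64_t *slot, uint64_t spare,
    int *used_spare)
{
    uint64_t expected = 0;      /* 0 means "slot is empty" */

    assert(spare != 0);

    if (atomic_compare_exchange_strong(slot, &expected, spare)) {
        *used_spare = 1;
        return spare;
    }

    /* Lost the race: expected now holds the value the winner stored. */
    *used_spare = 0;
    return expected;
}
```

The point is that a slot only ever goes from empty to one value, so two 
CPUs racing to fill it cannot both win, with or without a mutex around 
the call.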

My guess would be #4, but we don't have enough info to be sure.
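
Checking for #3 or #4 by hand comes down to comparing two lists of 
physical ranges: what was reserved at boot and what was handed to UVM.  
A small sketch of that comparison in C, where struct phys_seg, 
seg_overlap() and find_overlap() are hypothetical names and the ranges 
would have to be typed in from the boot output:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One physical memory range: [start, start + size). */
struct phys_seg {
    uint64_t start;
    uint64_t size;
};

/* Non-zero if the two half-open ranges share at least one byte. */
static int seg_overlap(const struct phys_seg *a, const struct phys_seg *b)
{
    return a->start < b->start + b->size &&
        b->start < a->start + a->size;
}

/*
 * Returns the index in avail[] of the first segment that overlaps any
 * reserved segment, or -1 if the two lists are disjoint.
 */
static int find_overlap(const struct phys_seg *avail, size_t navail,
    const struct phys_seg *resv, size_t nresv)
{
    assert(avail != NULL && resv != NULL);

    for (size_t i = 0; i < navail; i++)
        for (size_t j = 0; j < nresv; j++)
            if (seg_overlap(&avail[i], &resv[j]))
                return (int)i;
    return -1;
}
```

Any index other than -1 means a page that was set aside for boot time 
page tables is also on the list of memory available for allocation.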

Boot a DEBUG kernel with the -V or -D options to get pmap_bootstrap() to 
tell you how it's handling all the different physical memory segments.  
Then you somehow need to either drop into DDB (or OBP) and walk the page 
table that generated the fault to figure out how it got corrupted, or get 
hold of the contents of %g5, which was used to load the page table base 
address, and find out what pmap_bootstrap() did with that page.

Eduardo
