Port-sparc64 archive

re: UltraSPARC III... Stability issue ?



> data error type 32 sfsr=808004 sfva=41818400 afsr=10100000000000 
> afva=13900000060 tf=0x2607efed0
> data fault: pc=1010234 addr=41818400 sfsr=0x808004<ASI=0x80,OW>

Dump of assembler code for function data_miss:
   [ ... ]
   0x0000000001010224 <+132>:   ldxa  [ %g5 ] #ASI_PHYS_USE_EC, %g4
   0x0000000001010228 <+136>:   sll  %g6, 3, %g6
   0x000000000101022c <+140>:   brz,pn   %g4, 0x1010278 <data_nfo>
   0x0000000001010230 <+144>:   add  %g6, %g4, %g6
   0x0000000001010234 <+148>:   ldxa  [ %g6 ] #ASI_PHYS_USE_EC, %g4

i'll need to consult the manual(s) to see why this is faulting.
up to this point, the crash has proceeded normally.
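
for anyone following the disassembly above, here is a rough C model of
just those five instructions. it is only a sketch: sim_mem and
sim_phys_load are made-up stand-ins for the physical ldxa loads, and
walk_step is a hypothetical name, not anything in the kernel. reading
%g5 as a table pointer and %g6 as an index is my assumption from the
instruction sequence alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* made-up "physical memory" standing in for ldxa ... #ASI_PHYS_USE_EC */
#define SIM_WORDS 16
uint64_t sim_mem[SIM_WORDS];

uint64_t sim_phys_load(uint64_t paddr)
{
	return sim_mem[(paddr / 8) % SIM_WORDS];
}

/*
 * models data_miss+132 .. data_miss+148.
 * returns 0 when the first load yields 0 (the data_nfo path),
 * otherwise 1 with *entry_addr set to the address used by the
 * second load and *entry set to the value it returned.
 */
int walk_step(uint64_t g5, uint64_t g6, uint64_t *entry_addr, uint64_t *entry)
{
	uint64_t g4;

	assert(entry_addr != NULL && entry != NULL);

	g4 = sim_phys_load(g5);		/* +132: ldxa [%g5], %g4 */
	g6 <<= 3;			/* +136: sll %g6, 3, %g6 */
	g6 += g4;			/* +144: add in the delay slot, runs either way */
	if (g4 == 0)			/* +140: brz,pn %g4, data_nfo */
		return 0;

	*entry_addr = g6;
	*entry = sim_phys_load(g6);	/* +148: ldxa [%g6], %g4 (the faulting pc) */
	return 1;
}
```

the point is only that the load at +148 uses an address built from the
result of the load at +132, so a bad value from the first load (or a
bad physical location behind the second) would show up at pc=1010234.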

> kernel trap 32: data access error

but here, we're getting another fault trying to enter DDB.

> cpu1: data fault: pc=16068c8 rpc=10f5914 addr=ffffffffffff8000

Dump of assembler code for function memcpy:
   [ ... ]
   0x00000000016068bc <+348>:   brz  %l4, 0x1606a40 <memcpy+736>
   0x00000000016068c0 <+352>:   sllx  %l4, 3, %l4
   0x00000000016068c4 <+356>:   mov  0x40, %l3
   0x00000000016068c8 <+360>:   ldx  [ %l0 ], %o0

and

Dump of assembler code for function fill_ddb_regs_from_tf:
   [ ... ]
213             DDB_REGS->db_fr = *(struct frame64 *)(uintptr_t)tf->tf_out[6];
   0x00000000010f5904 <+68>:    ldx  [ %i5 + 0x1b0 ], %g1
   0x00000000010f5908 <+72>:    mov  0xb0, %o2
   0x00000000010f590c <+76>:    ldx  [ %i0 + 0xa0 ], %o1
   0x00000000010f5910 <+80>:    ldx  [ %g1 + 0x3e0 ], %o0
   0x00000000010f5914 <+84>:    call  0x1606760 <memcpy>
   0x00000000010f5918 <+88>:    add  %o0, 0x130, %o0

so this looks like we fault trying to read the faulting lwp's
registers to save them for DDB to access.  oops!
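
as a sanity check on how the pc/rpc values line up with the gdb
output, the arithmetic can be written out as below. the helper names
are hypothetical; the addresses and offsets are the ones quoted in
this mail.

```c
#include <stdint.h>

/* start address of a function, given one "addr <+off>" line from gdb */
uint64_t sym_start(uint64_t addr, uint64_t off)
{
	return addr - off;
}

/* offset of a pc within a function that starts at 'start' */
uint64_t sym_offset(uint64_t pc, uint64_t start)
{
	return pc - start;
}

/*
 * values taken from the trap messages and disassembly:
 *   pc=16068c8   with memcpy at 0x1606760 (from the call instruction)
 *   rpc=10f5914  shown as fill_ddb_regs_from_tf <+84>
 *   pc=1010234   shown as data_miss <+148>
 */
uint64_t memcpy_fault_offset(void)
{
	return sym_offset(0x16068c8, 0x1606760);
}

uint64_t fill_ddb_regs_start(void)
{
	return sym_start(0x10f5914, 84);
}

uint64_t data_miss_start(void)
{
	return sym_start(0x1010234, 148);
}
```

memcpy_fault_offset() comes out as 360, matching the ldx at
memcpy+360, and the rpc is the call at fill_ddb_regs_from_tf+84. the
0xb0 loaded into %o2 before the call is the copy length, 176 bytes.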

> kernel trap 30: data access exception
> Skipping crash dump on recursive panic
> panic: cpu0: ipi_send: couldn't send ipi to UPAID 1 (tried 10000 times)

this most likely happens because cpu1 is busy writing to the slow
console (serial or fb; either takes a relatively long time.)  we
might be able to raise the limit above 10000 to avoid that, or
also have ipi_send notice when a remote CPU is panicking.
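
a minimal sketch of those two ideas, to make them concrete. this is
not the real ipi_send: struct fake_cpu, try_dispatch, ipi_send_sketch
and the result codes are all hypothetical, and the "busy" counter just
simulates a target that does not accept the interrupt for a while.

```c
#include <stdbool.h>
#include <stdint.h>

enum ipi_result { IPI_SENT, IPI_TARGET_PANICKING, IPI_TIMEOUT };

/* hypothetical stand-in for per-cpu state */
struct fake_cpu {
	bool		panicking;	/* set once the cpu has entered panic */
	uint32_t	busy_polls;	/* polls left before it accepts an ipi */
};

/* simulated dispatch attempt: fails while the target is still busy */
static bool try_dispatch(struct fake_cpu *c)
{
	if (c->busy_polls == 0)
		return true;
	c->busy_polls--;
	return false;
}

/*
 * bounded retry loop with a configurable limit, which also gives up
 * early (without treating it as a timeout) if the target is panicking.
 */
enum ipi_result ipi_send_sketch(struct fake_cpu *target, uint32_t max_tries)
{
	for (uint32_t i = 0; i < max_tries; i++) {
		if (target->panicking)
			return IPI_TARGET_PANICKING;
		if (try_dispatch(target))
			return IPI_SENT;
	}
	return IPI_TIMEOUT;
}
```

with this shape the caller can decide what to do in each case, e.g.
only panic on IPI_TIMEOUT and stay quiet when the other cpu is
already on its way down.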

> cpu0: Begin traceback...
> cpu0: End traceback...

i suspect this is because we can't copy the faulting registers..

this is all likely related to the basic failure that triggers
the original fault in data_miss.

eeh, we've basically not touched data_miss etc since your original
code... any ideas what would be causing this?


.mrg.

