Port-sparc64 archive


re: UltraSPARC III... Stability issue ?

> Most likely it's faulting because physical address 41818400 does not point 
> to DRAM.

hmmm, good point.  that's about 1GB-range.  let's see what the
prom shows:

f003f450: /memory

available               00000001 ffec6000 00000000 00014000   ......`.......@.
            0010:       00000001 ffea0000 00000000 00006000   ..............`.
            0020:       00000001 ffc00000 00000000 00280000   .............(..
            0030:       00000001 fec00000 00000000 003fe000   .............?..
            0040:       00000001 fe008000 00000000 007f8000   ................
            0050:       00000000 00000000 00000001 fe000000   ................

hmmm, so the vast majority of the 8GB of ram is present in 
the 0 to almost-8GB range, which 41818400 falls into.  so this
should be physical memory.  (i can confirm it works later if/when
it fails with the current workload.)
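
The range check above can be sketched in C. This is only an illustration: it assumes each entry in the "available" dump is two 32-bit cells of base address followed by two 32-bit cells of size, and the names mem_range, prom_available and pa_is_available are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the PROM "available" property, as read from the dump:
 * (base-hi, base-lo, size-hi, size-lo) folded into two 64-bit values. */
struct mem_range {
    uint64_t base;
    uint64_t size;
};

/* Values copied from the /memory "available" dump in this message. */
static const struct mem_range prom_available[] = {
    { 0x1ffec6000ULL, 0x000014000ULL },
    { 0x1ffea0000ULL, 0x000006000ULL },
    { 0x1ffc00000ULL, 0x000280000ULL },
    { 0x1fec00000ULL, 0x0003fe000ULL },
    { 0x1fe008000ULL, 0x0007f8000ULL },
    { 0x000000000ULL, 0x1fe000000ULL },
};

/* Return true if pa lies inside any range the PROM reports as available. */
static bool pa_is_available(uint64_t pa)
{
    size_t n = sizeof(prom_available) / sizeof(prom_available[0]);

    for (size_t i = 0; i < n; i++) {
        const struct mem_range *r = &prom_available[i];

        /* Written as a subtraction so base + size cannot overflow. */
        if (pa >= r->base && pa - r->base < r->size)
            return true;
    }
    return false;
}
```

With these values, pa_is_available(0x41818400ULL) is true, because the address falls inside the last entry (base 0, size 0x1fe000000).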

> > eeh, we've basically not touched data_miss etc since your original
> > code... any ideas what would be causing this?
> The page tables are corrupted.  

this was my guess too.

> Inside of data_miss after you remove all of the extraneous debug and 
> accounting code:
> The code does something like this:
>       // Get the page table root into %g4
>       %g4 = ctxbusy[context(fault_address)];
>       // Get the first level page directory page
>       %g4 = %g4[seg_table_offset(fault_address)];
>       // Get the second level page table page
>       %g5 = %g4 + page_directory_offset(fault_address);
>       %g4 = *%g5;
>       // Get the PTE from the page table
>       %g6 = %g4 + page_table_offset(fault_address);
>       %g4 = *%g6;  <-- fault here.
> It's faulting on the last operation, which means the second level page 
> directory has at least one corrupted entry.
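
The walk quoted above can be modelled in plain C to show why a bad directory entry only shows up at the final load. This is a rough sketch, not the real data_miss handler: it uses a small simulated physical memory (sim_ram), assumes 8 KB pages with 10 index bits per level (shifts of 13, 23 and 33), and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 8 KB pages, 10 index bits per level. */
#define PTSHIFT  13
#define PDSHIFT  23
#define STSHIFT  33
#define IDX_MASK 0x3ffULL

/* Simulated physical memory: four 8 KB pages. */
#define SIM_RAM_BYTES (4 * 8192)
static uint64_t sim_ram[SIM_RAM_BYTES / 8];

enum walk_result {
    WALK_OK = 0,
    WALK_UNMAPPED,        /* a table entry was zero */
    WALK_FAULT_DIR_LOAD,  /* load from the segment table faulted */
    WALK_FAULT_TBL_LOAD,  /* load from the page directory faulted */
    WALK_FAULT_PTE_LOAD   /* load from the page table faulted */
};

/* 64-bit load by physical address; false stands in for a fault. */
static bool load_phys(uint64_t pa, uint64_t *out)
{
    if ((pa & 7) != 0 || pa >= SIM_RAM_BYTES)
        return false;
    *out = sim_ram[pa / 8];
    return true;
}

/* Three-level walk: segment table -> page directory -> page table. */
static enum walk_result walk(uint64_t segtab_pa, uint64_t va, uint64_t *pte)
{
    uint64_t dir_pa, tbl_pa;

    if (!load_phys(segtab_pa + ((va >> STSHIFT) & IDX_MASK) * 8, &dir_pa))
        return WALK_FAULT_DIR_LOAD;
    if (dir_pa == 0)
        return WALK_UNMAPPED;

    if (!load_phys(dir_pa + ((va >> PDSHIFT) & IDX_MASK) * 8, &tbl_pa))
        return WALK_FAULT_TBL_LOAD;
    if (tbl_pa == 0)
        return WALK_UNMAPPED;

    /* A garbage tbl_pa is only detected here, at the PTE load. */
    if (!load_phys(tbl_pa + ((va >> PTSHIFT) & IDX_MASK) * 8, pte))
        return WALK_FAULT_PTE_LOAD;
    return WALK_OK;
}
```

In this model, overwriting a page directory entry with a value that is not a valid page address leaves the first two loads working and makes the walk fail with WALK_FAULT_PTE_LOAD, which matches the "fault here" line in the pseudocode.
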
> Possibility #1: Something is broken in pseg_set().  
>       (Oh, wonderful.  Someone wrapped the pseg() routines in a mutex 
>       even though they use compare and swap instructions to be SMP 
>       safe.)

that would be me -- they don't work without it.  i forget the
failure modes, but the system was unstable without this mutex.
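
For reference, the general compare-and-swap pattern being discussed looks roughly like the sketch below. It uses C11 atomics and is not the actual pseg_set() code; dir_slot_t and slot_install are hypothetical names, and a zero slot is assumed to mean "nothing installed yet".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical directory slot: 0 means no page table page installed. */
typedef _Atomic uint64_t dir_slot_t;

/*
 * Try to install new_page_pa into an empty slot with one compare-and-swap.
 * Returns the value the slot holds afterwards.  *installed tells the
 * caller whether its page was the one installed, so that the loser of a
 * race knows its spare page is still unused.
 */
static uint64_t slot_install(dir_slot_t *slot, uint64_t new_page_pa,
                             bool *installed)
{
    uint64_t expected = 0;

    /* On failure, expected is updated to the value already in the slot. */
    *installed = atomic_compare_exchange_strong(slot, &expected, new_page_pa);
    return *installed ? new_page_pa : expected;
}
```

The swap only protects the single slot update. Anything the caller does around it, such as allocating or releasing the spare page, still needs its own protection.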

> Possibility #2: One of the pages given to pseg_set() still has an active 
>       reference somewhere else and is still in use.
> Possibility #3: There's an error in the calculations that figure out the 
>       extent of physical memory and UVM has two vm_page structures 
>       managing the same physical address.
> Possibility #4: There's an error in the initialization code and one of the 
>       pages that should have been reserved early on for boot time page 
>       table entries has also been given to UVM to hand out.
> My guess would be #4, but we don't have enough info to be sure.

#5 - something corrupted memory used for page tables.  which means
it could be almost anything..

#4 seems pretty unlikely to me -- i would expect this to trigger
a problem much faster than it does.

#1 and #3 seem unlikely as well.

#2 and #5 seem most likely to me.

> Boot a DEBUG kernel with the -V or -D options to get pmap_bootstrap() to 
> tell you how it's handling all the different physical memory segments.  
> Then you somehow need to either drop into DDB (or OBP) and walk the page 
> table that generated the fault to figure out how it got corrupted, or get 
> hold of the contents of %g5, which was used to load the page table 
> base address and find out what pmap_bootstrap() did with that page.

this problem is proving hard to reproduce.  the first (and only) time
i've seen it was in the earlier post; i've had the box running busy for
most of 2 days without another failure.

and even more unfortunately, DDB doesn't work here.  it never enters
the command loop, but instead hangs hard.

if that keeps up, i'll try creating some sort of ddb-lwp that we switch
to ... instead of keeping on the current lwp, and see if that allows
ddb to be entered properly.

