Subject: Re: kernel: alignment fault trap on sparc
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Eduardo Horvath <eeh@NetBSD.org>
List: tech-kern
Date: 06/07/2004 21:10:13
On Mon, Jun 07, 2004 at 10:38:18PM +0200, Manuel Bouyer wrote:
> On Mon, Jun 07, 2004 at 06:26:38PM +0000, Eduardo Horvath wrote:
> > > > Otherwise, it could be that the instruction in the instruction cache does not
> > > > match the contents in memory,
> > > 
> > > Software bug ? we have had cache issues on sun4c in the past ...
> > 
> > Could be cache coherency issues.
> 
> Yes. But the fault address doesn't seem to be on a cache line boundary.
> However, it's at a function call.
> 
> 
> BTW, when I get
> trap type 0x7: pc=0xf01c4090 npc=0xf01c4094
> pc is the address of the instruction which caused the trap, right ?

Yes, the pc should point to the faulting instruction.

> This is the second instruction of uvmfault_anonget:
> db> x/i 0xf01c408c
> netbsd:uvmfault_anonget:        save            %sp, -0x70, %sp
> db> 
> netbsd:uvmfault_anonget+0x4:    sethi           %hi(0xf02e3000), %l6
> db> 
> netbsd:uvmfault_anonget+0x8:    or              %l6, 0x2c, %g1
> db> 
> netbsd:uvmfault_anonget+0xc:    ld              [%g1 + 0x10c], %g2
> db> 
> netbsd:uvmfault_anonget+0x10:   or              %g0, %i0, %l2
> db> 
> netbsd:uvmfault_anonget+0x14:   add             %g2, 0x1, %g2
> db> 
> netbsd:uvmfault_anonget+0x18:   st              %g2, [%g1 + 0x10c]
> 
> The cache boundary may not be relevant: we jump from uvm_fault(), so
> we could have a cache issue anyway.

I wouldn't worry about cache boundaries at the moment.
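
As a quick sanity check on the addresses in your listing, here is the
arithmetic written out as plain C.  This is only a sketch: the helper names
are made up for illustration and none of it comes from the kernel.  It
assumes trap type 0x7 is the V8 mem_address_not_aligned trap and that every
insn is 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* sethi %hi(value), rd: keeps the upper 22 bits, clears the low 10. */
static uint32_t sethi_hi(uint32_t value)
{
    return value & 0xfffffc00u;
}

/* Byte offset of the insn at pc from the start of its function. */
static uint32_t insn_offset(uint32_t func, uint32_t pc)
{
    return pc - func;
}

/* A word-sized ld/st needs an address that is a multiple of 4. */
static int word_aligned(uint32_t addr)
{
    return (addr & 3u) == 0;
}

/* Effective address used by "ld [%g1 + 0x10c], %g2" in the listing. */
static uint32_t anonget_ld_addr(void)
{
    uint32_t l6 = sethi_hi(0xf02e3000u);  /* sethi %hi(0xf02e3000), %l6 */
    uint32_t g1 = l6 | 0x2cu;             /* or    %l6, 0x2c, %g1       */
    return g1 + 0x10cu;                   /* [%g1 + 0x10c]              */
}
```

With the numbers from your trap, insn_offset(0xf01c408c, 0xf01c4090) is 4,
so the reported pc is the sethi, which does not touch memory.  The ld/st
pair further down uses 0xf02e3138, which is word aligned, so those two
should not fault on alignment unless what the CPU executed differs from
what ddb shows.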

> Or the pc is off by one instruction when the trap occurs,
> and it's the save which causes the trap.
> I hope there's a way to look at the registers content when in ddb.
> 
> What does save %sp, -0x70, %sp do ?

The save insn could cause a trap.  What save does is rotate the register
windows.  (Let's see how well my memory works.  It's been a while since
I dealt w/a V8 machine.)  There is a Window Invalid Mask (WIM) register
which marks which register windows are invalid, and a Current Window
Pointer (CWP) field in the PSR.  The save insn will rotate the register
windows once by decrementing the CWP.  If the bit in the WIM corresponding
to the new window is set, the processor takes a window overflow trap.  The
fault handler will then try to create a clean register window by storing
the contents of the next window at the location pointed to by the stack
pointer (%o6 or %i6, depending on which side of the window you are) and
twiddling the WIM.  It then re-executes the save insn.  So, no, it should
not cause an alignment trap either.
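
If it helps, the window check that save performs can be modelled in a few
lines of C.  This is a rough sketch from memory, not the kernel's trap
code: the struct, the function names and the choice of 8 windows are all
assumptions for illustration, and the model skips the actual register
spill and the CWP change the trap itself makes.

```c
#include <assert.h>
#include <stdint.h>

#define NWINDOWS 8  /* assumption: implementation-dependent on real CPUs */

struct winstate {
    unsigned cwp;   /* current window pointer (a field in the PSR) */
    uint32_t wim;   /* window invalid mask, one bit per window     */
};

/*
 * Model of save: returns 1 if it would take a window overflow trap
 * (cwp left unchanged), 0 if it completes and moves to the new window.
 */
static int do_save(struct winstate *s)
{
    unsigned new_cwp = (s->cwp + NWINDOWS - 1) % NWINDOWS;

    if (s->wim & (1u << new_cwp))
        return 1;               /* target window is marked invalid */
    s->cwp = new_cwp;
    return 0;
}

/*
 * Model of the overflow handler's bookkeeping.  A real handler first
 * stores the 16 window registers (%l0-%l7, %i0-%i7) on the stack;
 * here only the WIM rotation (right by one) is shown.
 */
static void overflow_handler(struct winstate *s)
{
    uint32_t mask = (1u << NWINDOWS) - 1;

    s->wim = ((s->wim >> 1) | (s->wim << (NWINDOWS - 1))) & mask;
}
```

Starting from cwp = 1 with only bit 0 set in the WIM, do_save() reports an
overflow, overflow_handler() moves the invalid bit to window 7, and the
retried do_save() then succeeds with cwp = 0.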

> 
> The call to uvmfault_anonget() is:
> uvmfault_anonget(&ufi, amap, anon)
> 
> > 
> > > > or your CPU is getting old and flakey.  I've seen
> > > > this happen a lot with old machines.
> > > 
> > > I haven't had many problems with sparc yet. And this box started doing this
> > > right after the upgrade, it was solid under 1.6.2.
> > 
> > Could be a coincidence that the hardware broke at around the same time you
> > updated the software.  I've seen that happen on occasion.
> > 
> > In any case, you need a proper crashdump analysis.
> 
> Unfortunately I'm not familiar with assembly, and I couldn't get a dump to disk
> yet.

Yes, that would be a bit of a problem.

See if you can convince someone who knows how the caches on those processors
work to write a bit of code to dump the cache contents so you can compare it
with what's really in memory and see if it is a cache coherence issue.

Eduardo