Port-sparc64 archive


Re: How to get a crash dump with recursive panic?



On Tue, 10 Jun 2014, Darren Reed wrote:

> On 10/06/2014 1:45 AM, Eduardo Horvath wrote:
> > On Mon, 9 Jun 2014, Darren Reed wrote:
> > 
> > > In testing out ipfilter on sparc64, I see a bunch of "Alignment error"
> > > messages like these:
> > > 
> > > Alignment error: pid=24522.1 comm=ipfstat dsfsr=00000000:00800001
> > > dsfar=ffffffff:fea0c252 isfsr=00000000:00808000 pc=10e3b0
> > > Alignment error: pid=22537.1 comm=ipfstat dsfsr=00000000:00800001
> > > dsfar=ffffffff:fea02252 isfsr=00000000:00808000 pc=10e3b0
> > > Alignment error: pid=6845.1 comm=ipfstat dsfsr=00000000:00800001
> > > dsfar=ffffffff:fea02252 isfsr=00000000:00808000 pc=10e3b0
> > > 
> > > Followed by a panic like this:
> > > 
> > > trap type 0x34: cpu 0, pc=109faac npc=109fab0 pstate=0x820006<PRIV,IE>
> > > Skipping crash dump on recursive panic
> > > panic: mem address not aligned
> > > cpu0: Begin traceback...
> > > cpu0: End traceback...
> > > cpu1: shutting down
> > > cpu0: rebooting
> > > 
> > > All that I can do is:
> > > (gdb) x/i 0x109faac
> > >     0x109faac <ipf_fixskip+44>:  ldx  [ %g4 + 0x20 ], %g4
> > > 
> > > Further tips anyone?
> > What's the previous panic look like?  (I wonder if we have an SMP bug in
> > vpanic()...)
> 
> How do I find it?

The "Skipping crash dump on recursive panic" implies there should have 
been a panic before the "panic: mem address not aligned".

vpanic() uses the global variable doing_shutdown to indicate that a panic is 
in progress.  It doesn't look like that variable is protected by a lock, so 
if multiple CPUs are panicking at the same time, vpanic() may get confused 
and treat them all as recursive panics.  Not that it really matters....
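
To illustrate the kind of race I mean, here is a minimal userland sketch.  
It is not the actual vpanic() code; panic_owner, classify_panic() and the 
enum are hypothetical names standing in for doing_shutdown and the check 
around it.  It assumes C11 atomics and only shows how recording the CPU 
that panicked first would let a second CPU be told apart from a truly 
recursive panic:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for doing_shutdown: -1 means no panic in progress,
 * otherwise it holds the number of the CPU that panicked first. */
static atomic_int panic_owner = -1;

enum panic_kind { PANIC_FIRST, PANIC_RECURSIVE, PANIC_CONCURRENT };

/* Classify a panic raised on the given CPU (illustrative only). */
static enum panic_kind
classify_panic(int cpu)
{
	int expected = -1;

	assert(cpu >= 0);

	/* Atomically claim the panic if nobody owns it yet. */
	if (atomic_compare_exchange_strong(&panic_owner, &expected, cpu))
		return PANIC_FIRST;

	/* On failure, expected holds the CPU that got there first. */
	return (expected == cpu) ? PANIC_RECURSIVE : PANIC_CONCURRENT;
}
```

With a plain unlocked flag, the second CPU in this scenario would be 
indistinguishable from a recursive panic on the first one.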

> As this is from the serial console, I'm assuming that if it never
> gets printed on the console then it never gets printed anywhere.
> 
> 
> > 
> > Trap type 0x34 is an alignment trap.  The instruction in question is
> > trying to load an 8-byte integer pointed to by %g4+0x20 into %g4.  You can
> > enable DDB and dump the registers to find the contents of %g4.  That
> > should not be 8-byte aligned.
> > 
> > Beyond that it's a question of debugging the ipfilter code.
> > 
> > That ipfstat is getting unaligned accesses implies some data structure is
> > unaligned.  You can slap gdb on it to find out what, or you can break into
> > DDB and set the TDB_STOPSIG bit in trapdebug to have the kernel break into
> > DDB on each unaligned access and debug it from there.
> 
> Yes - are those messages from the user space code running or kernel space?

Those messages are printed by the kernel if DEBUG is defined when a 
userland process generates an unaligned access exception.  What usually 
happens after that is a SIGBUS is posted to the process and the process 
dies.  If the process has the MDP_FIXALIGN flag set, the kernel will 
attempt to emulate the instruction instead.  (I don't think there's any 
code that actually sets that flag, but if it is ever set and you start 
getting lots of unaligned accesses you definitely want to know about it 
'cause emulating instructions will cause major performance degradation.)
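
For reference, the usual way for userland code to avoid these traps is to 
copy a possibly misaligned field into an aligned local variable instead of 
dereferencing a cast pointer.  The following is a sketch only; 
load_u64_unaligned() is a hypothetical helper, not something taken from 
ipfstat or ipfilter:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 64-bit value from an address that may not be 8-byte aligned.
 * Dereferencing (const uint64_t *)p directly can compile to a single
 * ldx on sparc64, which traps on a misaligned address.  memcpy() makes
 * no alignment assumption about p.
 */
static uint64_t
load_u64_unaligned(const void *p)
{
	uint64_t v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));
	return v;
}
```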

Anyway, the simple answer is that all those initial messages are being 
generated because the "ipfstat" process is attempting to issue an 
unaligned memory access in userland.  In this case the contents of the 
DSFSR should give details about the type of misalignment, and the DSFAR 
contains the faulting address.  The address is 0xfffffffffea0c252 in the 
first message and 0xfffffffffea02252 in the other two (or 0xfea0c252 and 
0xfea02252 if you're running in 32-bit mode); both are 16-bit aligned.  
The DSFSR should indicate whether it was a 32-bit or 64-bit load or store; 
it's just a question of looking up the register's bit definitions.  
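
The alignment of the faulting address can be checked mechanically.  This 
is a small sketch with hypothetical helper names (natural_alignment() and 
is_aligned()), using one of the DSFAR values from the messages above:

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that divides addr: isolates the lowest set bit. */
static uint64_t
natural_alignment(uint64_t addr)
{
	assert(addr != 0);
	return addr & (~addr + 1);
}

/* Nonzero if addr is a multiple of size; size must be a power of two. */
static int
is_aligned(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr & (size - 1)) == 0;
}
```

For 0xfffffffffea02252, natural_alignment() gives 2, so is_aligned() 
succeeds for a 2-byte access but fails for 4-byte and 8-byte accesses.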

Does that help?

Eduardo

