port-sparc64: Re: bus error with leafnode

Subject: Re: bus error with leafnode
To: Charles Shannon Hendrix <shannon@widomaker.com>
From: Eduardo Horvath <eeh@NetBSD.org>
List: port-sparc64
Date: 01/03/2007 16:42:40
On Tue, 2 Jan 2007, Charles Shannon Hendrix wrote:

> What is a general cause, if any, for bus errors in NetBSD sparc64 when a
> process is started from inetd?

Bus errors on sparc64 are caused by a memory access of the wrong
type (accessing memory with an uncached ASI or I/O space with a cached
ASI), a hardware problem with a device (e.g. a PCI timeout), or an
unaligned memory access.  Accessing an invalid address should cause a
segmentation violation instead.  The first two causes will only happen
inside the kernel or when a userland process directly maps in device
hardware (such as the X server).

You are most likely suffering from an unaligned access.  While this can
be caused by corruption of the saved machine state, such as is caused
by the pthreads bugs, it is most likely caused by buggy software that
either assumes the machine can do unaligned accesses in hardware or
confuses ints, longs, and pointers.
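
As a rough illustration of the unaligned-access case, here is a minimal
sketch in C.  The function names load_u32 and is_aligned_u32 are made up
for this example and do not come from leafnode.  The buggy pattern is
shown only in a comment; the functions show one common way to avoid it,
which is to copy the bytes out with memcpy() instead of dereferencing a
cast pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Buggy pattern (do not use): dereferencing a cast pointer such as
 *     uint32_t v = *(const uint32_t *)(buf + 1);
 * asks for a 4-byte load from an address that is not 4-byte aligned.
 * x86 quietly performs it; sparc64 delivers SIGBUS.
 */

/* Read a 32-bit value from any address, aligned or not.
 * (Hypothetical helper name, for illustration only.) */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);    /* byte-wise copy, no alignment needed */
    return v;
}

/* Return nonzero if p is suitably aligned for a direct uint32_t access. */
int is_aligned_u32(const void *p)
{
    return ((uintptr_t)p % _Alignof(uint32_t)) == 0;
}
```

A check like is_aligned_u32() can also be dropped into suspect code
temporarily to catch a bad pointer before the hardware does.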

> Anything to shorten the debugging time would be appreciated.

You need to use a debugger to analyze the problem.  Building the
binary with debug enabled (-g flag) may help, but is probably not
necessary.  You do need to either attach a debugger to the running
binary (I don't know if you can do that if it crashes too quickly)
or get it to generate a coredump.

Coredumps are only generated if the coredumpsize limit is large
enough and the process is not SUID.  Limits are inherited from the
parent process, and if inetd is started by the standard init
scripts it probably has its coredumpsize limit set to zero.  Also, 
inetd often changes the userid of the process it spawns.  Both of
these situations will prevent proper coredumps from being generated.
What you will probably have to do is have inetd invoke a shell
script that unlimits the coredumpsize and then executes your binary.
This should hopefully solve both of the above problems.
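
In a shell script the limit is normally raised with "ulimit -c
unlimited" (sh) or "unlimit coredumpsize" (csh) before the exec.  The
same idea can be sketched as a small C wrapper instead of a script;
this is only an illustration, and the names raise_core_limit and
run_with_cores are invented for it.  It assumes the hard limit
inherited from inetd is nonzero, since an unprivileged process cannot
raise its soft limit past the hard limit.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Raise the soft core-size limit as far as the hard limit allows.
 * Returns 0 on success, -1 on failure.  (Hypothetical helper.) */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Raise the limit, then replace this process with the real server.
 * Only returns if the exec fails.  (Hypothetical helper.) */
int run_with_cores(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);
    if (raise_core_limit() != 0)
        perror("setrlimit");    /* keep going; a core is just less likely */
    execv(path, argv);
    perror("execv");
    return 127;
}
```

A wrapper's main() would call run_with_cores(argv[1], &argv[1]), and
the inetd.conf entry would name the wrapper with the real binary as its
first argument.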

Once you have a coredump, you can use a debugger on it to find
out where and why it's issuing the unaligned access and then make
the proper corrective changes to the sources.

In general you will find that someone used a questionable cast
to convert a pointer from one type to another, or that the
code is passing a reference to an int where it should be passing
a reference to a long.

This sort of bug does not manifest itself on x86 because those
processors do not trap unaligned accesses but issue multiple
memory operations to transfer the data.  Being little-endian,
they also get away with casting longs or pointers to ints without
doing proper data conversion, because the low-order bytes come
first in memory, and that encourages buggy code.
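
The int/long case can be sketched in C as follows.  Everything here is
hypothetical: get_count stands in for any function that stores its
result through a long pointer, and the other two names are invented for
the illustration.  On an LP64 big-endian machine such as sparc64, the
first four bytes of a long holding a small value are the high-order
(zero) bytes, so reading it through an int pointer yields 0 rather than
the value.  Passing the address of an int to such a function is worse,
since the callee stores eight bytes into a four-byte object that may
also be only 4-byte aligned.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical API that returns its result through a long pointer. */
void get_count(long *out)
{
    assert(out != NULL);
    *out = 42;
}

/* Correct caller: hand over a real long, then narrow by value.
 * The buggy version would be:  int n; get_count((long *)&n);  */
int count_as_int(void)
{
    long tmp;

    get_count(&tmp);
    return (int)tmp;
}

/* Shows what *(int *)&value would read: the first sizeof(int) bytes
 * of the long.  memcpy is used so the demonstration itself is safe. */
int first_int_bytes(long value)
{
    int first;

    memcpy(&first, &value, sizeof first);
    return first;
}
```

On little-endian x86, first_int_bytes(42) happens to give back 42,
which is why the bug goes unnoticed there.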

Eduardo