Subject: Re: Problems trying to debug pkgsrc/mail/milter-greylist
To: Chris Ross <cross+netbsd@distal.com>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: port-sparc64
Date: 11/21/2007 14:13:33

On Wed, Nov 21, 2007 at 10:17:38AM -0500, Chris Ross wrote:
>
> [ Martin Husemann suggested I contact you after I sent this message
> to the port-sparc64 list.
>   Let me know anything you can help with.  I'm just trying to even
> figure out how to see this
>   work at all. -]
>
>
>   Hi there.  I have a sparc64 running 4.0_RC3, and I built
> pkgsrc/mail/milter-greylist.  I notice that it sometimes just dies.  I
> upgraded the pkgsrc tree to the current release (4.0, vs the 3.0
> that's in pkgsrc currently), but it seems to fail in about the same way.
>
>   I was seeing multiple problems.  Right now, however, I'm wondering
> if something's wrong somewhere related to threading.  All of the
> errors seem to have a backtrace that ends:
>
> #7  0x00000000405137cc in pthread_join () from /usr/lib/libpthread.so.0
> #8  0x0000000040ba7fc0 in _lwp_makecontext () from /usr/lib/libc.so.12
> #9  0x0000000040ba7fc0 in _lwp_makecontext () from /usr/lib/libc.so.12
> Previous frame identical to this frame (corrupt stack?)
> (gdb)

I think that backtrace is actually ok. I think it's indicative of a
problem in pthread_join().

What are the other parts of the trace showing?

>   The exact cause for the crash varies, but.  I'm not an expert on
> using gdb to debug threaded programs by any means, but was wondering:
>
>   1) The resolver in NetBSD 4 is BIND 9, so definitely thread-safe,
> right?

Yes. However...

>   2) Are gdb or libpthreads on sparc64 known to have any problems?

s/ on sparc64//

Yes.

When we imported the most-recent gdb, the threading support never got
added. So gdb in NetBSD 4.0 (and -current, actually) doesn't cope with
threaded programs. Which is really lame.

Are you running on an SMP system w/ an SMP kernel? libpthread in 4.0 also
has issues with concurrency.

We actually have a branch, wrstuden-fixsa, which is dedicated to fixing
the libpthread and Scheduler Activations issues in 4.0. I think it's
caught up with NetBSD-4.0_RC3. Feel free to try it. There also have been
fixes to gdb on it, and the new one (for i386 at least) actually shows
threads. I'm not sure if that's been pulled over to sparc64 or not.

>   3) Anyone have any good pointers to "how to debug a threaded
> program with gdb" ?

If gdb were working well, there are two main classes of threading
issues to look for. One is locking and the other is not locking. :-)

Locking issues usually either lead to live-lock or deadlock. At the app
level, it usually ends up as deadlock. That's where thread 1 locks A and
tries to lock B, and thread 2 locks B and tries to lock A. Each waits for
the other, which is waiting for it, so we wait forever.
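
As a rough illustration of the A/B case (a sketch only, not code from
milter-greylist or libpthread; lock_a, lock_b, worker and
run_ordered_workers are made-up names), the usual cure is to make every
thread take the locks in the same order. If one of the two threads below
took lock_b first and then lock_a, it could hang exactly as described:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical locks "A" and "B" from the description above. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int shared_total = 0;

/*
 * Every thread takes A first, then B. Because no thread ever holds B
 * while waiting for A, the "each waits for the other" cycle cannot form.
 */
static void *worker(void *arg)
{
    int n = *(int *)arg;

    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);
        shared_total++;
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
    }
    return NULL;
}

/* Runs two workers and returns the final total (2 * iterations). */
int run_ordered_workers(int iterations)
{
    pthread_t t1, t2;
    int rc;

    shared_total = 0;
    rc = pthread_create(&t1, NULL, worker, &iterations);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, worker, &iterations);
    assert(rc == 0);
    (void)rc;

    /* iterations stays valid because we join before returning. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_total;
}
```

Calling run_ordered_workers(1000) from main() and linking with -lpthread
should finish promptly rather than hang.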

Live-locking is the same thing but with spinlocks. An application should
never actually see that, since the only spinlocks are actually in
libpthread. i.e. a spinlock issue really is a libpthread issue.

Not-locking issues are data corruption - i.e. something stomped on the
data we were working on.
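
To make the not-locking case concrete, here is a small sketch (again my
own illustration with made-up names: struct record, rec_lock,
count_torn_reads). A writer keeps two fields equal and a reader checks
that they still match. With the mutex held around both the update and
the check, the reader should never see a half-written record; drop the
locking and the reader may catch the record mid-update, which is the
"stomped on" symptom:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared record; the invariant is a == b. */
struct record {
    long a;
    long b;
};

static struct record rec = { 0, 0 };
static pthread_mutex_t rec_lock = PTHREAD_MUTEX_INITIALIZER;
static long torn_reads = 0;   /* written only by the reader thread */

/* Updates both fields as one unit while holding the lock. */
static void *writer(void *arg)
{
    long n = *(long *)arg;

    for (long i = 1; i <= n; i++) {
        pthread_mutex_lock(&rec_lock);
        rec.a = i;
        rec.b = i;
        pthread_mutex_unlock(&rec_lock);
    }
    return NULL;
}

/* Counts how often the record is seen with a != b. */
static void *reader(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&rec_lock);
        if (rec.a != rec.b)
            torn_reads++;
        pthread_mutex_unlock(&rec_lock);
    }
    return NULL;
}

/* Runs one writer and one reader; returns the number of torn reads. */
long count_torn_reads(long iterations)
{
    pthread_t w, r;
    int rc;

    rec.a = rec.b = 0;
    torn_reads = 0;
    rc = pthread_create(&w, NULL, writer, &iterations);
    assert(rc == 0);
    rc = pthread_create(&r, NULL, reader, &iterations);
    assert(rc == 0);
    (void)rc;

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return torn_reads;
}
```

With the locking in place count_torn_reads() should come back as 0. A
non-zero count after removing the lock/unlock calls would point at the
application, not at libpthread.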

If all the crashes are in pthread_join(), then it's probably some form of
libpthread problem.

>   milter-greylist doesn't have many threads running ever.  It just
> spawns off new threads for synchronization (which I'm not using), and
> dumping of data.  One for reading the config file too, I think, but I
> suspect that's not happening repeatedly.
>
>   Anyway.  Thanks...

Please let me know more about the backtraces. I actually fixed a number of
concurrency issues in libpthread-SA recently.

Take care,

Bill
