current-users: Re: BIND for secondary zone dumps core.

Subject: Re: BIND for secondary zone dumps core.
To: Greywolf <greywolf@starwolf.com>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 07/09/2001 02:39:56
[ On Sunday, July 8, 2001 at 21:43:04 (-0700), Greywolf wrote: ]
> Subject: Re: BIND for secondary zone dumps core.
>
> Core was generated by `named'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/libexec/ld.elf_so...done.
> Reading symbols from /usr/lib/libc.so.12...done.
> #0  0x39a14 in ns_resp (msg=0xefffed68 "\027\037\204\200", msglen=187, from={
>       sin_len = 16 '\020', sin_family = 2 '\002', sin_port = 53, sin_addr = {
>         s_addr = 3499314287}, sin_zero = "\000\000\000\000\000\000\000"},
>     qsp=0x0)
>     at /export/src/usr.sbin/bind/named/../../../dist/bind/bin/named/ns_resp.c:459
> 459                     if (ina_equal(fwd->fwddata->fwdaddr.sin_addr, from.sin_addr))
> (gdb) print *fwd
> $1 = {next = 0x5001085, fwddata = 0x39a14}

Hmmm.... how about "print *(fwd->fwddata)"  (start gdb over again first)

I suspect 0x39a14 is invalid.  It's the same value as the PC in frame
#0 (i.e. an address inside ns_resp() itself), and it's certainly far off
from any of the other pointers you're seeing in the same neighbourhood.

Maybe you can try to manually walk the whole forwarders list and print
all the data elements and see if there's any other obvious corruption.

Start with "print qp->q_fzone".  If that's 0x0 then the start of the
list is at server_options->fwdtab, else it's at qp->q_fzone->z_fwdtab.
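
For anyone following along, here's roughly what that walk amounts to,
written as a small C sketch.  The struct definitions are simplified
stand-ins I've made up for illustration (the "_sketch" names and the
single "addr" field are not BIND's real declarations); only the member
names next, fwddata, q_fzone, z_fwdtab and fwdtab come from the session
above.  In a core file you'd do the same steps by hand with "print",
and a corrupt pointer would show up as a value that doesn't look like
the others.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for illustration only; NOT BIND's real structs. */
struct fwddata_sketch { unsigned long addr; };
struct fwdinfo_sketch {
    struct fwdinfo_sketch *next;     /* next forwarder in the list */
    struct fwddata_sketch *fwddata;  /* per-forwarder data */
};
struct zone_sketch    { struct fwdinfo_sketch *z_fwdtab; };
struct query_sketch   { struct zone_sketch *q_fzone; };
struct options_sketch { struct fwdinfo_sketch *fwdtab; };

/* Pick the list head: the global table unless the query has a forward zone. */
static struct fwdinfo_sketch *
fwd_list_head(const struct query_sketch *qp, const struct options_sketch *opts)
{
    assert(qp != NULL && opts != NULL);
    if (qp->q_fzone == NULL)
        return opts->fwdtab;
    return qp->q_fzone->z_fwdtab;
}

/* Print every element so an out-of-place pointer stands out; return count. */
static int
walk_fwd_list(const struct fwdinfo_sketch *fwd)
{
    int n = 0;

    for (; fwd != NULL; fwd = fwd->next) {
        printf("fwd=%p next=%p fwddata=%p\n", (const void *)fwd,
            (const void *)fwd->next, (const void *)fwd->fwddata);
        if (fwd->fwddata != NULL)
            printf("    addr=%lu\n", fwd->fwddata->addr);
        n++;
    }
    return n;
}
```

Note the sketch only dereferences fwddata after checking for NULL; it
can't detect a non-NULL garbage pointer like the one above, which is
why eyeballing the printed values matters.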

> (gdb) print fwd
> $2 = (struct fwdinfo *) 0x0
> (gdb) 
> 
> What is up with that?!?  I should not be able to reference fwd->next if
> fwd is (struct fwdinfo *) NULL!

I don't know what's up with that.  Obviously gdb was able to deref it
once (i.e. the "print *fwd").  I don't know how it could suddenly go
null.  I have had some pretty strange problems with NetBSD gdb on sparc
though, and I don't really trust it that much....

We could take this offline I guess, though it seems as if there could be
good lessons for other readers.  I'm going to hit the hay soon too, so
maybe someone else will pick up the thread before I awake again....

Whatever the problem is it's got to be pretty deep.  Adding a slave zone
shouldn't affect the forwarders list.  If it's network data causing the
corruption then it's got to be pure luck that the forwarders list is
being hit the same way every time....  something to do with the machine
architecture, your specific config, and whatever data the other
nameserver is sending...

(it is reproducible, right to the line number, isn't it?)

It might also help to turn on debugging (-d 10 or something) when you
start named and watch what it prints just before it dies.  There's an
ns_debug() call right near the top of ns_resp() that should print at
level 2 or above.  Turning on query logging initially (-q) might reveal
something too...

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>     <woods@robohack.ca>
Planix, Inc. <woods@planix.com>;   Secrets of the Weird <woods@weird.com>