Subject: Re: Panic killing a process
To: Greg 'groggy' Lehey <grog@NetBSD.org>
From: Julio M. Merino Vidal <jmmv@menta.net>
List: current-users
Date: 01/08/2005 11:13:21
On Sat, 8 Jan 2005 12:05:32 +1030
Greg 'groggy' Lehey <grog@NetBSD.org> wrote:

> You could start by looking at what's going on in frame 9.  Look at the
> local variables.  On the face of it I'd guess a null pointer
> dereference, but you should have messages from trap() telling you what
> happens.

Yeah, well... the problem is that I shouldn't spend a lot of time in front
of the computer these days...

Anyway, here is what I got:

The kernel fails due to (thanks to Pavel Cahyna for telling me about
dmesg -M):
uvm_fault(0xca728460, 0x97c14000, 0, 2) -> 0xe
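
The trailing 0xe looks like an errno value to me.  Assuming uvm_fault()
really returns a plain errno code here (an assumption on my part, I have
not checked trap.c), a few lines of Python are enough to decode it:

```python
import errno
import os

# Return value printed in the kernel message: uvm_fault(...) -> 0xe
fault_result = 0xe

# Assumption: the value is a standard errno code.
name = errno.errorcode[fault_result]
print(fault_result, name, os.strerror(fault_result))
```

If that assumption holds, the code is EFAULT ("bad address"), which would
fit an access through a bad pointer.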

Then, in frame 9, the offending function is fdfree in
kern_descrip.c.  It fails at line 1284, on the call to knote_fdclose:

(gdb) frame 9
#9  0xc021318d in fdfree (p=0xca72ce5c)
    at /usr/src/sys/kern/kern_descrip.c:1284
1284                                    knote_fdclose(p, fdp->fd_lastfile - i);

At that point, frame 8 is already a trap, so knote_fdclose is not
even reached, right?  Therefore the problem has to be in one of its
arguments, that is, in the access to fdp.  Is that correct?

However, line 1283 "if (i < fdp->fd_knlistsize)" also accesses that
structure and it doesn't crash (though maybe due to different offsets,
if the pointer is really wrong).
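
The "different offsets" idea can be checked roughly.  The sketch below is
only a stand-in for struct filedesc: it takes the field order from the gdb
dump further down, models every member as a 32-bit word (i386 assumption),
assumes 4 KB pages, and assumes fdp really holds the p->p_fd value that gdb
prints (0xca72458c):

```python
import ctypes

PAGE_SIZE = 4096      # assumption: i386 with 4 KB pages
FDP = 0xca72458c      # p->p_fd as printed by gdb

# Field order copied from the gdb dump of *p->p_fd.
FIELDS = (
    "fd_ofiles", "fd_ofileflags", "fd_nfiles", "fd_himap", "fd_lomap",
    "fd_lastfile", "fd_freefile", "fd_refcnt", "fd_knlistsize",
    "fd_knlist", "fd_knhashmask", "fd_knhash", "fd_slock",
)

class FileDesc(ctypes.Structure):
    # Hypothetical layout: each member is one 32-bit word.
    _fields_ = [(name, ctypes.c_uint32) for name in FIELDS]

def field_address(name):
    """Address of a member, given the base pointer FDP."""
    return FDP + getattr(FileDesc, name).offset

def page_of(addr):
    """Start address of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)

for name in ("fd_knlistsize", "fd_lastfile"):
    addr = field_address(name)
    print(f"{name}: 0x{addr:08x} (page 0x{page_of(addr):08x})")
```

Under those assumptions both members land at 0xca7245ac and 0xca7245a0, on
the same page, so a fault on fd_lastfile right after a successful read of
fd_knlistsize looks unlikely unless fdp is not what p->p_fd says.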

Now, what I find strange is this:

(gdb) p fdp
$25 = <incomplete type>

Yet fdp is a struct filedesc *, which is assigned at the very
beginning of the function (fdp = p->p_fd).  I can access the structure
through p without problems (this is why it looks strange to me, but I
may be missing something obvious):

(gdb) p p->p_fd
$26 = (struct filedesc *) 0xca72458c
(gdb) p *p->p_fd
$27 = {
  fd_ofiles = 0xc0fb1800, 
  fd_ofileflags = 0xc0fb1e40 "", 
  fd_nfiles = 400, 
  fd_himap = 0xca724624, 
  fd_lomap = 0xca724628, 
  fd_lastfile = 256, 
  fd_freefile = 192, 
  fd_refcnt = 0, 
  fd_knlistsize = 256, 
  fd_knlist = 0xc0f9e400, 
  fd_knhashmask = 0, 
  fd_knhash = 0x0, 
  fd_slock = {
    lock_data = -559038737
  }
}
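
The lock_data value is easier to read in hex.  gdb's p/x would show the
same thing; this is just a small sketch of the signed-to-unsigned
conversion, assuming a 32-bit int:

```python
def as_u32(value):
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF

lock_data = -559038737  # fd_slock.lock_data as printed by gdb
print(hex(as_u32(lock_data)))
```

That prints 0xdeadbeef, a pattern commonly used to mark uninitialized or
freed memory.  Whether that is what happened to this lock I cannot tell.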

Now, another thing that looks strange to my inexperienced eyes:

(gdb) p *fp
$4 = {
  f_list = {
    le_next = 0xc101ac00, 
    le_prev = 0x62696c2f
  }, 
  f_flag = 1667594341, 
  f_iflags = 778333231, 
  f_type = 1600547941, 
  f_count = 3405803379, 
  f_msgcount = 3420497392, 
  f_usecount = -874469920, 
  f_cred = 0xcbe0a5c0, 
  f_ops = 0xcbe0a5b0, 
  f_offset = -3755820047912622688, 
  f_data = 0xcbe0a580, 
  f_slock = {
    lock_data = -874470032
  }
}

Is that negative value in f_usecount correct?  According to file.h, that's
"number active users", so it seems wrong.  What about the one in f_offset?
(And in f_slock?)
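
One way to look at these numbers is to stop treating them as counters and
dump them as raw bytes.  The sketch below assumes a little-endian machine
with 32-bit fields (i386) and simply repacks the values gdb printed,
starting at le_prev:

```python
import struct

def as_u32(value):
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF

# le_prev, f_flag, f_iflags, f_type, f_count, in dump order.
words = [0x62696c2f, 1667594341, 778333231, 1600547941, 3405803379]

# Assumption: little-endian 32-bit words, as on i386.
raw = struct.pack("<5I", *words)
text = raw.split(b"\0", 1)[0].decode("ascii")
print(text)

# The "negative" fields, shown as unsigned hex instead.
signed = {"f_usecount": -874469920, "f_slock.lock_data": -874470032}
for name, value in signed.items():
    print(name, hex(as_u32(value)))
```

The first five words spell "/libexec/ld.elf_so", and the negative values
come out as 0xcbe0a5e0 and 0xcbe0a570, in the same range as the f_cred and
f_ops pointers.  So the structure seems to hold a pathname followed by
pointer-like words rather than real counters, which would suggest that fp
points at memory that has been reused for something else.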

> You might also like to take a look at
> http://www.lemis.com/grog/Papers/Debug-tutorial/slides.pdf and
> http://www.lemis.com/grog/Papers/Debug-tutorial/tutorial.pdf.

Thanks for the pointers!  They look very interesting; will see if I have
enough time to read them ;)

Cheers

-- 
Julio M. Merino Vidal <jmmv@menta.net>
http://www.livejournal.com/users/jmmv/
The NetBSD Project - http://www.NetBSD.org/