Subject: Re: Repeatable crash
To: Andrew Doran <ad@netbsd.org>
From: Paul Goyette <paul@whooppee.com>
List: current-users
Date: 02/04/2008 10:12:49

On Mon, 4 Feb 2008, Paul Goyette wrote:
> On Mon, 4 Feb 2008, Andrew Doran wrote:
>
>> On Mon, Feb 04, 2008 at 03:40:02AM -0800, Paul Goyette wrote:
>>
>>> This is from a 4.99.49 kernel and userland built from sources dated
>>> 2008-01-24 21:14:50 UTC
>>>
>>> Usually, I run build.sh on the machine that actually contains the
>>> source, but this time I ran it on another host. The entire /usr/src and
>>> /usr/obj directories were NFS-mounted. The crash happens about 30 or 40
>>> minutes after starting build.sh and it happens at different places, so I
>>> don't think it's data-specific; rather I suspect some strange race
> >>> condition. The back-traces don't seem terribly useful (maybe gdb is
>>> out-of-sync with the trap stack-frame again?).
>>
>> The trap frame layout changed, I think dsl@netbsd.org is looking at it.
>> Did you get anything out of ddb?
>
> Transcribed by hand since I don't have a serial console:
>
> kernel: protection fault trap, code=0
> Stopped in pid 29318.1 (x86_64--netbsd-g) at
> netbsd:nfs_loadattrcache+0x13c: cmpl %ecx,0x10(%rbx)
> nfs_loadattrcache() at netbsd:nfs_loadattrcache+0x13c
> nfsm_loadattrcache() at netbsd:nfs_loadattrcache+0x70
> nfs_lookup() at netbsd:nfs_lookup+0xdbd
> VOP_LOOKUP() at netbsd:VOP_LOOKUP+0x49
> lookup() at netbsd:lookup+0x345
> namei() at netbsd:namei+0x1a1
> sys_access() at netbsd:sys_access+0x97
> syscall() at netbsd:syscall+0xa9
>
> Unable to enter any commands at the ddb prompt - it appears that the keyboard
> (USB) is dead or interrupts blocked.

Looking at the sources, nfs_loadattrcache+0x13c is here:

(gdb) list * nfs_loadattrcache + 0x13c
0xffffffff8019d44c is in nfs_loadattrcache (/usr/src/sys/nfs/nfs_subs.c:1687).
1682		vap = np->n_vattr;
1683
1684		/*
1685		 * Invalidate access cache if uid, gid, mode or ctime changed.
1686		 */
1687		if (np->n_accstamp != -1 &&
1688		    (gid != vap->va_gid || uid != vap->va_uid || vmode != vap->va_mode
1689		    || timespeccmp(&ctime, &vap->va_ctime, !=)))
1690			np->n_accstamp = -1;
1691

Disassembling this area gives us (offset 0x13c --> +316):

0xffffffff8019d416 <nfs_loadattrcache+262>: mov %r9d,0xa0(%r13)
0xffffffff8019d41d <nfs_loadattrcache+269>: mov %rax,0xa8(%r13)
0xffffffff8019d424 <nfs_loadattrcache+276>: mov 0xc(%r12),%eax
0xffffffff8019d429 <nfs_loadattrcache+281>: mov %eax,%esi
0xffffffff8019d42b <nfs_loadattrcache+283>: bswap %esi
0xffffffff8019d42d <nfs_loadattrcache+285>: mov 0x10(%r12),%ecx
0xffffffff8019d432 <nfs_loadattrcache+290>: bswap %ecx
0xffffffff8019d434 <nfs_loadattrcache+292>: cmpl $0xffffffffffffffff,0xf8(%r13)
0xffffffff8019d43c <nfs_loadattrcache+300>: mov 0x80(%r13),%rbx
0xffffffff8019d443 <nfs_loadattrcache+307>: movzwl 0xffffffffffffffd6(%rbp),%eax
0xffffffff8019d447 <nfs_loadattrcache+311>: je 0xffffffff8019d461 <nfs_loadattrcache+337>
0xffffffff8019d449 <nfs_loadattrcache+313>: cmp %ecx,0x10(%rbx)
0xffffffff8019d44c <nfs_loadattrcache+316>: mov %eax,%edx

So I'm guessing that vap (np->n_vattr, loaded into %rbx from 0x80(%r13))
contained something invalid...
----------------------------------------------------------------------
| Paul Goyette | PGP DSS Key fingerprint: | E-mail addresses: |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul@whooppee.com |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette@juniper.net |
----------------------------------------------------------------------