kern/44809: strange vclean() crash

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/44809: strange vclean() crash
From: martin%NetBSD.org@localhost
Date: Thu, 31 Mar 2011 07:45:00 +0000 (UTC)

>Number:         44809
>Category:       kern
>Synopsis:       strange vclean() crash
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Mar 31 07:45:00 +0000 2011
>Originator:     Martin Husemann
>Release:        NetBSD 5.99.48
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD after-hours.aprisoft.de 5.99.48 NetBSD 5.99.48 (MODULAR) #36: 
Wed Mar 30 11:57:17 CEST 2011 
martin%after-hours.aprisoft.de@localhost:/usr/src/sys/arch/sparc64/compile/MODULAR
 sparc64
Architecture: sparc64
Machine: sparc64
>Description:

Every now and then this machine crashes, in what looks like always the same
way:

trap type 0x34: cpu 0, pc=15176e4 npc=15176e8 pstate=0x820006<PRIV,IE>
kernel trap 34: mem address not aligned
Stopped in pid 10502.1 (find) at        netbsd:VOP_LOCK+0x64:   jmpl           
%g1 + %g0], %o7
db{0}> bt
vclean(f242b40, 8, 0, 0, 96, 0) at netbsd:vclean+0xa8
getcleanvnode(f242b40, 0, f277400, 15, 15, 18cd400) at 
netbsd:getcleanvnode+0x15c
getnewvnode(1, d6e6030, d5e0830, 114d55f0, d1f0454, 0) at 
netbsd:getnewvnode+0x74
ffs_vget(d6e6030, 3bcd72, 114d5730, 4, 4000, d1f04f0) at netbsd:ffs_vget+0x20
ufs_lookup(0, 2c4, 300, 3fff, 2, 2) at netbsd:ufs_lookup+0x740
VOP_LOOKUP(17eb4a60, 114d5b40, 114d5b68, 179af10, badcafe, 0) at 
netbsd:VOP_LOOKUP+0xac
do_lookup(f277400, 114d5b20, 10, 10, 0, 114d58e8) at netbsd:do_lookup+0x48c
namei(114d5b20, 114d5b98, badcafe, 114d5b20, badcafe, badcafe) at 
netbsd:namei+0x14c
do_sys_stat(0, 0, 114d5c68, badcafe, badcafe, badcafe) at 
netbsd:do_sys_stat+0x38
sys___lstat50(f277400, 114d5dc0, 114d5e00, 4074f6a0, 4093f160, 4093f138) at 
netbsd:sys___lstat50+0x10
syscall_plain(114d5ed0, 114d5f50, 40744a88, 24f, 40744a88, c00) at 
netbsd:syscall_plain+0x138
?(40a02880, 40a028b0, 0, 1, 0, 40a203a0) at 0x1008f58
db{0}> show vnode 0xf242b40
OBJECT 0xf242b40: locked=0, pgops=0x170e708, npages=0, refs=-2147483647

VNODE flags 0x1010<MPSAFE,XLOCK>
mp 0x0 numoutput 0 size 0xffffffffffffffff writesize 0xffffffffffffffff
data 0x10073910 writecount 0 holdcnt 0
tag VT_MFS(3) type VBLK(3) mount 0x0 typedata 0x100a3cd0
v_lock 0xf242c48
  
crash happens here:

netbsd:VOP_LOCK+0x5c:   ldx             [%i0 + 0x98], %g2
netbsd:VOP_LOCK+0x60:   ldx             [%g2 + 0xf8], %g1
netbsd:VOP_LOCK+0x64:   jmpl            [%g1 + %g0], %o7
netbsd:VOP_LOCK+0x68:   add             %fp, 0x7d7, %o0

%i0 is clearly bogus:
i0          0x2000

so we end up with garbage:
g1          0x39d77614b250ef8d
g2          0xe78ee10


In source terms, this is at:
(gdb) list *(VOP_LOCK+0x64)
0x15176e4 is in VOP_LOCK (../../../../kern/vnode_if.c:1103).
1098            a.a_desc = VDESC(vop_lock);
1099            a.a_vp = vp;
1100            a.a_flags = flags;
1101            mpsafe = (vp->v_vflag & VV_MPSAFE);
1102            if (!mpsafe) { KERNEL_LOCK(1, curlwp); }
1103            error = (VCALL(vp, VOFFSET(vop_lock), &a));
1104            if (!mpsafe) { KERNEL_UNLOCK_ONE(curlwp); }
1105            return error;
1106    }

and called from:

(gdb) list *(vclean+0xa8)
0x1502968 is in vclean (../../../../kern/vfs_subr.c:1854).
1849            vp->v_iflag &= ~(VI_TEXT|VI_EXECMAP);
1850            active = (vp->v_usecount & VC_MASK) > 1;
1851    
1852            /* XXXAD should not lock vnode under layer */
1853            mutex_exit(&vp->v_interlock);
1854            VOP_LOCK(vp, LK_EXCLUSIVE);
1855    
1856            /*
1857             * Clean out any cached data associated with the vnode.
1858             * If purging an active vnode, it must be closed and

This are all mounts involved:

/dev/sd0a on / type ffs (log, local)
kernfs on /kern type kernfs (local)
ptyfs on /dev/pts type ptyfs (local)
procfs on /proc type procfs (local)

Any ideas what to examine when it happens next time?

>How-To-Repeat:

No idea, just happens "sometimes" for me.

>Fix:

Prev by Date: Re: kern/44806 (X/console switching is unreliable (regression))
Next by Date: Re: kern/23876 (ioapic(?) causes problems for my fxp0 NIC on amd64.)
Previous by Thread: Re: kern/44806 (X/console switching is unreliable (regression))
Next by Thread: PR/44301 CVS commit: pkgsrc/devel/atf
Indexes:

Home | Main Index | Thread Index | Old Index