Subject: Re: Is 1.6 NFS buggy?
To: None <kpneal@pobox.com>
From: Artem Belevich <art@riverstonenet.com>
List: current-users
Date: 12/02/2002 10:20:34
I've been seeing this problem for quite a while. It looks like a race
condition, but the cause is still unknown.
I've hacked up a workaround that checks for this pathological
condition and, if it finds it, complains but keeps the system running.
See kern/17107 for the patch.
http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=17107
It works on i386. I get messages once in a while but so far it
seems to be working well enough.
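
For illustration only (this is not the code from kern/17107, which is not
reproduced here), a "complain but keep running" check could look roughly
like the sketch below. It assumes the bad pointer shows up as the
0xdeadbeefdeadbeef fill pattern seen in a0 of the quoted trap, or as an
address that is not aligned for a 32-bit load. POISON_PATTERN,
ptr_looks_bogus(), reclaim_guarded() and struct mount_like are hypothetical
names made up for this example; none of them are kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: fill pattern matching the a0 value in the quoted trap. */
#define POISON_PATTERN UINT64_C(0xdeadbeefdeadbeef)

/* Hypothetical stand-in for the mount structure; only a flags word. */
struct mount_like {
	int flags;
};

/*
 * Return true if p is NULL, equals the poison pattern (truncated to
 * pointer width on 32-bit machines), or is misaligned for a 32-bit load.
 */
static bool
ptr_looks_bogus(const void *p)
{
	uintptr_t v = (uintptr_t)p;

	if (v == 0)
		return true;
	if (v == (uintptr_t)POISON_PATTERN)
		return true;
	return (v & (sizeof(uint32_t) - 1)) != 0;
}

/*
 * Complain and skip the work when the pointer looks bogus instead of
 * dereferencing it. Returns 1 if the pointer was usable, 0 if skipped.
 */
static int
reclaim_guarded(const struct mount_like *mp)
{
	if (ptr_looks_bogus(mp)) {
		fprintf(stderr, "reclaim: bogus mount pointer %p, skipping\n",
		    (const void *)mp);
		return 0;
	}
	/* Real code would go on to read mp->flags here. */
	return 1;
}
```

A check like this only hides the symptom: it avoids the trap, but whatever
left the stale pointer behind is still there.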
--Artem
On Fri, Nov 29, 2002 at 11:15:52PM -0500, kpneal@pobox.com wrote:
> I've got a crash on my Alpha running 1.6. The crash is in NFS and I'm
> wondering if anyone else has seen it.
>
> Hand-written traceback:
>
> nfs_reclaim+0x80
> vclean+0x258
> vgonel+0x70
> getnewvnode+0x310
> ffs_vget+0x8c
> vfs_lookup+0x1028
> lookup+0x4bc
> namei+0x4c8
> sys__lstat13+0x58
> syscall_plain+0x154
> syscall 280
>
> The offending line of code:
>
> /usr/src/sys16/nfs/nfs_node.c:285
> 93c: 00 00 6a a0 ldl t2,0(s1)
> 940: 83 16 61 48 srl t2,0x8,t2 ********* Blamo!
> 944: 18 00 60 e0 blbc t2,9a8 <nfs_reclaim+0xe8>
> 948: 88 00 49 a4 ldq t1,136(s0)
> 94c: 16 00 40 e4 beq t1,9a8 <nfs_reclaim+0xe8>
> /usr/src/sys16/nfs/nfs_node.c:286
>
> /*
> * For nqnfs, take it off the timer queue as required.
> */
> ---> if ((nmp->nm_flag & NFSMNT_NQNFS) && np->n_timer.cqe_next != 0) {
> CIRCLEQ_REMOVE(&nmp->nm_timerhead, np, n_timer);
> }
>
>
> The result is this nastiness:
>
> CPU 0: fatal kernel trap:
>
> CPU 0 trap entry = 0x4 (unaligned access fault)
> CPU 0 a0 = 0xdeadbeefdeadbeef
> CPU 0 a1 = 0x28
> CPU 0 a2 = 0x3
> CPU 0 pc = 0xfffffc0000351e20
> CPU 0 ra = 0xfffffc0000448f38
> CPU 0 pv = 0xfffffc0000351da0
> CPU 0 curproc = 0xfffffc0000a37448
> CPU 0 pid = 17495, comm = find
>
> panic: trap
> tlp0: receive ring overrun
> tlp1: receive ring overrun
> syncing disks... panic: lockmgr: locking against myself
>
> dumping to dev 8,9 offset 298595
> dump 48 47 46 45 44 43 42 41 40 39 38 37 36 35
> unexpected machine check:
>
> mces = 0x1
> vector = 0x670
> param = 0xfffffc0000006048
> pc = 0xfffffc0000307be8
> ra = 0xfffffc0000307bd4
> code = 0x100000084
> curproc = 0xfffffc0000a37448
> pid = 17495, comm = find
>
> panic: machine check
>
> dumping to dev 8,9 offset 298595
> dump device not ready
>
> --
> Kevin P. Neal http://www.pobox.com/~kpn/
>
> "It sounded pretty good, but it's hard to tell how it will work out
> in practice." -- Dennis Ritchie, ~1977, "Summary of a DEC 32-bit machine"
>