current-users: Re: Is 1.6 NFS buggy?

Subject: Re: Is 1.6 NFS buggy?
To: None <kpneal@pobox.com>
From: Artem Belevich <art@riverstonenet.com>
List: current-users
Date: 12/02/2002 10:20:34

I've been having this issue for quite a while. It looks like a race
condition, but what causes it is still unknown.

I've hacked up a workaround that checks for this pathological
condition and, if it finds it, complains but keeps the system running.
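
To give a rough idea of the kind of check I mean, here is a minimal
userland sketch. It is NOT the patch from the PR. The names
mount_stub, NFSMNT_NQNFS_STUB, ptr_looks_sane and reclaim_timer_check
are made up for illustration, and treating 0xdeadbeefdeadbeef as a
poison pattern is an assumption based on the a0 value in the trap
report quoted below.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: fill pattern taken from the a0 value in the trap report. */
#define POISON_PTR        ((uintptr_t)0xdeadbeefdeadbeefULL)
/* Hypothetical stand-in for NFSMNT_NQNFS; the value is made up. */
#define NFSMNT_NQNFS_STUB 0x1

/* Hypothetical stand-in for struct nfsmount; only the flag word matters. */
struct mount_stub {
    int nm_flag;
};

/* Return 1 if p is plausibly usable, 0 if it is NULL, poisoned or misaligned. */
static int
ptr_looks_sane(const void *p)
{
    uintptr_t v = (uintptr_t)p;

    if (v == 0 || v == POISON_PTR)
        return 0;
    if (v % sizeof(int) != 0)   /* the flag load needs int alignment */
        return 0;
    return 1;
}

/*
 * Sketch of the nqnfs step in a reclaim routine.
 * Returns 1 if the node would be taken off the timer queue,
 * 0 if nqnfs is not in use, and -1 if the mount pointer was
 * bogus and the step was skipped with a complaint.
 */
static int
reclaim_timer_check(const void *raw)
{
    const struct mount_stub *nmp;

    if (!ptr_looks_sane(raw)) {
        fprintf(stderr, "nfs_reclaim: bogus nmp %p, skipping timer queue\n", raw);
        return -1;
    }
    nmp = raw;
    return (nmp->nm_flag & NFSMNT_NQNFS_STUB) ? 1 : 0;
}
```

Calling reclaim_timer_check((const void *)POISON_PTR) logs a complaint
and returns -1 instead of dereferencing the pointer. A check like this
only hides the symptom; the underlying race is still there.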

See kern/17107 for the patch.
http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=17107 

It works on i386. I get messages once in a while but so far it
seems to be working well enough.

--Artem

On Fri, Nov 29, 2002 at 11:15:52PM -0500, kpneal@pobox.com wrote:
> I've got a crash on my Alpha running 1.6. The crash is in NFS and I'm
> wondering if anyone else has seen it. 
> 
> Hand-written traceback:
> 
> nfs_reclaim+0x80
> vclean+0x258
> vgonel+0x70
> getnewvnode+0x310
> ffs_vget+0x8c
> vfs_lookup+0x1028
> lookup+0x4bc
> namei+0x4c8
> sys__lstat13+0x58
> syscall_plain+0x154
> syscall 280
> 
> The offending line of code:
> 
> /usr/src/sys16/nfs/nfs_node.c:285
>  93c:   00 00 6a a0     ldl     t2,0(s1)
>  940:   83 16 61 48     srl     t2,0x8,t2        ********* Blamo!
>  944:   18 00 60 e0     blbc    t2,9a8 <nfs_reclaim+0xe8>
>  948:   88 00 49 a4     ldq     t1,136(s0)
>  94c:   16 00 40 e4     beq     t1,9a8 <nfs_reclaim+0xe8>
> /usr/src/sys16/nfs/nfs_node.c:286
> 
>         /*
>          * For nqnfs, take it off the timer queue as required.
>          */
> --->    if ((nmp->nm_flag & NFSMNT_NQNFS) && np->n_timer.cqe_next != 0) {
>                 CIRCLEQ_REMOVE(&nmp->nm_timerhead, np, n_timer);
>         }
> 
> 
> The result is this nastiness:
> 
> CPU 0: fatal kernel trap:
> 
> CPU 0    trap entry = 0x4 (unaligned access fault)
> CPU 0    a0         = 0xdeadbeefdeadbeef
> CPU 0    a1         = 0x28
> CPU 0    a2         = 0x3
> CPU 0    pc         = 0xfffffc0000351e20
> CPU 0    ra         = 0xfffffc0000448f38
> CPU 0    pv         = 0xfffffc0000351da0
> CPU 0    curproc    = 0xfffffc0000a37448
> CPU 0        pid = 17495, comm = find
> 
> panic: trap
> tlp0: receive ring overrun
> tlp1: receive ring overrun
> syncing disks... panic: lockmgr: locking against myself
> 
> dumping to dev 8,9 offset 298595
> dump 48 47 46 45 44 43 42 41 40 39 38 37 36 35 
> unexpected machine check:
> 
>     mces    = 0x1
>     vector  = 0x670
>     param   = 0xfffffc0000006048
>     pc      = 0xfffffc0000307be8
>     ra      = 0xfffffc0000307bd4
>     code    = 0x100000084
>     curproc = 0xfffffc0000a37448
>         pid = 17495, comm = find
> 
> panic: machine check
> 
> dumping to dev 8,9 offset 298595
> dump device not ready
> 
> -- 
> Kevin P. Neal                                http://www.pobox.com/~kpn/
> 
> "It sounded pretty good, but it's hard to tell how it will work out
> in practice." -- Dennis Ritchie, ~1977, "Summary of a DEC 32-bit machine"
>