Subject: NetBSD 2.0 crashing in lstat(2) over NFS
To: None <tech-kern@netbsd.org>
From: Matthew Mondor <mm_lists@pulsar-zone.net>
List: tech-kern
Date: 01/02/2005 07:25:21
I am getting occasional crashes which I can quite effectively reproduce when
creating alot of file I/O load using either locate.updatedb or a cvs update
of a rather large tree, while at the same time causing lstat() calls to an
nfs server via a remotely mounted filesystem.

The problem could not be reproduced everytime stressin this test, and it
appears to no longer be possible to crash the system when using lstat on
already cached entries, they seem to need to trigger remote lstat(). What
I used was several instances of locate.updatedb while at the same time
find /nfs/ginseng | xargs ls -l

It was rather hard to obtain a core dump, the box was generally freezing
in a way that I couldn't get into ddb, nor get it to automatically reboot
with ddb disabled. However, once compiled with DIAGNOSTIC, it was possible
to obtain one... So here is some information which I could obtain. I still
have the coredump file and kernel image, and will query it as necessary on
request for further information.

This system is used as a desktop, using Linux binary compatibility support
with suse91 packages from pkgsrc-2004Q4, although this seems unrelated as of
yet, my server, also running NetBSD 2.0, without X11, without Linux compat
and serving that NFS filesystem I am mounting from this system seems stable.
Both systems are using soft dependencies with FFS2.

I could send a PR with this, but if doing so want to be sure to include all
necessary information, and possibly attached with a diff proposing a fix...
So here it the information I obtained yet:


VERSION
=3D=3D=3D=3D=3D=3D=3D

# uname -a
NetBSD hal.xisop 2.0 NetBSD 2.0 (HAL) #0: Sat Jan  1 15:00:45 EST 2005  roo=
t@hal.xisop:/usr/src/sys/arch/i386/compile/HAL i386


DMESG
=3D=3D=3D=3D=3D

# dmesg -M /var/crash/netbsd.3.core

[...]
uvm_fault(0xc71d0378, 0, 0, 2) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 2 eip c0173231 cs 8 eflags 10213 cr2 10c ilevel 0
panic: trap
syncing disks... 9 6 5 4 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 giving up
Printing vnodes for busy buffers
tag 1 type VBLK, usecount 14573, writecount 0, refcount 2958,
        tag VT_UFS, ino 3566716, on dev 0, 0 flags 0x0, effnlink 1, nlink 1
        mode 060640, owner 0, group 5, size 0 not locked
giving up
[...]


BACKTRACE
=3D=3D=3D=3D=3D=3D=3D=3D=3D

# gdb /usr/src/sys/arch/i386/compile/HAL/netbsd.gdb
(gdb) target kcore /var/crash/netbsd.3.core
panic: trap
#0  0xc0481000 in ?? ()
(gdb) bt
#0  0xc0481000 in ?? ()
#1  0xc0285e97 in cpu_reboot (howto=3D256, bootstr=3D0x0)
    at ../../../../arch/i386/i386/machdep.c:745
#2  0xc021a149 in panic (fmt=3D0xc0344f02 "trap")
    at ../../../../kern/subr_prf.c:242
#3  0xc028f62e in trap (frame=3D0xc8f3dda4)
    at ../../../../arch/i386/i386/trap.c:296
#4  0xc0102dd9 in calltrap ()
#5  0xc024b31a in VOP_INACTIVE (vp=3D0xc6cb56a0, p=3D0xc71ce660)
    at ../../../../kern/vnode_if.c:1024
#6  0xc02411ee in vput (vp=3D0xc6cb56a0) at ../../../../kern/vfs_subr.c:1322
#7  0xc0247289 in sys___lstat13 (l=3D0xc6f47c64, v=3D0xc8f3df64, retval=3D0=
xc8f3df5c)
    at ../../../../kern/vfs_syscalls.c:2210
#8  0xc028ef25 in syscall_plain (frame=3D0xc8f3dfa8)
    at ../../../../arch/i386/i386/syscall.c:156


VNODE
=3D=3D=3D=3D=3D

(gdb) print *(struct vnode *)0xc6cb56a0
$1 =3D {v_uobj =3D {vmobjlock =3D {lock_data =3D 0}, pgops =3D 0xc0382450, =
memq =3D {
      tqh_first =3D 0x0, tqh_last =3D 0xc6cb56a8}, uo_npages =3D 0, uo_refs=
 =3D 0},=20
  v_size =3D 7846, v_flag =3D 0, v_numoutput =3D 0, v_writecount =3D 0, v_h=
oldcnt =3D 0,=20
  v_mount =3D 0xc0a01a00, v_op =3D 0xc08d5b00, v_freelist =3D {tqe_next =3D=
 0x0,=20
    tqe_prev =3D 0xc76c8c34}, v_mntvnodes =3D {le_next =3D 0xc87c79d0,=20
    le_prev =3D 0xc0a01a14}, v_cleanblkhd =3D {lh_first =3D 0x0}, v_dirtybl=
khd =3D {
    lh_first =3D 0x0}, v_synclist =3D {le_next =3D 0x0, le_prev =3D 0x0}, v=
_dnclist =3D {
    lh_first =3D 0x0}, v_nclist =3D {lh_first =3D 0xc880b408}, v_un =3D {
    vu_mountedhere =3D 0x0, vu_socket =3D 0x0, vu_specinfo =3D 0x0,=20
    vu_fifoinfo =3D 0x0}, v_lease =3D 0x0, v_type =3D VREG, v_tag =3D VT_NF=
S,=20
  v_lock =3D {lk_interlock =3D {lock_data =3D 0}, lk_flags =3D 0, lk_sharec=
ount =3D 0,=20
    lk_exclusivecount =3D 0, lk_recurselevel =3D 0, lk_waitcount =3D 0,=20
    lk_wmesg =3D 0xc0342140 "vnlock", lk_un =3D {lk_un_sleep =3D {
        lk_sleep_lockholder =3D -1, lk_sleep_locklwp =3D 0, lk_sleep_prio =
=3D 20,=20
        lk_sleep_timo =3D 0}, lk_un_spin =3D {lk_spin_cpu =3D 4294967295}}}=
,=20
  v_vnlock =3D 0xc6cb5710, v_data =3D 0xc6f363e8, v_klist =3D {slh_first =
=3D 0x0}}


PROCESS
=3D=3D=3D=3D=3D=3D=3D

(gdb) print *(struct proc *)0xc71ce660
$2 =3D {p_list =3D {le_next =3D 0xc71ceb28, le_prev =3D 0xc03c0d28},=20
  p_cred =3D 0xc6a99510, p_fd =3D 0xc7177d28, p_cwdi =3D 0xc6a9b1e0,=20
  p_stats =3D 0xc8f3f00c, p_limit =3D 0xc6b323a8, p_vmspace =3D 0xc71d0378,=
=20
  p_sigacts =3D 0xc8f3e0a0, p_ksems =3D 0x0, p_exitsig =3D 20, p_flag =3D 1=
6386,=20
  p_stat =3D 2 '\002', p_pad1 =3D "=BE=AD=DE", p_pid =3D 1496, p_pglist =3D=
 {
    le_next =3D 0xc71ce000, le_prev =3D 0xc71ce368}, p_pptr =3D 0xc71ce330,=
=20
  p_sibling =3D {le_next =3D 0x0, le_prev =3D 0xc71ce37c}, p_children =3D {
    lh_first =3D 0x0}, p_lock =3D {lock_data =3D -559038737}, p_lwps =3D {
    lh_first =3D 0xc6f47c64}, p_raslist =3D {lh_first =3D 0x0}, p_nlwps =3D=
 1,=20
  p_nrlwps =3D 1, p_nzlwps =3D 0, p_nlwpid =3D 1, p_nstopchild =3D 0, p_sa =
=3D 0x0,=20
  p_estcpu =3D 14, p_cpticks =3D 43, p_pctcpu =3D 29, p_opptr =3D 0x0, p_ti=
mers =3D 0x0,=20
  p_rtime =3D {tv_sec =3D 0, tv_usec =3D 95536}, p_uticks =3D 0, p_sticks =
=3D 73,=20
  p_iticks =3D 0, p_traceflag =3D 0, p_tracep =3D 0x0, p_systrace =3D 0x0,=
=20
  p_textvp =3D 0xc730385c, p_emul =3D 0xc031f680, p_emuldata =3D 0x0, p_use=
rret =3D 0,=20
  p_userret_arg =3D 0x0, p_execsw =3D 0xc031f490, p_klist =3D {slh_first =
=3D 0x0},=20
  p_sigctx =3D {ps_siglist =3D {__bits =3D {0, 0, 0, 0}}, ps_sigcheck =3D 0=
 '\0',=20
    ps_sigwaited =3D 0x0, ps_sigwait =3D 0x0, ps_silock =3D {lock_data =3D =
0},=20
    ps_siginfo =3D {cqh_first =3D 0xc71ce750, cqh_last =3D 0xc71ce750}, ps_=
sigstk =3D {
      ss_sp =3D 0x0, ss_size =3D 0, ss_flags =3D 4}, ps_oldmask =3D {__bits=
 =3D {524288,=20
        0, 0, 0}}, ps_flags =3D 0, ps_signo =3D 0, ps_code =3D 0, ps_lwp =
=3D 0,=20
    ps_sigcode =3D 0x0, ps_sigmask =3D {__bits =3D {0, 0, 0, 0}}, ps_sigign=
ore =3D {
      __bits =3D {2554888192, 0, 0, 0}}, ps_sigcatch =3D {__bits =3D {0, 0,=
 0, 0}}},=20
  p_nice =3D 20 '\024', p_comm =3D "ls\0gs\0-4.0.2\0\0\0\0", p_pgrp =3D 0xc=
6ad5240,=20
  p_psstr =3D 0xbfbffff0, p_psargv =3D 0, p_psnargv =3D 4, p_psenv =3D 8,=20
  p_psnenv =3D 12, p_xstat =3D 48879, p_acflag =3D 0, p_ru =3D 0xdeadbeef, =
p_md =3D {
    md_flags =3D 0, md_syscall =3D 0xc028eea0 <syscall_plain>, md_astpendin=
g =3D 1}}

# ps ax -kpgid -o pid,paddr,ucomm -M /var/crash/netbsd.3.core | grep c71ce6=
60
1496 c71ce660 ls

Note that "p_vmspace =3D 0xc71d0378" refers to same address as uvm_fault()
displayed in dmesg.


Since this seems related to vfs and/or uvm, and that I don't have much
experience with those kernel parts personally, I thought it would be useful
to post this information here to get help into solving this problem.


Thanks,
Matt