Subject: More LFS Woes
To: None <current-users@netbsd.org>
From: Gary Duzan <gary@duzan.org>
List: tech-kern
Date: 02/02/2003 17:19:00
   Ok, I thought my LFS issue (process hang in lfsresbu) had gone
away, but it looks like it hasn't. A reboot was what made the hang
go away, but after beating on the disk for a while it would hang
again. Turning on DEBUG/LFS_DEBUG generated complaints from
lfs_fits_buf around the time of the hang, reporting a negative
locked_queue_rcount, which I understand should never happen.
Naturally, rebooting resets locked_queue_rcount, which explains why
the hang clears, but I suspect the negative count is just a symptom
of a deeper problem.

   Turning on DIAGNOSTIC got me an assertion failure:

===========================================================================
panic: kernel %sassertion "%s" failed: file "%s", line %d
#0  0x1 in ?? ()
(gdb) where
#0  0x1 in ?? ()
#1  0xc025ef9e in cpu_reboot (howto=260, bootstr=0x0)
    at /usr/src/sys/arch/i386/i386/machdep.c:2449
#2  0xc01ca039 in db_reboot_cmd () at /usr/src/sys/ddb/db_command.c:669
#3  0xc01c9d14 in db_command (last_cmdp=0xc03aec94, cmd_table=0xc033c72c)
    at /usr/src/sys/ddb/db_command.c:455
#4  0xc01c9913 in db_command_loop () at /usr/src/sys/ddb/db_command.c:246
#5  0xc01cd404 in db_trap (type=1, code=0) at /usr/src/sys/ddb/db_trap.c:97
#6  0xc025a92f in kdb_trap (type=1, code=0, regs=0xe411bb20)
    at /usr/src/sys/arch/i386/i386/db_interface.c:224
#7  0xc0266ccb in trap (frame={tf_gs = 16, tf_fs = 48, tf_es = -468647920, 
      tf_ds = 131088, tf_edi = 256, tf_esi = -1069899008, tf_ebp = -468599968, 
      tf_ebx = -468599924, tf_edx = -1070309853, tf_ecx = 3840, tf_eax = 3099, 
      tf_trapno = 1, tf_err = 0, tf_eip = -1071273448, tf_cs = 8, 
      tf_eflags = 514, tf_esp = -468599936, tf_ss = -1071663199, 
      tf_vm86_es = -473497248, tf_vm86_ds = 120, tf_vm86_fs = -474952408, 
      tf_vm86_gs = 0}) at /usr/src/sys/arch/i386/i386/trap.c:285
#8  0xc0102c0c in calltrap ()
#9  0xc01fb7a1 in panic (
    fmt=0xc03aa300 "kernel %sassertion \"%s\" failed: file \"%s\", line %d")
    at /usr/src/sys/kern/subr_prf.c:227
#10 0xc031524b in __assert () at /usr/src/sys/lib/libkern/__assert.c:47
#11 0xc01b1f39 in lfs_reserve (fs=0xc0dfd400, vp=0xe3c70160, vp2=0xe3b0cd28, 
    fsb=120) at /usr/src/sys/ufs/lfs/lfs_bio.c:266
#12 0xc01c1034 in lfs_set_dirop (vp=0xe3c70160, vp2=0xe3b0cd28)
    at /usr/src/sys/ufs/lfs/lfs_vnops.c:381
#13 0xc01c18cb in lfs_remove (v=0xe411bcbc)
    at /usr/src/sys/ufs/lfs/lfs_vnops.c:632
#14 0xc01c6da8 in ufs_rename (v=0xe411be70) at /usr/src/sys/sys/vnode_if.h:686
#15 0xc01c1f0f in lfs_rename (v=0xe411be70)
    at /usr/src/sys/ufs/lfs/lfs_vnops.c:741
can not access 0xbfbfd5ec, invalid translation (invalid PDE)
can not access 0xbfbff623, invalid translation (invalid PDE)
#16 0xc0220523 in rename_files (
    from=0xbfbfd5ec <Address 0xbfbfd5ec out of bounds>, 
    to=0xbfbff623 <Address 0xbfbff623 out of bounds>, p=0xe37944cc, retain=0)
    at /usr/src/sys/sys/vnode_if.h:756
#17 0xc02201f7 in sys_rename (l=0xe36f8950, v=0xe411bf80, retval=0xe411bf78)
    at /usr/src/sys/kern/vfs_syscalls.c:2840
#18 0xc0266697 in syscall_plain (frame={tf_gs = 31, tf_fs = 31, tf_es = 31, 
      tf_ds = 31, tf_edi = -1077938653, tf_esi = -1077946900, 
      tf_ebp = -1077945876, tf_ebx = 0, tf_edx = 0, tf_ecx = 1208984620, 
      tf_eax = 128, tf_trapno = 3, tf_err = 2, tf_eip = 1208526951, 
      tf_cs = 23, tf_eflags = 647, tf_esp = -1077946944, tf_ss = 31, 
      tf_vm86_es = 0, tf_vm86_ds = 0, tf_vm86_fs = 0, tf_vm86_gs = 0})
    at /usr/src/sys/arch/i386/i386/syscall.c:156
#19 0xc0100b1f in syscall1 ()
can not access 0xbfbfd9ec, invalid translation (invalid PDE)
can not access 0xbfbfd9ec, invalid translation (invalid PDE)
Cannot access memory at address 0xbfbfd9ec
===========================================================================
int
lfs_reserve(struct lfs *fs, struct vnode *vp, struct vnode *vp2, int fsb)
{
        int error;
        int cantwait;

        KASSERT(fsb < 0 || VOP_ISLOCKED(vp));
        KASSERT(vp2 == NULL || fsb < 0 || VOP_ISLOCKED(vp2));
        KASSERT(vp2 == NULL || !(VTOI(vp2)->i_flag & IN_ADIROP));
===========================================================================

   The last line is the one that fails; vp2 is not NULL and its
inode has the IN_ADIROP flag set. I guess the first question is
whether or not this assertion is valid. Browsing the stack, it
doesn't seem that anything extraordinary had to happen to get to
this state, but the system does a lot of work before the assertion
fails. (A build distribution gets most of the way through the main
install phase before it craps out.)
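
   To make the failing condition concrete, here is a minimal
user-space sketch of just the third assertion. The struct layouts,
the VTOI() macro body, the IN_ADIROP bit value, and the helper name
adirop_check are all stand-ins assumed for illustration; the real
definitions are in the kernel headers and differ in detail.

```c
#include <stddef.h>

/*
 * User-space stand-ins for the kernel types. These are assumptions
 * made for illustration only; the real struct vnode, struct inode,
 * VTOI() and IN_ADIROP live in the kernel sources.
 */
#define IN_ADIROP 0x0200        /* assumed bit value, not the real one */

struct inode { unsigned int i_flag; };
struct vnode { struct inode *v_data; };

#define VTOI(vp) ((vp)->v_data)

/*
 * Hypothetical helper mirroring the third KASSERT expression.
 * Returns 1 when the condition holds, 0 when the assertion would fire.
 */
static int
adirop_check(const struct vnode *vp2)
{
        return vp2 == NULL || !(VTOI(vp2)->i_flag & IN_ADIROP);
}
```

   With vp2 == NULL, or with a vnode whose i_flag lacks the bit,
adirop_check() returns 1. With the bit set it returns 0, which
corresponds to the case in the backtrace above, where vp2 is
0xe3b0cd28 and the flag is set.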

   Any ideas about what is going on here? I have a good core dump,
and I can reliably reproduce the problem, so suggestions on things
to check and/or try would be appreciated.

					Gary Duzan