netbsd-bugs: kern/35198: lfs_pchain corruption causing hang or panic

Subject: kern/35198: lfs_pchain corruption causing hang or panic
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <stix@stix.id.au>
List: netbsd-bugs
Date: 12/07/2006 12:00:01
>Number:         35198
>Category:       kern
>Synopsis:       lfs_pchain corruption causing hang or panic
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Dec 07 12:00:01 +0000 2006
>Originator:     Paul Ripke
>Release:        NetBSD 4.99.5
>Organization:
>Environment:
System: NetBSD hactar.stix.org.au 4.99.5 NetBSD 4.99.5 (GENERIC) #0: Sun Dec 3 15:54:49 EST 2006 stix@zion.stix.org.au:/export/netbsd/current/obj.i386/export/netbsd/current/src/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:

I've been investigating a reproducible lfs hang and crash that appear to be
related. From my investigation, it appears that the lfs_pchain tailqueue is
getting corrupted somehow.

The hang I'm seeing has lfs_pchain with one or more inodes on the chain
(usually just one), but none have IN_PAGING set, and it spins on the goto in
lfs_flush_pchain():

(gdb) bt
...
#14 0xc017f5e4 in pckbcintr (vsc=0xc08c7300) at /export/netbsd/current/src/sys/dev/ic/pckbc.c:640
#15 0xc030cb7f in intr_biglock_wrapper (vp=0xc0908040) at /export/netbsd/current/src/sys/arch/x86/x86/intr.c:544
#16 0xc0102f8c in Xintr_ioapic_edge1 ()
#17 0xc0232754 in lfs_flush_pchain (fs=0xc0d45800) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1344
#18 0xc022b866 in lfs_writerd (arg=0xca2cd5c8) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vfsops.c:236
#19 0xc01002e7 in proc_trampoline ()
(gdb) f 17
#17 0xc0232754 in lfs_flush_pchain (fs=0xc0d45800) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1344
warning: Source file is more recent than executable.
1344                    nip = TAILQ_NEXT(ip, i_lfs_pchain);
(gdb) list
1339             * fast and async.
1340             */
1341            simple_lock(&fs->lfs_interlock);
1342        top:
1343            for (ip = TAILQ_FIRST(&fs->lfs_pchainhd); ip != NULL; ip = nip) {
1344                    nip = TAILQ_NEXT(ip, i_lfs_pchain);
1345                    vp = ITOV(ip);
1346
1347                    if (!(ip->i_flags & IN_PAGING))
1348                            goto top;
(gdb) p fs->lfs_pchainhd
$1 = {tqh_first = 0xcb91ea40, tqh_last = 0xcb919044}
(gdb) p ip
$2 = (struct inode *) 0xcb91ea40
(gdb) p ip.i_flags
$4 = 0
(gdb) p ip->inode_ext.lfs->lfs_pchain.tqe_next
$5 = (struct inode *) 0x0

The crash I'm seeing is attempting to remove an inode from an empty
lfs_pchain after clearing IN_PAGING, in lfs_putpages():

(gdb) bt
...
#8  0xc0337e8e in trap (frame=0xcb2c5804) at /export/netbsd/current/src/sys/arch/i386/i386/trap.c:313
#9  0xc010c01e in calltrap ()
#10 0xc02382a5 in lfs_putpages (v=0xcb2c58f0) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1920
#11 0xc02f5510 in VOP_PUTPAGES (vp=0xcc2d359c, offlo=0, offhi=0, flags=27)
    at /export/netbsd/current/src/sys/kern/vnode_if.c:1592
#12 0xc02e81f8 in vinvalbuf (vp=0xcc2d359c, flags=1, cred=0xffffffff, l=0xcb029564, slpflag=0, slptimeo=0)
    at /export/netbsd/current/src/sys/kern/vfs_subr.c:715
#13 0xc02e862e in vclean (vp=0xcc2d359c, flags=<value optimized out>, l=0xcb029564)
    at /export/netbsd/current/src/sys/kern/vfs_subr.c:1555
#14 0xc02e8bc8 in vgonel (vp=0xcc2d359c, l=0xcb029564) at /export/netbsd/current/src/sys/kern/vfs_subr.c:1734
#15 0xc02e9044 in getcleanvnode (l=0xcb029564) at /export/netbsd/current/src/sys/kern/vfs_subr.c:272
#16 0xc02e91f0 in getnewvnode (tag=VT_LFS, mp=0xc0a74000, vops=0xc0967500, vpp=0xcb2c5bb8)
    at /export/netbsd/current/src/sys/kern/vfs_subr.c:561
#17 0xc0234f76 in lfs_set_dirop_create (dvp=0xcc22ba54, vpp=0xcb2c5bb8)
    at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:466
#18 0xc0235787 in lfs_create (v=0xcb2c5ab8) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:674
#19 0xc02f4c11 in VOP_CREATE (dvp=0xcc22ba54, vpp=0xcb2c5bb8, cnp=0xcb2c5bcc, vap=0xcb2c5afc)
    at /export/netbsd/current/src/sys/kern/vnode_if.c:164
#20 0xc02f28bd in vn_open (ndp=0xcb2c5ba8, fmode=1538, cmode=384)
    at /export/netbsd/current/src/sys/kern/vfs_vnops.c:169
#21 0xc02ef2a2 in sys_open (l=0xcb029564, v=0xcb2c5c48, retval=0xcb2c5c68)
    at /export/netbsd/current/src/sys/kern/vfs_syscalls.c:1171
#22 0xc03374f8 in syscall_plain (frame=0xcb2c5c88) at /export/netbsd/current/src/sys/arch/i386/i386/syscall.c:144
#23 0xc01006e0 in syscall1 ()
(gdb) f 10
#10 0xc02382a5 in lfs_putpages (v=0xcb2c58f0) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1919
warning: Source file is more recent than executable.
1920                            TAILQ_REMOVE(&fs->lfs_pchainhd, ip, i_lfs_pchain);
(gdb) list
1914                    simple_unlock(&vp->v_interlock);
1915
1916                    /* Remove us from paging queue, if we were on it */
1917                    simple_lock(&fs->lfs_interlock);
1918                    if (ip->i_flags & IN_PAGING) {
1919                            ip->i_flags &= ~IN_PAGING;
1920                            TAILQ_REMOVE(&fs->lfs_pchainhd, ip, i_lfs_pchain);
1921                    }
1922                    simple_unlock(&fs->lfs_interlock);
(gdb) p ip
$11 = (struct inode *) 0xcc2d4bf0
(gdb) p ip->i_flags
$12 = 0
(gdb) p fs->lfs_pchainhd
$13 = {tqh_first = 0x0, tqh_last = 0x0}

Reading the source, all lfs_pchain operations appear to be protected by
lfs_interlock, and very little code touches i_flags, so I'm at a loss as
to how this is happening.

FYI: line numbers mightn't quite match with current, I've been sprinkling
a few KASSERTs around to try to track this down. I can reproduce both the
hang and crash with both stock GENERIC and a custom kernel config based on
GENERIC.MP.

As you can see from the above, I have plenty of dumps available, and can
readily reproduce the problem as required.

>How-To-Repeat:
Both problems appear to be exacerbated by low memory; this ancient test
system only has 128 MB RAM, and even then, I can reduce the time to failure
from minutes to seconds by filling up RAM with processes. I have also tuned
down the ubc file cache:
ksh$ sysctl -a | egrep 'vm\.....m(in|ax)'
vm.anonmin = 10
vm.filemin = 5
vm.execmin = 5
vm.anonmax = 80
vm.filemax = 10
vm.execmax = 30

Both problems will occur when either unpacking pkgsrc onto a pristine LFS
file system or when /etc/daily runs a find(1) over an LFS file system
containing pkgsrc. Keeping as much memory free as possible allows the
unpacking to succeed, occasionally.

>Fix:
unknown