tech-kern archive

Can getiobuf(x, false) sleep?



Hi,
A server with a large wedge device started panicking under high I/O load
(and, I guess, memory pressure) on a KASSERT(!ISSET(bp->b_oflags, BO_DONE))
in biodone(). The stack trace was always:
db{0}>  tr
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x24d
__kernassert() at netbsd:__kernassert+0x2d
biodone() at netbsd:biodone+0xc4
dkiodone() at netbsd:dkiodone+0xa3
biodone2() at netbsd:biodone2+0x95
biointr() at netbsd:biointr+0x3c

so I started suspecting buffer list corruption at the dkwedge level.
I added instrumentation to check
- that buffer queue manipulations are done at splbio() with the
  kernel_lock held in dkstart() and dkstrategy():
          KASSERT(curcpu()->ci_biglock_count > 0);
          KASSERT(curcpu()->ci_ilevel >= IPL_BIO);
          KASSERT(curlwp->l_blcnt > 0); 
- that the buffer doesn't change under us in dkstart():
        KASSERT(BUFQ_GET(sc->sc_bufq) == bp);

and this last KASSERT fired:
        db{7}> tr
        breakpoint() at netbsd:breakpoint+0x5
        panic() at netbsd:panic+0x24d
        __kernassert() at netbsd:__kernassert+0x2d
        dkstart() at netbsd:dkstart+0x2f2
        dkstrategy() at netbsd:dkstrategy+0xd2
        bdev_strategy() at netbsd:bdev_strategy+0x50
        spec_strategy() at netbsd:spec_strategy+0x5e
        VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x65
        bwrite() at netbsd:bwrite+0x192 
        VOP_BWRITE() at netbsd:VOP_BWRITE+0x6e
        ffs_full_fsync() at netbsd:ffs_full_fsync+0x292
        ffs_fsync() at netbsd:ffs_fsync+0x5d
        VOP_FSYNC() at netbsd:VOP_FSYNC+0x71
        sched_sync() at netbsd:sched_sync+0x15d

(FWIW, CPU 0 was doing:
        db{7}> mach cpu 0
        using CPU 0
        db{7}> tr
        _kernel_lock() at netbsd:_kernel_lock+0x12d
        intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x16
        Xintr_ioapic_level2() at netbsd:Xintr_ioapic_level2+0xf7
        --- interrupt ---
        Xspllower() at netbsd:Xspllower+0xe
        ubc_release() at netbsd:ubc_release+0x87
        ubc_uiomove() at netbsd:ubc_uiomove+0xe4
        ffs_write() at netbsd:ffs_write+0x667
        VOP_WRITE() at netbsd:VOP_WRITE+0x66 
        vn_write() at netbsd:vn_write+0xce
        dofilewrite() at netbsd:dofilewrite+0x81
        sys_write() at netbsd:sys_write+0x72
        syscall() at netbsd:syscall+0xb6
other CPUs were in the idle loop).



Now, given that the other KASSERTs didn't fire, I guess the only way this can
happen is that the thread slept between the BUFQ_PEEK() and the
BUFQ_GET(). The only candidate is getiobuf(sc->sc_parent->dk_rawvp, false).

When called this way, getiobuf() will do pool_cache_get(bufio_cache, PR_NOWAIT).
Can anyone see whether this can sleep somewhere despite the PR_NOWAIT?
Maybe in some low-level UVM or pmap operation?
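
To make the suspected sequence concrete, here is a small user-space model of
the peek / allocate / get pattern. It is only a sketch and not the dk(4) code:
the toy queue and every name in it (toyq_*, toy_alloc, peek_alloc_get) are
hypothetical, and the "sleep" is modelled by letting another context dequeue
the head entry while the allocation is in progress.

```c
#include <assert.h>
#include <stddef.h>

/* Toy FIFO standing in for the wedge's buffer queue (hypothetical). */
#define TOYQ_LEN 8

struct toyq {
	int items[TOYQ_LEN];	/* buffer ids, oldest first */
	size_t head;		/* index of the next entry to hand out */
	size_t tail;		/* index of the next free slot */
};

static void
toyq_put(struct toyq *q, int id)
{
	assert(q->tail < TOYQ_LEN);
	q->items[q->tail++] = id;
}

/* Return the head entry without removing it, or -1 if the queue is empty. */
static int
toyq_peek(const struct toyq *q)
{
	return q->head == q->tail ? -1 : q->items[q->head];
}

/* Remove and return the head entry, or -1 if the queue is empty. */
static int
toyq_get(struct toyq *q)
{
	return q->head == q->tail ? -1 : q->items[q->head++];
}

/*
 * Stand-in for the allocation done between peek and get.  If sleeps is
 * nonzero, we model the caller being put to sleep: another context runs
 * the same start routine and takes the current head of the queue.
 */
static void
toy_alloc(struct toyq *q, int sleeps)
{
	if (sleeps)
		(void)toyq_get(q);
}

/*
 * Run peek, allocate, get once.  Returns 1 if the entry removed is the
 * one that was peeked, 0 if the queue changed in between.
 */
static int
peek_alloc_get(int alloc_sleeps)
{
	struct toyq q = { .head = 0, .tail = 0 };
	int peeked, got;

	toyq_put(&q, 100);
	toyq_put(&q, 200);

	peeked = toyq_peek(&q);
	toy_alloc(&q, alloc_sleeps);
	got = toyq_get(&q);

	return got == peeked;
}
```

With a non-sleeping allocation, peek_alloc_get(0) sees the same entry from
both calls; with peek_alloc_get(1) the entry returned by the get differs from
the one peeked, which is the condition the added KASSERT in dkstart() checks
for.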

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

