Re: getiobuf(x, false) can sleep ?

To: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Subject: Re: getiobuf(x, false) can sleep ?
From: Andrew Doran <ad%NetBSD.org@localhost>
Date: Fri, 2 Apr 2010 13:01:12 +0000

On Fri, Apr 02, 2010 at 11:01:10AM +0200, Manuel Bouyer wrote:
> Hi,
> A server with a large wedge device started panicking under high I/O load
> (and I guess memory pressure) on a KASSERT(!ISSET(bp->b_oflags, BO_DONE))
> in biodone(). The stack trace was always:
> db{0}>  tr
> breakpoint() at netbsd:breakpoint+0x5
> panic() at netbsd:panic+0x24d
> __kernassert() at netbsd:__kernassert+0x2d
> biodone() at netbsd:biodone+0xc4
> dkiodone() at netbsd:dkiodone+0xa3
> biodone2() at netbsd:biodone2+0x95
> biointr() at netbsd:biointr+0x3c
> 
> so I started suspecting buffer list corruption at the dkwedge level.
> I added instrumentation to check
> - that buffer queue manipulations are done at splbio() with the
>   kernel_lock held in dkstart() and dkstrategy():
>         KASSERT(curcpu()->ci_biglock_count > 0);
>         KASSERT(curcpu()->ci_ilevel >= IPL_BIO);
>         KASSERT(curlwp->l_blcnt > 0); 
> - that the buffer doesn't change under us in dkstart():
>       KASSERT(BUFQ_GET(sc->sc_bufq) == bp);

Should this not be bufq_peek()?
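
One reading of the question above is that a check written with a "get"
operation also removes the buffer from the queue, while a "peek" leaves it
in place. The following is only a minimal sketch of that difference. The
toyq_* names and the integer "buffer ids" are hypothetical stand-ins for
illustration, not the NetBSD bufq code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy FIFO of integer "buffer ids"; not the NetBSD bufq API. */
struct toyq {
    int items[8];
    size_t head;    /* index of the next element to hand out */
    size_t tail;    /* index one past the last queued element */
};

static void toyq_put(struct toyq *q, int id)
{
    assert(q->tail < 8);
    q->items[q->tail++] = id;
}

/* Return the head element without removing it, or -1 if the queue is empty. */
static int toyq_peek(const struct toyq *q)
{
    return q->head == q->tail ? -1 : q->items[q->head];
}

/* Remove and return the head element, or -1 if the queue is empty. */
static int toyq_get(struct toyq *q)
{
    return q->head == q->tail ? -1 : q->items[q->head++];
}

static size_t toyq_len(const struct toyq *q)
{
    return q->tail - q->head;
}

/* Check the head with peek; returns how many elements remain queued. */
static size_t check_with_peek(void)
{
    struct toyq q = { .head = 0, .tail = 0 };
    toyq_put(&q, 10);
    toyq_put(&q, 20);
    int bp = toyq_peek(&q);
    int same = (toyq_peek(&q) == bp);   /* pure check, queue unchanged */
    (void)same;
    return toyq_len(&q);
}

/* Check the head with get; returns how many elements remain queued. */
static size_t check_with_get(void)
{
    struct toyq q = { .head = 0, .tail = 0 };
    toyq_put(&q, 10);
    toyq_put(&q, 20);
    int bp = toyq_peek(&q);
    int same = (toyq_get(&q) == bp);    /* the check itself dequeues */
    (void)same;
    return toyq_len(&q);
}
```

In this toy model the peek-based check leaves both elements queued, while
the get-based check consumes one as a side effect of checking.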

Hmm, it shouldn't be doing dkiodone -> dkstart -> VOP_STRATEGY.
VOP_STRATEGY should be called from process context (kthread or user process).
Anyhow, changing that is unlikely to fix your problem.

> 
> and this last KASSERT fired:
>       db{7}> tr
>       breakpoint() at netbsd:breakpoint+0x5
>       panic() at netbsd:panic+0x24d
>       __kernassert() at netbsd:__kernassert+0x2d
>       dkstart() at netbsd:dkstart+0x2f2
>       dkstrategy() at netbsd:dkstrategy+0xd2
>       bdev_strategy() at netbsd:bdev_strategy+0x50
>       spec_strategy() at netbsd:spec_strategy+0x5e
>       VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x65
>       bwrite() at netbsd:bwrite+0x192 
>       VOP_BWRITE() at netbsd:VOP_BWRITE+0x6e
>       ffs_full_fsync() at netbsd:ffs_full_fsync+0x292
>       ffs_fsync() at netbsd:ffs_fsync+0x5d
>       VOP_FSYNC() at netbsd:VOP_FSYNC+0x71
>       sched_sync() at netbsd:sched_sync+0x15d
> 
> (FWIW, CPU 0 was doing:
>       db{7}> mach cpu 0
>       using CPU 0
>       db{7}> tr
>       _kernel_lock() at netbsd:_kernel_lock+0x12d
>       intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x16
>       Xintr_ioapic_level2() at netbsd:Xintr_ioapic_level2+0xf7
>       --- interrupt ---
>       Xspllower() at netbsd:Xspllower+0xe
>       ubc_release() at netbsd:ubc_release+0x87
>       ubc_uiomove() at netbsd:ubc_uiomove+0xe4
>       ffs_write() at netbsd:ffs_write+0x667
>       VOP_WRITE() at netbsd:VOP_WRITE+0x66 
>       vn_write() at netbsd:vn_write+0xce
>       dofilewrite() at netbsd:dofilewrite+0x81
>       sys_write() at netbsd:sys_write+0x72
>       syscall() at netbsd:syscall+0xb6
> other CPUs were in the idle loop).
> 
> 
> 
> Now, given that the other KASSERTs didn't fire, I guess the only way this can
> happen is that the thread slept between the BUFQ_PEEK() and
> BUFQ_GET(). The only candidate is getiobuf(sc->sc_parent->dk_rawvp, false).
> 
> When called this way getiobuf() will do pool_cache_get(bufio_cache, 
> PR_NOWAIT).
> Can anyone see whether this can sleep somewhere despite the PR_NOWAIT?
> Maybe in some low-level UVM or pmap operation ?
> 
> -- 
> Manuel Bouyer <bouyer%antioche.eu.org@localhost>
>      NetBSD: 26 years of experience will always make the difference
> --
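
The suspicion in the quoted message (a sleep between BUFQ_PEEK() and
BUFQ_GET() letting the queue change) can be illustrated with a toy model.
This is a sketch under stated assumptions: the sortq_* queue, its ordering
by block number, and alloc_that_sleeps() are all hypothetical, and nothing
here shows whether getiobuf() can actually sleep.

```c
#include <assert.h>
#include <stddef.h>

#define SQ_MAX 8

/* Hypothetical queue kept sorted by block number; not the NetBSD bufq code. */
struct sortq {
    long blk[SQ_MAX];
    size_t n;
};

/* Insert b so that the array stays in ascending order. */
static void sortq_put(struct sortq *q, long b)
{
    assert(q->n < SQ_MAX);
    size_t i = q->n++;
    while (i > 0 && q->blk[i - 1] > b) {
        q->blk[i] = q->blk[i - 1];
        i--;
    }
    q->blk[i] = b;
}

/* Return the lowest block number without removing it, or -1 if empty. */
static long sortq_peek(const struct sortq *q)
{
    return q->n ? q->blk[0] : -1;
}

/* Remove and return the lowest block number, or -1 if empty. */
static long sortq_get(struct sortq *q)
{
    if (q->n == 0)
        return -1;
    long b = q->blk[0];
    for (size_t i = 1; i < q->n; i++)
        q->blk[i - 1] = q->blk[i];
    q->n--;
    return b;
}

/*
 * Hypothetical stand-in for an allocation that unexpectedly sleeps:
 * while the caller is asleep, another context queues block 50, which
 * sorts ahead of everything already queued.
 */
static void alloc_that_sleeps(struct sortq *q)
{
    sortq_put(q, 50);
}

/* Peek, allocate, then get. Returns 1 if get yields the peeked element. */
static int peek_alloc_get(int alloc_sleeps)
{
    struct sortq q = { .n = 0 };
    sortq_put(&q, 100);
    sortq_put(&q, 200);

    long bp = sortq_peek(&q);       /* head is block 100 */
    if (alloc_sleeps)
        alloc_that_sleeps(&q);      /* queue changes behind our back */
    return sortq_get(&q) == bp;     /* the kind of check the KASSERT makes */
}
```

When the allocation does not sleep, the get returns the peeked element.
When it does, the newly queued block is returned instead and a check of the
form "get == bp" fails, which matches the shape of the reported assertion.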
