vnode lock and v_numoutput

To: tech-kern%netbsd.org@localhost
Subject: vnode lock and v_numoutput
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Date: Sat, 24 Jan 2015 14:34:24 +0100

Hello,
I have what looks like a deadlock in a Xen dom0 when using
file-backed virtual disks and HVM domUs (the dom0 is running netbsd-6).
In this setup, a file backing a virtual disk may be accessed by both
a vnd(4) and a qemu-dm userland process.
I end up in this situation:
qemu-dm is blocked on:
trace: pid 5171 lid 5 at 0xffffa0005e744740
sleepq_block() at netbsd:sleepq_block+0xc5
cv_wait() at netbsd:cv_wait+0xf2
genfs_do_putpages() at netbsd:genfs_do_putpages+0xa0e
VOP_PUTPAGES() at netbsd:VOP_PUTPAGES+0x5b
vflushbuf() at netbsd:vflushbuf+0x4b
ffs_full_fsync() at netbsd:ffs_full_fsync+0x143
ffs_fsync() at netbsd:ffs_fsync+0x4b
VOP_FSYNC() at netbsd:VOP_FSYNC+0x5f
sys_fsync() at netbsd:sys_fsync+0x51
syscall() at netbsd:syscall+0xc4

(gdb) l *(genfs_do_putpages+0xa0e)
0xffffffff801bc7d4 is in genfs_do_putpages (../../../../miscfs/genfs/genfs_io.c:1246).
1241    skip_scan:
1242    #endif /* !defined(DEBUG) */
1243    
1244            /* Wait for output to complete. */
1245            if (!wasclean && !async && vp->v_numoutput != 0) {
1246                    while (vp->v_numoutput != 0)
1247                            cv_wait(&vp->v_cv, slock);
1248            }
1249            onworklst = (vp->v_iflag & VI_ONWORKLST) != 0;
1250            mutex_exit(slock);

while the vnd is blocked on:
db>   tr/a ffffa00005416280
trace: pid 0 lid 88 at 0xffffa0005c100a40
sleepq_block() at netbsd:sleepq_block+0xc5
turnstile_block() at netbsd:turnstile_block+0x3c6
rw_vector_enter() at netbsd:rw_vector_enter+0x17a
genfs_lock() at netbsd:genfs_lock+0x9f
VOP_LOCK() at netbsd:VOP_LOCK+0x53
vn_lock() at netbsd:vn_lock+0xd9
vndthread() at netbsd:vndthread+0x43d

(gdb) l *(vndthread+0x43d)
0xffffffff804c8faa is in vndthread (../../../../dev/vnd.c:818).
813                     daddr_t nbn;
814                     int off, nra;
815     
816                     nra = 0;
817                     vn_lock(vnd->sc_vp, LK_EXCLUSIVE | LK_RETRY);
818                     error = VOP_BMAP(vnd->sc_vp, bn / bsize, &vp, &nbn, &nra);
819                     VOP_UNLOCK(vnd->sc_vp);
820     
821                     if (error == 0 && (long)nbn == -1)
822                             error = EIO;


I guess what happens is:
vndthread is blocked trying to take the vnode lock (vn_lock()) because
the thread doing the sys_fsync() is holding it.
The thread in sys_fsync() waits for v_numoutput to drop to 0,
but this won't happen because vndthread has already increased
v_numoutput and has not queued the I/O yet.
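
To make the suspected interleaving concrete, here is a small user-space
model of it using pthreads. It is only a sketch of my guess above, not
kernel code: struct model_vnode and model_interleaving_is_stuck() are
hypothetical names, the mutexes merely stand in for the vnode lock and
the interlock, and the steps are replayed in a single thread with
non-blocking probes so that the model reports the cycle instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a vnode; the names only mirror the kernel's. */
struct model_vnode {
	pthread_mutex_t vnlock;		/* models the vnode lock (vn_lock) */
	pthread_mutex_t interlock;	/* models slock in genfs_do_putpages */
	pthread_cond_t cv;		/* models v_cv */
	int numoutput;			/* models v_numoutput */
};

/*
 * Replay the suspected ordering. Returns true when neither the "vnd"
 * side nor the "fsync" side can make progress.
 */
bool
model_interleaving_is_stuck(void)
{
	struct model_vnode vn;
	struct timespec deadline;
	int vnd_rc, fsync_rc = 0;

	pthread_mutex_init(&vn.vnlock, NULL);
	pthread_mutex_init(&vn.interlock, NULL);
	pthread_cond_init(&vn.cv, NULL);
	vn.numoutput = 0;

	/* 1. "vnd" accounts for a write that it has not queued yet. */
	pthread_mutex_lock(&vn.interlock);
	vn.numoutput++;
	pthread_mutex_unlock(&vn.interlock);

	/* 2. "fsync" takes the vnode lock. */
	pthread_mutex_lock(&vn.vnlock);

	/* 3. "vnd" needs the vnode lock before it can queue the write. */
	vnd_rc = pthread_mutex_trylock(&vn.vnlock);
	if (vnd_rc == 0)
		pthread_mutex_unlock(&vn.vnlock);	/* not expected */

	/* 4. "fsync" waits for numoutput to drain, capped at 50 ms here. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_nsec += 50L * 1000 * 1000;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}
	pthread_mutex_lock(&vn.interlock);
	while (vn.numoutput != 0 && fsync_rc == 0)
		fsync_rc = pthread_cond_timedwait(&vn.cv, &vn.interlock,
		    &deadline);
	pthread_mutex_unlock(&vn.interlock);

	pthread_mutex_unlock(&vn.vnlock);
	pthread_cond_destroy(&vn.cv);
	pthread_mutex_destroy(&vn.interlock);
	pthread_mutex_destroy(&vn.vnlock);

	/* Stuck: the lock probe failed and the wait timed out. */
	return vnd_rc == EBUSY && fsync_rc == ETIMEDOUT;
}
```

In the kernel there is no timeout on either side, which would match the
two traces above: one thread sleeping in cv_wait() and the other in
rw_vector_enter().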

Should a thread hold the vnode lock (vn_lock()) before increasing v_numoutput?
If so, vnd(4) gets it wrong, and it doesn't look easy to fix :(
I don't know if handle_with_rdwr() holds the vnode lock at all.
handle_with_strategy() takes the vnode lock, but it increases
v_numoutput before grabbing the lock. I guess it should also hold the lock
when calling nestiobuf_setup().
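
For comparison, here is the same kind of user-space pthreads model with
the ordering asked about above: the writer takes the (model) vnode lock
first and only then bumps the counter. This is a sketch under that
assumption, not a patch for vnd(4); model_vnd_write(), model_fsync() and
model_lock_first() are hypothetical names, and completion is simulated
inline instead of coming from a real I/O path.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t vnlock = PTHREAD_MUTEX_INITIALIZER;	/* vnode lock */
static pthread_mutex_t interlock = PTHREAD_MUTEX_INITIALIZER;	/* interlock */
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;		/* v_cv */
static int numoutput;						/* v_numoutput */

/* Writer: the counter is only raised while the vnode lock is held. */
static void *
model_vnd_write(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&vnlock);
	pthread_mutex_lock(&interlock);
	numoutput++;
	pthread_mutex_unlock(&interlock);
	/* Block mapping and queuing of the I/O would happen here. */
	pthread_mutex_unlock(&vnlock);

	/* Simulated I/O completion: needs the interlock only. */
	pthread_mutex_lock(&interlock);
	numoutput--;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&interlock);
	return NULL;
}

/* Syncer: holds the vnode lock while waiting for the counter to drain. */
static void *
model_fsync(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&vnlock);
	pthread_mutex_lock(&interlock);
	while (numoutput != 0)
		pthread_cond_wait(&cv, &interlock);
	pthread_mutex_unlock(&interlock);
	pthread_mutex_unlock(&vnlock);
	return NULL;
}

/* Run both sides concurrently; returns the final counter value. */
int
model_lock_first(void)
{
	pthread_t writer, syncer;
	int rc;

	rc = pthread_create(&writer, NULL, model_vnd_write, NULL);
	assert(rc == 0);
	rc = pthread_create(&syncer, NULL, model_fsync, NULL);
	assert(rc == 0);
	pthread_join(writer, NULL);
	pthread_join(syncer, NULL);
	return numoutput;
}
```

In this model the syncer can never see a raised counter whose I/O is
still waiting for the vnode lock, so both threads finish whichever one
runs first. Whether the real code can follow this ordering is exactly
the open question.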

Any ideas or comments?

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

Follow-Ups:
- Re: vnode lock and v_numoutput
  - From: Taylor R Campbell
