Subject: v_interlock/splbio protocol violations
To: None <tech-kern@netbsd.org>
From: Darrin B. Jewell <dbj@netbsd.org>
List: tech-kern
Date: 03/03/2004 16:12:35
The following list of rules stems from my understanding of the
v_interlock/splbio protocol:

 1. biodone() is frequently invoked from an IPL_BIO spl(9) level
    interrupt context, typically from inside a disk device driver.
 2. Therefore, data accessed from inside biodone() must be protected
    by splbio(9).
 3. Every use of a simple_lock(9) that is also taken from biodone()
    must therefore be made at splbio(9) (a sketch of this discipline
    follows the list).
 4. Care is taken to avoid taking struct vnode's v_interlock
    from inside biodone(), so that v_interlock does not have to
    be taken at splbio(9) everywhere else.
 5. Therefore, fields protected by v_interlock must not be
    accessed from inside biodone().
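
As a concrete illustration of rules 2 and 3, the discipline looks
roughly like this (the lock and counter here are made up, not real
kernel data):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>

struct simplelock example_slock = SIMPLELOCK_INITIALIZER;
int example_count;

/* Interrupt side: called from biodone(), so already at IPL_BIO. */
void
example_intr_side(void)
{
        simple_lock(&example_slock);
        example_count++;
        simple_unlock(&example_slock);
}

/* Thread side: must block out biodone() before taking the lock. */
void
example_thread_side(void)
{
        int s;

        s = splbio();           /* rule 3: raise the IPL first */
        simple_lock(&example_slock);
        example_count--;
        simple_unlock(&example_slock);
        splx(s);
}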

If you see a misunderstanding represented in the above list, please
correct me.

Unfortunately, these rules are violated in several places, some new,
some old (a sketch of the resulting hazard follows this list):
  - For writes, biodone calls vwakeup, which accesses v_flag.
  - Asynchronous buffer i/o without a callback invokes brelse from
    biodone.  This path is used for many filesystem metadata writes.
    On an i/o error, brelse may call brelvp and reassignbuf.
    - brelvp and reassignbuf access v_uobj.memq and v_flag.
    - brelvp calls holdrele:
       - holdrele was recently modified to take v_interlock,
         which violates rule 4.
       - holdrele also takes vnode_free_list_slock; other
         uses of this simple_lock are not at splbio.
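
To spell out the hazard: once biodone() (via holdrele()) takes
v_interlock, any thread-context code holding v_interlock without
splbio can deadlock against the interrupt.  A hypothetical
illustration, not real kernel code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/vnode.h>

void
example_thread_side(struct vnode *vp)
{
        /* Legal under rule 4: v_interlock without splbio. */
        simple_lock(&vp->v_interlock);

        /*
         * If a disk interrupt arrives here, biodone() runs at IPL_BIO
         * on top of this code and, through the brelse() -> brelvp()
         * -> holdrele() path above, tries to take the same
         * v_interlock.  The lock is already held by the code it
         * interrupted, so the CPU spins (or LOCKDEBUG panics).
         */

        simple_unlock(&vp->v_interlock);
}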

Additionally, there are several vnode fields which are currently
protected only by splbio and not by any simple_lock.  These include
v_dirtyblkhd, v_cleanblkhd and v_synclist.
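
The usual idiom for those fields is roughly the following, simplified
in the style of code like vinvalbuf() (the walker function itself is
hypothetical):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <sys/vnode.h>

void
example_walk_dirty(struct vnode *vp)
{
        struct buf *bp;
        int s;

        s = splbio();           /* keep biodone() off the list */
        LIST_FOREACH(bp, &vp->v_dirtyblkhd, b_vnbufs) {
                /* examine bp, or queue it for i/o */
        }
        splx(s);
}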

I'm looking at ways around these problems.  Some possibilities
include:
  1. Always take splbio before taking v_interlock.
     - Impractical, and a huge impact on other code.
  2. Split v_interlock into two separate interlocks, one to
     protect the state used inside biodone and one for everything
     else.
  3. Move some or most of the work done by biodone into a separate
     helper thread context, similar to aiodoned, which can safely
     take the v_interlock (sketched below).
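
Roughly, possibility #3 might look something like this.  The names
(biodone_defer, biodoned_thread, biodone_finish) are made up for
illustration, and reusing the b_freelist linkage for the deferred
queue is an assumption borrowed from how the aiodoned done list
queues buffers:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/buf.h>

static struct simplelock biodone_queue_slock = SIMPLELOCK_INITIALIZER;
static TAILQ_HEAD(, buf) biodone_queue =
    TAILQ_HEAD_INITIALIZER(biodone_queue);

/* Hypothetical: the brelse()/vwakeup() work deferred out of biodone(). */
void biodone_finish(struct buf *);

/* Called from biodone() at IPL_BIO: just queue the buffer and wake up. */
void
biodone_defer(struct buf *bp)
{
        simple_lock(&biodone_queue_slock);
        TAILQ_INSERT_TAIL(&biodone_queue, bp, b_freelist);
        simple_unlock(&biodone_queue_slock);
        wakeup(&biodone_queue);
}

/* Helper thread, similar in spirit to aiodoned. */
void
biodoned_thread(void *arg)
{
        struct buf *bp;
        int s;

        for (;;) {
                s = splbio();
                simple_lock(&biodone_queue_slock);
                while ((bp = TAILQ_FIRST(&biodone_queue)) == NULL)
                        ltsleep(&biodone_queue, PRIBIO, "biodoned", 0,
                            &biodone_queue_slock);
                TAILQ_REMOVE(&biodone_queue, bp, b_freelist);
                simple_unlock(&biodone_queue_slock);
                splx(s);

                /*
                 * Thread context at base IPL: taking v_interlock for
                 * brelvp()/holdrele()/reassignbuf() is now safe.
                 */
                biodone_finish(bp);
        }
}

The thread itself would be started at boot with kthread_create(9),
the same way aiodoned is.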

Right now, I'm favoring possibility #3, but I'm still examining the
problem.  I'm also considering putting v_dirtyblkhd and v_cleanblkhd
under the protection of the v_interlock and adding a new interlock
to protect syncer_workitem_pending/v_synclist.

So some background is in order.  Why did I notice this and why do I care?

I've recently been working on the problems described in PR
kern/24596 after independently observing the same large backlogs of
unflushed pages.  My current fix is to add a new trickle sync page
daemon, which walks a list of dirty pages (similar to the active and
inactive lists) and flushes them out to disk slowly over a
timestamp-based window.  This is working fine and does in fact solve
the problem discussed in that PR, as well as giving a smoother
trickle sync than can be had by flushing an entire file at a time.
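
A rough sketch of the idea (all of the names here are purely
illustrative; this is not the actual patch):

#include <sys/param.h>
#include <sys/queue.h>

struct vm_page;

/*
 * Hypothetical bookkeeping; the real version hangs the dirty list
 * off the existing page structures instead.
 */
struct dirty_page {
        TAILQ_ENTRY(dirty_page) dp_list;
        time_t dp_dirtied;              /* when the page became dirty */
        struct vm_page *dp_pg;
};

static TAILQ_HEAD(, dirty_page) page_dirty_list =
    TAILQ_HEAD_INITIALIZER(page_dirty_list);
static int trickle_window = 30;         /* seconds a page may stay dirty */
static int trickle_batch = 16;          /* pages flushed per pass */

void trickle_flush_page(struct vm_page *);      /* hypothetical pager call */

/* One pass of the trickle sync daemon, run several times a second. */
void
trickle_sync_pass(time_t now)
{
        struct dirty_page *dp;
        int flushed = 0;

        /* Oldest dirty pages sit at the head of the list. */
        while ((dp = TAILQ_FIRST(&page_dirty_list)) != NULL &&
            flushed < trickle_batch) {
                if (now - dp->dp_dirtied < trickle_window)
                        break;          /* the rest are even newer */
                TAILQ_REMOVE(&page_dirty_list, dp, dp_list);
                trickle_flush_page(dp->dp_pg);
                flushed++;
        }
}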

However, this does not provide any trickle sync support for buffers,
so I started looking into moving the list of dirty pages into the
object page queues and having the trickle syncer walk the dirty
vnodes and flush both buffers and pages slowly out to disk.  Since
the trickle syncer needs to flush a few pages and buffers from many
dirty vnodes several times a second, I would like to be able to flush
buffers from a vnode without having to take the vnode's v_lock, which
has a large overhead.  Thus, I started looking closely at the locking
protocols surrounding vnodes and buffer management and noticed the
problems discussed above.  Hopefully, I'll get this sorted out and
improve our current trickle syncer, which suboptimally flushes entire
files after a static delay interval.
But for now, one thing at a time.

Darrin