
Re: CVS commit: [simonb-wapbl] src/sys/kern



[follow-ups to tech-kern; BCC'd source-changes on this message.]

On Jul 28, 2008, at 7:02 PM, Greg Oster wrote:

I'm still not sure exactly what's going on, but here are some notes
on what I believe to be happening:

1) General writes are done as RW_READERs.

2) Log flushing (and fsyncs) are done as RW_WRITERs.

3) When the RW_WRITER finishes, all waiting RW_READERs are signaled to
"go", so there is a 'thundering herd' after a log flush.

4) With a number of meta-data-generating processes (e.g. 10x tar
-xf's) a huge amount of meta-data can be generated in a short amount
of time.

5) By adding instrumentation to
 src/sys/miscfs/syncfs/sync_subr.c:vn_syncer_add_to_worklist(),
I've seen some 85000+ items get queued in under 10 seconds, such
that by the time the loop in sched_sync() gets around to handling
the first major onslaught, there are already some 15000-20000 items
waiting on the next queue.  And it just gets worse after that.

6) Each of these queue items needs to be handled with VOP_FSYNC().

7) I believe that's going to end up calling ffs_fsync(), which is
going to call wapbl_flush(..,0).

8) Now wapbl_flush() is going to need to take the lock as RW_WRITER.
So perhaps it has to wait for all of the RW_READERs to finish before
it can proceed, and that's the delay we see?  Or perhaps it has
something to do with the fact that when that *one flush* gets done,
rw_exit() turns all the READERs loose again?  (A small userland
sketch of this reader/writer pattern follows.)
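
To make the pattern in items 1-3 and 7-8 concrete, here is a small
userland analogy (a hedged sketch of my own, not WAPBL or kern_rwlock
code; every name in it is made up): ten threads model the
metadata-generating processes by taking a pthread_rwlock as readers,
and one thread models the journal flush by periodically taking it as
a writer.  How long the "flush" stalls, and what happens when it
releases the lock, depends entirely on whether the rwlock
implementation prefers readers or writers, which is exactly the
question discussed below.

/*
 * Toy analogy only -- NOT WAPBL or kern_rwlock code; all names are
 * made up for illustration.  Ten threads model metadata-generating
 * processes by holding a rwlock as readers; one thread models the
 * journal flush by taking it as a writer once a second.
 *
 * Build: cc -o rwdemo rwdemo.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t journal_lock = PTHREAD_RWLOCK_INITIALIZER;
static volatile int stop;

/* Models one metadata-generating process: take the lock as a reader. */
static void *
metadata_writer(void *arg)
{

	(void)arg;
	while (!stop) {
		pthread_rwlock_rdlock(&journal_lock);
		usleep(1000);		/* "generate some metadata" */
		pthread_rwlock_unlock(&journal_lock);
	}
	return NULL;
}

/* Models wapbl_flush(): take the lock exclusively, once a second. */
static void *
flusher(void *arg)
{

	(void)arg;
	while (!stop) {
		sleep(1);
		pthread_rwlock_wrlock(&journal_lock); /* stalls behind readers */
		printf("flush got the writer lock\n");
		pthread_rwlock_unlock(&journal_lock); /* readers herd back in */
	}
	return NULL;
}

int
main(void)
{
	pthread_t readers[10], f;
	int i;

	for (i = 0; i < 10; i++)
		pthread_create(&readers[i], NULL, metadata_writer, NULL);
	pthread_create(&f, NULL, flusher, NULL);

	sleep(10);
	stop = 1;
	pthread_join(f, NULL);
	for (i = 0; i < 10; i++)
		pthread_join(readers[i], NULL);
	return 0;
}

With a steady stream of readers and a reader-preferring lock, the
"flush" thread in this toy can be held off more or less indefinitely;
with writer preference it gets in as soon as the current readers
drain.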

Hmm.

         * If we are releasing a write lock, then prefer to wake all
         * outstanding readers.  Otherwise, wake one writer if there
         * are outstanding readers, or all writers if there are no
         * pending readers.  If waking one specific writer, the writer
         * is handed the lock here.  If waking multiple writers, we
         * set WRITE_WANTED to block out new readers, and let them
         * do the work of acquiring the lock in rw_vector_enter().

So, my guess is that there is writer starvation, which is exactly why pthread_rwlock *ALWAYS* prefers writers (per SUSv3).  However, it seems as though kern_rwlock will sometimes prefer readers.  When you always acquire as a writer (as in Greg's item 10), you fall into the "wake all writers" case, because there are never any outstanding readers.  You still get a thundering herd, but scheduler luck lets the journal flush through.

It might be worth tweaking rw_vector_exit() to always prefer writers, as an experiment. Maybe something like this:

        if (wcnt != 0) {
                RW_DASSERT(rw, (owner & RW_WRITE_WANTED) != 0);

                /* Direct hand-off to the longest-waiting writer. */
                l = TS_FIRST(ts, TS_WRITER_Q);
                new = (uintptr_t)l | RW_WRITE_LOCKED;

                if (wcnt > 1)
                        new |= RW_HAS_WAITERS | RW_WRITE_WANTED;
                else if (rcnt != 0)
                        new |= RW_HAS_WAITERS;
                rw_swap(rw, owner, new);
                turnstile_wakeup(ts, TS_WRITER_Q, 1, l);
        } else {
                RW_DASSERT(rw, rcnt != 0);

                /* Direct hand-off to all blocked readers. */
                new = rcnt << RW_READ_COUNT_SHIFT;
                rw_swap(rw, owner, new);
                turnstile_wakeup(ts, TS_READER_Q, rcnt, NULL);
        }


Andy, why was the current sometimes-prefer-readers behavior implemented?


Oh, BTW, I think I see a bug in the current rw_vector_exit():

                        /* Give the lock to the longest waiting writer. */
                        l = TS_FIRST(ts, TS_WRITER_Q);
                        new = (uintptr_t)l | RW_WRITE_LOCKED | RW_HAS_WAITERS;
                        if (wcnt != 0)
                                new |= RW_WRITE_WANTED;

                        ^^^^^
                        should be "if (wcnt > 1)"

I think the current test will always set RW_WRITE_WANTED, which is not what you want if you're waking the last writer. Andy, do you agree?
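
For reference, with that change applied the hand-off would read
(same snippet as above, sketching only the suggested fix):

                        /* Give the lock to the longest waiting writer. */
                        l = TS_FIRST(ts, TS_WRITER_Q);
                        new = (uintptr_t)l | RW_WRITE_LOCKED | RW_HAS_WAITERS;
                        if (wcnt > 1)           /* more writers remain queued */
                                new |= RW_WRITE_WANTED;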

9) At some point the system runs out of free memory, so the sync
really needs to get done so memory can be reclaimed... but with some
100,000 items pending on the syncer_workitem_pending[] queues, the
system faces a real uphill battle.

10) By forcing the lock to be RW_WRITER for all IO, it seems that
sched_sync() can keep up with everything, and the queues never get
"silly" large.

11) It really feels like there is a lack of backpressure somewhere,
and that WAPBL is just allowed to create files with wild abandon.
(A toy sketch of what such backpressure could look like follows.)
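
For what it's worth, the general shape of "backpressure" here would be
to make the producers wait once the queue is too deep, rather than
letting it grow without bound.  Purely as an illustration of that idea
(this is a toy with made-up names; it is not the syncer or WAPBL code,
and a real fix would have to live wherever the worklist is actually
fed), something like:

/*
 * Toy illustration of backpressure -- not syncer or WAPBL code; all
 * names are made up.  Producers (think "processes dirtying metadata")
 * are made to wait once the work queue (think the
 * syncer_workitem_pending[] buckets) already holds QUEUE_LIMIT items,
 * instead of being allowed to grow it without bound.
 *
 * Build: cc -o backpressure backpressure.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

#define	QUEUE_LIMIT	1000

static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_not_full = PTHREAD_COND_INITIALIZER;
static int q_len;			/* items waiting for the "syncer" */

/* Producer side: blocks the caller while the queue is full. */
static void
enqueue_work(void)
{

	pthread_mutex_lock(&q_mtx);
	while (q_len >= QUEUE_LIMIT)		/* <-- the backpressure */
		pthread_cond_wait(&q_not_full, &q_mtx);
	q_len++;
	pthread_mutex_unlock(&q_mtx);
}

/* Consumer side: called after the "syncer" retires one item. */
static void
work_done(void)
{

	pthread_mutex_lock(&q_mtx);
	q_len--;
	pthread_cond_signal(&q_not_full);	/* release one waiting producer */
	pthread_mutex_unlock(&q_mtx);
}

int
main(void)
{
	int i;

	/* Single-threaded smoke test of the two primitives. */
	for (i = 0; i < 10; i++)
		enqueue_work();
	for (i = 0; i < 10; i++)
		work_done();
	printf("queue length after smoke test: %d\n", q_len);
	return 0;
}

In the toy, any thread calling enqueue_work() simply sleeps on the
condition variable once QUEUE_LIMIT items are outstanding, and is
released one at a time as the consumer retires work.  Whatever the
real mechanism ends up being, that is the effect item 11 seems to be
asking for.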

Thoughts/comments/solutions are welcome.

Later...

Greg Oster


-- thorpej


