Source-Changes archive


Re: CVS commit: [simonb-wapbl] src/sys/kern



Jason Thorpe writes:
> 
> On Jul 27, 2008, at 9:04 PM, Greg Oster wrote:
> 
> >
> > Module Name:        src
> > Committed By:       oster
> > Date:               Mon Jul 28 04:04:32 UTC 2008
> >
> > Modified Files:
> >     src/sys/kern [simonb-wapbl]: vfs_wapbl.c
> >
> > Log Message:
> > Turn on WAPBL_DEBUG_SERIALIZE in order to use RW_WRITER locks instead
> > of RW_READER locks in wapbl_begin().  Include the following comment as
> > well:
> >
> >     XXX: The original code calls for the use of a RW_READER lock
> >     here, but it turns out there are performance issues with high
> >     metadata-rate workloads (e.g. multiple simultaneous tar
> >     extractions).  For now, we force the lock to be RW_WRITER,
> >     since that currently has the best performance characteristics
> >     (even for a single tar-file extraction).
> 
> Uh, scary.  Has anyone done any analysis of why this is so?

I'm still not sure exactly what's going on, but here are some notes
on what I believe to be happening:

 1) General writes are done as RW_READERs.

 2) Log flushing (and fsyncs) are done as RW_WRITERs.

 3) When the RW_WRITER finishes, all waiting RW_READERs are signaled
to "go", so there is a 'thundering herd' after every log flush.  (A
rough sketch of this locking pattern is included after the list
below.)

 4) With a number of meta-data-generating processes (e.g. 10x tar
-xf's) a huge amount of meta-data can be generated in a short amount
of time. 

 5) By adding instrumentation to
  src/sys/miscfs/syncfs/sync_subr.c:vn_syncer_add_to_worklist()
(along the lines of the counter sketch below), I've seen some 85000+
items get queued in under 10 seconds, such that by the time the loop
in sched_sync() gets around to handling the first major onslaught,
there are already some 15000-20000 items waiting on the next queue.
And it just gets worse after that.

 6) Each of these queue items needs to be handled with VOP_FSYNC().

 7) I believe that's going to end up calling ffs_fsync(), which is
going to call wapbl_flush(..,0).  

 8) Now wapbl_flush() is going to need to take the lock as RW_WRITER.
So perhaps it has to wait for the RW_READERs to finish before it can
proceed, and that's what we see as a delay?  Or perhaps it has
something to do with the fact that when that *one flush* completes,
rw_exit() turns all the READERs loose again?  (The call path in
(6)-(8) is sketched after the list.)

 9) At some point the system runs out of free memory, so the sync
really needs to get done so memory can be reclaimed... but with some
100,000 items pending on the syncer_workitem_pending[] queues, the
system faces a real up-hill battle.

10) By forcing the lock to be RW_WRITER for all IO, it seems that
sched_sync() can keep up with everything, and the queues never get
"silly" large.

11) It really feels like there is a lack of backpressure somewhere,
and that with WAPBL in place, processes are just allowed to create
files with wild abandon.
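
To make the reader/writer interaction above concrete, here's a minimal
sketch of the locking pattern in question, using the rwlock(9) API.
This is *not* the actual vfs_wapbl.c code; the wl_rwlock variable and
the sketch_* function names are placeholders for illustration only:

/*
 * Sketch of the wapbl locking pattern under discussion.
 * rw_init(&wl_rwlock) would have been done at mount time.
 */
#include <sys/rwlock.h>

static krwlock_t wl_rwlock;		/* hypothetical per-log lock */

/* Transaction begin: normally a shared (reader) acquisition. */
static void
sketch_wapbl_begin(void)
{
#ifdef WAPBL_DEBUG_SERIALIZE
	rw_enter(&wl_rwlock, RW_WRITER);  /* the change: serialize everyone */
#else
	rw_enter(&wl_rwlock, RW_READER);  /* original: transactions in parallel */
#endif
	/* ... reserve log space, start the transaction ... */
}

/* Transaction end: drop whichever hold wapbl_begin() took. */
static void
sketch_wapbl_end(void)
{
	rw_exit(&wl_rwlock);
}

/* Log flush (the fsync path): needs the lock exclusively. */
static void
sketch_wapbl_flush(void)
{
	rw_enter(&wl_rwlock, RW_WRITER);
	/* ... write the journal out to disk ... */
	rw_exit(&wl_rwlock);	/* wakes every blocked reader at once */
}

With RW_READER in wapbl_begin(), any number of transactions can hold
the lock at once and the flush has to wait for all of them to drain;
when the flush's rw_exit() happens, the whole herd of blocked readers
is released together.  With RW_WRITER everything is serialized, but
the flush never has to wait behind a crowd.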
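
The instrumentation mentioned in (5) doesn't need to be anything
fancier than a counter.  Here's a sketch of the sort of thing I mean
(the counter name is made up, and the real body of
vn_syncer_add_to_worklist() is elided):

/*
 * Hypothetical counter dropped into vn_syncer_add_to_worklist() to
 * see how quickly vnodes are being queued for the syncer.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/vnode.h>

static unsigned long syncer_add_count;

void
vn_syncer_add_to_worklist(struct vnode *vp, int delayx)
{
	if ((++syncer_add_count % 10000) == 0)
		printf("syncer: %lu adds, hardclock_ticks=%d (hz=%d)\n",
		    syncer_add_count, hardclock_ticks, hz);

	/* ... original code: insert vp on syncer_workitem_pending[] ... */
}

Something like that is enough to show items arriving far faster than
sched_sync() can drain them.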
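
And here's a rough paraphrase of the per-vnode path in (6)-(8), tying
it back to the lock in the first sketch.  The argument lists are from
memory, not copied from the tree, and error handling is omitted:

/*
 * Approximately what sched_sync() does for each item it pulls off
 * syncer_workitem_pending[].
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/proc.h>
#include <sys/vnode.h>

static void
sketch_sync_one(struct vnode *vp)
{
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
	(void)VOP_FSYNC(vp, curlwp->l_cred, FSYNC_LAZY, 0, 0);
	VOP_UNLOCK(vp, 0);

	/*
	 * For an FFS vnode on a logging filesystem, that VOP_FSYNC()
	 * lands in ffs_fsync(), which calls wapbl_flush(..., 0), and
	 * wapbl_flush() takes the lock as RW_WRITER (first sketch).
	 */
}

So every one of those ~100,000 queued items is potentially another
trip through the RW_WRITER/thundering-herd cycle.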

Thoughts/comments/solutions are welcome.  

Later...

Greg Oster



