Subject: Re: 3.0.1: softdep + ffsv2 + 'heavy' load = pauses
To: Mark Cullen <mark.r.cullen@gmail.com>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 07/22/2006 16:57:55
On Sat, Jul 22, 2006 at 08:03:22PM +0100, Mark Cullen wrote:
>
> >>I have disabled softdep on all mounts now and I am giving 
> >>BUFQ_PRIOCSCAN a try. Hopefully it won't hang this time, but I 
> >>shalln't be able to test until tonight.

Using the word "hang" here is really misleading.  Your machine is not
hung; rather, it has built up such a huge backlog of I/O that any new
I/O operations you try to do take a very, very long time to complete.

> >I'll try turning off softdeps and see if the problem goes away.
> >
> 
> I'm fairly sure softdep is the cause of this problem. My system now 
> feels far, far more responsive without softdep on any of the mounts, 
> while doing cvs updates and such. There's no annoying 10 second pauses 
> when trying to run things or anything!

Here is what is going on, and you're correct that it's a problem with
soft dependencies:

1) With softdep, the filesystem will accept new I/O -- particularly
   metadata operations that often require disk seeks, or at least
   separate disk transactions -- until the cache is full.  This can mean
   the enqueueing of tens of thousands of disk transactions in a single
   second.

2) Your disk can only dequeue -- that is, complete -- at most a few
   thousand operations per second.

3) So, if you do something like untar a pkgsrc tree (or rm -rf one) with
   softdep turned on, you can be sure that your disk has a backlog of
   at least several seconds of I/O waiting to complete.

4) The softdep implementation of "trickle sync" has the extremely nasty
   property of trying to flush all metadata I/O at the same time, every
   15 or 30 seconds (depending on whether it is directory or other
   metadata I/O).  This is a bug, but one that would require replacing
   the softdep smooth sync code to fix.

5) So, 15-30 seconds after doing some operation that generates tens of
   thousands of disk transactions per second for a second or two, you
   will find all that I/O trying to flush to the disk at once.  The
   result is painful, but obvious: if you have to _read_ data from the
   disk to do some operation -- e.g. to launch a new executable -- you
   will have to wait until the many seconds of queued I/O completes
   (see the rough numbers sketched below).
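
To put rough numbers on this -- the figures below are illustrative
assumptions, not measurements from your system:

    metadata ops queued by one untar / rm -rf:  ~30,000
    ops the disk can actually complete per sec: ~1,000
    time needed to drain the backlog:           30,000 / 1,000 = ~30 s

So when the trickle-sync flush fires 15-30 seconds later, the disk spends
tens of seconds doing nothing but writing back that backlog, and any read
you issue in the meantime sits in the queue behind it.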

There are a few approaches you can take to make this problem less annoying.
You can get a disk I/O subsystem with a large write-back cache and the
ability to enqueue multiple read operations at once, so that it can bypass
reads around pending writes.  These are often expensive (think hardware
RAID controller with a battery-backed cache).  Or you can tweak DIRDELAY
and METADELAY in the softdep code to make the potential backlog smaller
and flush it to disk much more often.  Or you can limit the metadata cache
to a smaller size -- this is probably the worst approach.
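
If you want to try the DIRDELAY/METADELAY route, the sketch below shows
the kind of change meant.  The file location, identifier names and values
are assumptions on my part -- verify them against your own source tree
before editing -- but the idea is just to shrink the two flush intervals
so each periodic flush has less backlog to push out:

    /*
     * Sketch only: in sys/ufs/ffs/ffs_softdep.c (path and exact
     * identifiers assumed here, check your tree), shorten the delays
     * so metadata is flushed in smaller, more frequent batches.
     */
    #define DIRDELAY	5	/* secs to hold back directory metadata */
    #define METADELAY	4	/* secs to hold back other metadata */

Rebuilding with smaller values trades a little more steady-state disk
activity for much shorter stalls; the exact numbers are a matter of taste.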

Or, if you're using -current, you could consider using LFS, which is
finally pretty stable and which doesn't suffer from the whole problem
described above.

Thor