Subject: Re: fsync performance hit on 1.6.1
To: Chuck Silvers <chuq@chuq.com>
From: Daniel Brewer <danielb@cat.co.za>
List: tech-kern
Date: 07/07/2003 09:45:23
For our application, we have one process capturing images from a frame-grabber
and dumping them to a file in a memory file system. It sends a UDP message
to another process, which reads the images from this file and writes them to
disk in a database. The first process has to guarantee that, after it has
finished writing an image to the memory file system, the image is actually
there for the second process to read. The problem is slightly more complicated
in that we have systems out in the field running 1.5.1 which we still need to
support.
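
For context, the hand-off between the two processes is just a datagram naming
the image slot. Here is a minimal sketch; the socket setup, port and message
format are invented for illustration, not our real protocol:

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Notify the reader process that image slot `slot' in the memory
 * file system is complete.  The message format is a placeholder.
 */
void
notify_reader(int sock, const struct sockaddr_in *reader, int slot)
{
    char msg[32];

    snprintf(msg, sizeof(msg), "IMG %d", slot);
    if (sendto(sock, msg, strlen(msg), 0,
        (const struct sockaddr *)reader, sizeof(*reader)) == -1)
        err(1, "sendto");
}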

I've tried opening the file with O_SYNC and not doing the fsync(). On 1.5.1,
this increased the write() time slightly, which was expected, but the
overall processor utilization hasn't changed much. On 1.6.1, it has the
same impact as just removing the fsync(), which I expected because of the
MNT_SYNCHRONOUS flag in the mfs mount in 1.6.1. So far, I haven't had any
problems with this approach. Will it guarantee that whatever the
first process writes is there for the second process to read?
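
In case it helps, the change amounts to something like this minimal sketch
(the path and argument names are placeholders, not our real code):

#include <err.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * O_SYNC variant: the file is opened with O_SYNC so each write()
 * completes synchronously, and the explicit fsync() is dropped.
 */
void
write_image(const char *path, off_t offset, const void *buf, size_t len)
{
    int fd;

    if ((fd = open(path, O_WRONLY | O_SYNC)) == -1)
        err(1, "open %s", path);
    if (lseek(fd, offset, SEEK_SET) == -1)
        err(1, "lseek");
    if (write(fd, buf, len) == -1)
        err(1, "write");
    /* no fsync(fd): O_SYNC already made the write synchronous */
    close(fd);
}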

----- Original Message -----
From: "Chuck Silvers" <chuq@chuq.com>
To: "Daniel Brewer" <danielb@cat.co.za>
Cc: <tech-kern@netbsd.org>
Sent: Sunday, July 06, 2003 6:00 PM
Subject: Re: fsync performance hit on 1.6.1


> hi,
>
> before 1.6, data from write() was cached in two separate lists,
> one for clean data and one for dirty data, so if there were many
> clean buffers and a few dirty ones, it was very fast for fsync()
> to find the few dirty buffers to write them out.  in 1.6 and later,
> the data is cached in a single list (of VM pages this time instead
> of buffers) and so fsync() has to do a lot more work to find the
> small number of dirty pages amidst all the clean pages.
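
As a toy illustration of the cost difference (the structs below are invented
for illustration, not the actual kernel code): with one mixed list, fsync()
has to visit every cached page, while with a separate dirty list it only
walks the dirty ones.

#include <stdbool.h>
#include <sys/queue.h>

struct page {
    TAILQ_ENTRY(page) entries;
    bool dirty;
};
TAILQ_HEAD(pagelist, page);

/* placeholder for the actual write-back of one page */
static void
write_page(struct page *pg)
{
    (void)pg;
}

/* 1.6 and later: one mixed list, so fsync() scans every cached page. */
void
sync_mixed(struct pagelist *all)
{
    struct page *pg;

    TAILQ_FOREACH(pg, all, entries)     /* O(total pages) */
        if (pg->dirty)
            write_page(pg);
}

/* pre-1.6: dirty buffers kept apart, so fsync() only walks those. */
void
sync_dirty(struct pagelist *dirty)
{
    struct page *pg;

    TAILQ_FOREACH(pg, dirty, entries)   /* O(dirty pages) */
        write_page(pg);
}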
>
> it would probably be worthwhile to split up the pages for a file
> into two different lists like the buffers were before.
> that will be a fair bit of work, though, so it's not likely that
> anyone will do it soon.
>
> to work around this problem for now, if you don't continue to access
> the data you're writing to the file you could try using msync() with
> MS_INVALIDATE instead of fsync() to write the data to disk.  that will
> eliminate all (or at least most) of the clean pages in memory so that
> the kernel won't need to keep scanning them all the time.  if you do
> continue to access the data after you've written it, then I don't
> have any good ideas.
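
A minimal sketch of that msync() workaround, assuming the image slot can be
mmap()ed at a page-aligned offset (fd, img_offset and img_size are
placeholders):

#include <err.h>
#include <sys/mman.h>
#include <sys/types.h>

void
flush_image(int fd, off_t img_offset, size_t img_size)
{
    void *p;

    /* img_offset is assumed page-aligned, as mmap() requires */
    p = mmap(NULL, img_size, PROT_READ, MAP_SHARED, fd, img_offset);
    if (p == MAP_FAILED)
        err(1, "mmap");
    /* write the dirty pages out and drop the now-clean copies */
    if (msync(p, img_size, MS_SYNC | MS_INVALIDATE) == -1)
        err(1, "msync");
    munmap(p, img_size);
}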
>
> -Chuck
>
>
> On Wed, Jul 02, 2003 at 02:18:15PM +0200, Daniel Brewer wrote:
> > Hi
> >
> > When comparing our software's CPU utilization running on 1.5.1 & on
> > 1.6.1, I noticed an inexplicably high usage on 1.6.1. After digging
> > deeper with gprof, I found that an fsync on 1.6.1 takes significantly
> > longer than on 1.5.1. Our software writes captured images into a ring
> > buffer in a memory file-system, so other servers can retrieve them.
> > Could someone explain why fsync on 1.6.1 is so significantly slower
> > than on 1.5.1? Is there a work-around? Or perhaps a completely
> > different way of doing this?
> >
> > The process involved after capture is as follows:
> >
> > flock(fdPostBuf, LOCK_EX)
> > lseek(fdPostBuf, img_offset, SEEK_SET)
> > write(fdPostBuf, img_buf, img_size)
> > fsync(fdPostBuf)
> > flock(fdPostBuf, LOCK_UN)
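
For reference, the same sequence as self-contained C with error checking
(fdPostBuf, img_offset, img_buf and img_size come from the capture code):

#include <err.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

void
post_image(int fdPostBuf, off_t img_offset, const void *img_buf,
    size_t img_size)
{
    if (flock(fdPostBuf, LOCK_EX) == -1)
        err(1, "flock");
    if (lseek(fdPostBuf, img_offset, SEEK_SET) == -1)
        err(1, "lseek");
    if (write(fdPostBuf, img_buf, img_size) == -1)
        err(1, "write");
    if (fsync(fdPostBuf) == -1)         /* the expensive call on 1.6.1 */
        err(1, "fsync");
    if (flock(fdPostBuf, LOCK_UN) == -1)
        err(1, "flock");
}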
> >
> > Both the 1.5.1 and 1.6.1 systems are running the same processor
> > (1.2GHz Celeron), both have the same motherboard and 128MB RAM, and
> > both are running Western Digital 20GB drives. Running the 1.6.1 box
> > on a 2GHz Celeron with 256MB RAM gives comparable results to the
> > 1.2GHz Celeron with 128MB RAM.
> >
> > Here's an extract from the output of gprof on 1.5.1:
> >   %   cumulative   self              self     total
> >  time   seconds   seconds    calls  ms/call  ms/call  name
> >   1.95     43.18     0.93    16364     0.06     0.06  write
> >   0.36     45.99     0.17     8025     0.02     0.02  fsync
> >   0.29     46.56     0.14    16052     0.01     0.01  flock
> >   0.04     47.49     0.02     8025     0.00     0.27  WritePostEventDisk(int, int, int, int)
> >
> > Here's the same extract from the output of gprof on 1.6.1:
> >   %   cumulative   self              self     total
> >  time   seconds   seconds    calls  ms/call  ms/call  name
> >  60.10     23.15    23.15     8067     2.87     2.87  fsync
> >   5.87     34.52     2.26    16337     0.14     0.14  write
> >   0.42     36.76     0.16    16136     0.01     0.01  flock
> >   0.16     37.76     0.06     8067     0.01     3.30  WritePostEventDisk(int, int, int, int)
> >
> > Thanks
> > -Daniel
>