Re: state or future of LFS?

To: Miles Nordin <carton%Ivy.NET@localhost>
Subject: Re: state or future of LFS?
From: Chuck Swiger <cswiger%mac.com@localhost>
Date: Mon, 13 Apr 2009 15:51:54 -0700

On Apr 13, 2009, at 12:21 PM, Miles Nordin wrote:

"cs" == Chuck Swiger <cswiger%mac.com@localhost> writes:

   cs> Yeah, well, relevant facts don't care about opinions: whether
   cs> you acknowledge them or wish to ignore them is up to you.

and kool-aid apple zealot ranting doesn't care about
decisively-demonstrated experience

I'm actually a tequila-drinking BSD fan who likes FreeBSD for general server purposes, NetBSD for embedded/flash-driven appliance roles, and OSX for a desktop. As for "ranting", I'll leave that for those better suited to it-- just as I'll abandon this thread as a waste of my time if you continue the ad hominem attacks.

   cs> That's absolutely right-- only, I mentioned the special API
   cs> (that's the F_FULLFSYNC bit you quoted above).

yes.  you mentioned it.  It's existence is the problem.  wtf does
fsync() itself do, then?  serve as a time-wasting decoy!  Enshrouded
in befuddling documentation to hide their benchmark-inflating lie.


Next time, when you don't know what something does, UTSL & RTFM:

"DESCRIPTION

Fsync() causes all modified data and attributes of fildes to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.

Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.

This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail."

Does that answer your question clearly enough? If you still find this documentation "befuddling", I suppose I might see whether I can file a radar with the appropriate parties to clarify, but IMO the above documentation is more complete than, say:


  http://netbsd.gw.com/cgi-bin/man-cgi?fsync+2+NetBSD-current

   cs> As far as I can tell (ie, from looking at the code), by
   cs> default OSX does synchronous updates to data and async updates
   cs> to filesystem metadata because it trusts the journaling
   cs> mechanism to keep the metadata consistent

nobody does synchronous updates to data.  You are saying it has no
write cache at all.  Of course there is a write cache.

My point is that MySQL calls fsync() on Mac OS X, and Mac OS X does
not synchronously update the data before returning.  period.  I do not
care about metadata, nor does MySQL.  Are you saying I'm wrong and
MySQL is wrong?


Evidently: yes.

I've seen plenty of these debates about what fsync() does on various platforms before, as well as the corresponding debate on the interaction with the write-cache of the drives, and the main difference between OSX and other platforms isn't whether fsync() returns before the data is written out to the drives, but whether the platform lets fsync() return without confirming whether the disk itself has flushed the data to the platters or not.

If write-caching is enabled and the disks are not mounted using -o sync or equivalent, *all* of the platforms I am familiar with let fsync() cheat in the sense that they do not wait until the disk cache itself has been flushed before returning.

What's interesting about OSX isn't that it does the same thing here that other platforms do, but that it has the optional capability to ensure transaction ordering and writes completing through the write-cache on a per-descriptor or system-wide basis using fsync(), without needing to disable write-caching entirely for other processes.
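
For illustration only, here is a minimal sketch of the two levels of flushing discussed above, written in Python rather than C for brevity. The helper names flush_to_drive() and flush_to_platters() are hypothetical. The sketch assumes that Python's fcntl module exposes F_FULLFSYNC only on platforms that define it, so the constant is looked up with getattr() and plain fsync() is used when it is missing or rejected.

```python
import fcntl
import os
import tempfile

# F_FULLFSYNC is only defined where the platform provides it (Mac OS X);
# elsewhere the attribute is absent and we fall back to plain fsync().
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def flush_to_drive(fd):
    """fsync(): push the file's dirty buffers from the host to the drive."""
    os.fsync(fd)
    return "fsync"


def flush_to_platters(fd):
    """Ask the drive to empty its own write cache, if the OS offers a way."""
    if F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems or devices reject the request; fall back.
            pass
    return flush_to_drive(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"commit record\n")
        f.flush()  # userspace buffer -> kernel buffers
        print(flush_to_platters(f.fileno()))
```

Only the descriptor passed in is affected, so other processes keep the benefit of the drive's write cache.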

   cs> But let's focus on just what other Unices do with fsync():

yes, this is pretty interesting.

But the point is, the MySQL guys observed corruption problems related
to broken fsync() on Mac OS X, only, not on other unixes.  I do not
doubt that other unixes have broken fsync() commands, especially
w.r.t. sending flush commands to drives with write caches.  Then there
are supposedly lying drives, broken iSCSI stacks, VM implementations
that discard ATA sync commands, all sorts of gremlins.  and
furthermore the fsync API is limited.  And all this is worth talking
about.

Fine. However, I've seen no evidence from you or from the MySQL link provided earlier which suggests that OSX is doing anything different with fsync() than the other platforms: without specifics, I would be willing to conclude that the data corruption being reported was either due to a bug with MySQL expecting Linux kernel semantics or with the drive re-ordering cached writes and had something go wrong (ie, OS bug or power yanked).

I suppose it's also possible that OSX itself has changed significantly since the MySQL issue you reported, but my recollection is that these fsync() changes and a related integration of lockf() vs. flock() handling were all made way back around 2000-2002 timeframe (ie, Mac OSX 10.0 / 10.1). If the issue is still present in 10.3 or later, it should be obvious and easy to reproduce with current OS versions, no...?

However none of that overshadows the FACT that all these broken
limited unixes still permit a mostly-functional MySQL implementation.
To achieve feature parity on Mac OS X, MySQL had to use this
proprietary made-up-bullshit Apple-only API.  so do not tell me
fsync() on Mac OS X is not broken w.r.t. other Unixes.  it is.  You
have to do ``simon-sez, fsync'' or else it does nothing.

You're either mistaken or are deliberately wrong. I've provided pointers to the exact place in the XNU source code where fsync(2) is implemented, and noted that it waits until the data blocks are written out by buf_flushdirtyblks() because MNT_WAIT is being set. Feel free to read the sources yourself.

The effect of the "proprietary made-up-bullshit Apple-only API" is that it provides tighter semantics for fsync() than what POSIX requires (not that that is hard), with the intention of enforcing write ordering and guaranteeing that the data really is written all the way to disk platters and not just to the write cache of the drive-- this is nice for databases which ought to require ACID semantics.
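
To make "enforcing write ordering" concrete, here is a hedged sketch of the pattern a database might use: write the payload, issue a barrier, and only then write the commit marker. The function append_in_order() and its barrier parameter are hypothetical names for this example. The barrier defaults to os.fsync(); on Mac OS X a caller would presumably pass a function that issues the F_FULLFSYNC fcntl instead.

```python
import os


def write_all(fd, data):
    """os.write() may write fewer bytes than asked; loop until done."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def append_in_order(path, payload, commit_marker, barrier=os.fsync):
    """Append payload, run the barrier, then append the commit marker."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        write_all(fd, payload)
        barrier(fd)  # payload should be stable before the marker exists
        write_all(fd, commit_marker)
        barrier(fd)  # marker should be stable before reporting success
    finally:
        os.close(fd)
```

If the barrier only reaches the drive's write cache, the drive remains free to put the marker on the platters before the payload, which is the re-ordering the man page warns about.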

And this is a clear win when you compare some benchmark package that doesn't know
about simon-sez on Apple yet---you'll never catch the lie because
people don't do cord-yank tests of their filebench runs.  It's a nasty
trick Apple pulled!

You're welcome to believe that if you like, but I can't see which facts are being used to support such a conclusion. Frankly, going by the MySQL benchmarks I've seen, the performance improvements by having gettimeofday() be handled by a fast commpage mechanism rather than via a system call makes so much more of a difference to MySQL that the performance difference with fsync() is down in the noise.

(To my mind, that says more about MySQL's code and the focus of Linux upon system-call micro-benchmarks than it proves about anything else, but YMMV....)

And all the stuff about metadata is a little bit silly because the
only metadata that's relevant to fsync is the mtime.  This
create/delete/rename stuff is mostly unrelated to fsync, other than a
few corner cases like creates and size-increases need to be pushed
(and, according to SQLite3, they generally are).  [ ... ]

You're ignoring edge cases which are vital to consider: for example, when you write enough data that the file needs to change representation (ie, from fitting in the direct inode list to indirect in FFS, or indirect to double-indirect, etc or similar equivalent with HFS+ B-trees), the file metadata has to also change and an indirect pointer block allocated for the write of the data contents to be visible.

In the case of FFS, these changes happen whenever you grow a file past 80KB in size (assuming 10 direct links per inode with 8K block size), which happens plenty often.
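
The 80KB figure follows from 10 direct blocks x 8192 bytes = 81920 bytes. As a rough sketch of that arithmetic, the hypothetical function below reports how many levels of indirection a file of a given size needs. The 4-byte block pointer size is an assumption added here (it is not stated above) and only affects where the double-indirect threshold falls.

```python
def indirection_level(size, block_size=8192, direct=10, ptr_size=4):
    """Return 0 for direct blocks, 1 for single indirect, 2 for double, ..."""
    blocks = -(-size // block_size)   # ceiling division: blocks needed
    ptrs = block_size // ptr_size     # pointers per indirect block (assumed)
    capacity = direct                 # blocks reachable at the current level
    span = 1
    level = 0
    while blocks > capacity:
        level += 1
        span *= ptrs                  # each extra level multiplies the reach
        capacity += span
    return level


if __name__ == "__main__":
    print(indirection_level(80 * 1024))      # exactly fills the direct blocks
    print(indirection_level(80 * 1024 + 1))  # one byte more
```

Under these assumptions, growing a file by a single byte past 81920 bytes forces an indirect block to be allocated, so the metadata update has to reach disk along with the data.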


Regards,
--
-Chuck
