NetBSD-Users archive


Re: state or future of LFS?



On Apr 13, 2009, at 12:21 PM, Miles Nordin wrote:
"cs" == Chuck Swiger <cswiger%mac.com@localhost> writes:
   cs> Yeah, well, relevant facts don't care about opinions: whether
   cs> you acknowledge them or wish to ignore them is up to you.

and kool-aid apple zealot ranting doesn't care about
decisively-demonstrated experience

I'm actually a tequila-drinking BSD fan who likes FreeBSD for general server purposes, NetBSD for embedded/flash-driven appliance roles, and OSX for a desktop. As for "ranting", I'll leave that for those better suited to it-- just as I'll abandon this thread as a waste of my time if you continue the ad hominem attacks.

   cs> That's absolutely right-- only, I mentioned the special API
   cs> (that's the F_FULLFSYNC bit you quoted above).

yes.  you mentioned it.  Its existence is the problem.  wtf does
fsync() itself do, then?  serve as a time-wasting decoy!  Enshrouded
in befuddling documentation to hide their benchmark-inflating lie.

Next time, when you don't know what something does, UTSL & RTFM:

"DESCRIPTION
Fsync() causes all modified data and attributes of fildes to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.

Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.

This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect.  Please see fcntl(2) for more detail."

Does that answer your question clearly enough? If you still find this documentation "befuddling", I suppose I might see whether I can file a radar with the appropriate parties to clarify, but IMO the above documentation is more complete than, say:

  http://netbsd.gw.com/cgi-bin/man-cgi?fsync+2+NetBSD-current

   cs> As far as I can tell (ie, from looking at the code), by
   cs> default OSX does synchronous updates to data and async updates
   cs> to filesystem metadata because it trusts the journaling
   cs> mechanism to keep the metadata consistent

nobody does synchronous updates to data.  You are saying it has no
write cache at all.  Of course there is a write cache.

My point is that MySQL calls fsync() on Mac OS X, and Mac OS X does
not synchronously update the data before returning.  period.  I do not
care about metadata, nor does MySQL.  Are you saying I'm wrong and
MySQL is wrong?

Evidently: yes.

I've seen plenty of these debates about what fsync() does on various platforms before, as well as the corresponding debate on the interaction with the write-cache of the drives, and the main difference between OSX and other platforms isn't whether fsync() returns before the data is written out to the drives, but whether the platform lets fsync() return without confirming whether the disk itself has flushed the data to the platters or not.

If write-caching is enabled and the disks are not mounted using -o sync or equivalent, *all* of the platforms I am familiar with let fsync() cheat in the sense that they do not wait until the disk cache itself has been flushed before returning.

What's interesting about OSX isn't that it does the same thing here that other platforms do, but that it has the optional capability to ensure transaction ordering and writes completing through the write-cache on a per-descriptor or system-wide basis using fsync(), without needing to disable write-caching entirely for other processes.
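
To make the per-descriptor option concrete, here is a minimal sketch in C. It is not code from XNU or MySQL, and the helper name flush_to_stable_storage() is made up for illustration. It assumes that F_FULLFSYNC is only defined by <fcntl.h> on Mac OS X, and that falling back to plain fsync() is acceptable when the fcntl fails (for example, on a filesystem that does not support it):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: push the data for fd as far towards stable
 * storage as the platform lets us.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int flush_to_stable_storage(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: also ask the drive to flush its own write cache. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* The fcntl failed (assumed: unsupported here); use plain fsync(). */
#endif
    /* Elsewhere: host-to-drive flush only; the drive cache may still hold data. */
    return fsync(fd);
}
```

A database would call this at its commit points only, so other processes on the same machine keep the benefit of the drive's write cache.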

   cs> But let's focus on just what other Unices do with fsync():

yes, this is pretty interesting.

But the point is, the MySQL guys observed corruption problems related
to broken fsync() on Mac OS X, only, not on other unixes.  I do not
doubt that other unixes have broken fsync() commands, especially
w.r.t. sending flush commands to drives with write caches.  Then there
are supposedly lying drives, broken iSCSI stacks, VM implementations
that discard ATA sync commands, all sorts of gremlins.  and
furthermore the fsync API is limited.  And all this is worth talking
about.

Fine. However, I've seen no evidence from you or from the MySQL link provided earlier which suggests that OSX is doing anything different with fsync() than the other platforms: without specifics, I would be willing to conclude that the data corruption being reported was due either to a bug with MySQL expecting Linux kernel semantics, or to the drive re-ordering cached writes when something went wrong (ie, OS bug or power yanked).

I suppose it's also possible that OSX itself has changed significantly since the MySQL issue you reported, but my recollection is that these fsync() changes and a related integration of lockf() vs. flock() handling were all made way back in the 2000-2002 timeframe (ie, MacOS X 10.0 / 10.1). If the issue is still present in 10.3 or later, it should be obvious and easy to reproduce with current OS versions, no...?

However none of that overshadows the FACT that all these broken
limited unixes still permit a mostly-functional MySQL implementation.
To achieve feature parity on Mac OS X, MySQL had to use this
proprietary made-up-bullshit Apple-only API.  so do not tell me
fsync() on Mac OS X is not broken w.r.t. other Unixes.  it is.  You
have to do ``simon-sez, fsync'' or else it does nothing.

You're either mistaken or are deliberately wrong. I've provided pointers to the exact place in the XNU source code where fsync(2) is implemented, and noted that it waits until the data blocks are written out by buf_flushdirtyblks() because MNT_WAIT is being set. Feel free to read the sources yourself.

The effect of the "proprietary made-up-bullshit Apple-only API" is that it provides tighter semantics for fsync() than what POSIX requires (not that that is hard), with the intention of enforcing write ordering and guaranteeing that the data really is written all the way to disk platters and not just to the write cache of the drive-- this is nice for databases which ought to require ACID semantics.

And this is a clear win when you compare some benchmark package that doesn't know
about simon-sez on Apple yet---you'll never catch the lie because
people don't do cord-yank tests of their filebench runs.  It's a nasty
trick Apple pulled!

You're welcome to believe that if you like, but I can't see which facts are being used to support such a conclusion. Frankly, going by the MySQL benchmarks I've seen, the performance improvement from having gettimeofday() handled by a fast commpage mechanism rather than via a system call makes so much more of a difference to MySQL that the performance difference with fsync() is down in the noise.

(To my mind, that says more about MySQL's code and the focus of Linux upon system-call micro-benchmarks than it proves about anything else, but YMMV....)

And all the stuff about metadata is a little bit silly because the
only metadata that's relevant to fsync is the mtime.  This
create/delete/rename stuff is mostly unrelated to fsync, other than a
few corner cases like creates and size-increases need to be pushed
(and, according to SQLite3, they generally are).  [ ... ]

You're ignoring edge cases which are vital to consider: for example, when you write enough data that the file needs to change representation (ie, from fitting in the direct inode list to indirect in FFS, or indirect to double-indirect, etc or similar equivalent with HFS+ B-trees), the file metadata also has to change, and an indirect pointer block has to be allocated, before the write of the data contents becomes visible.

In the case of FFS, these changes happen whenever you grow a file past 80KB in size (assuming 10 direct links per inode with 8K block size), which happens plenty often.
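
For anyone who wants to check the arithmetic, here is a small sketch in C. The function names are made up for illustration, and the 4-byte block pointer used in the note below is an assumption of mine rather than a figure from this thread; the 10 direct pointers and 8K blocks are the same assumptions as above:

```c
#include <stdint.h>

/* Largest file size, in bytes, reachable through direct block pointers alone. */
uint64_t direct_limit(uint64_t direct_ptrs, uint64_t block_size)
{
    return direct_ptrs * block_size;
}

/*
 * Largest file size once a single indirect block is in use as well.
 * ptr_size is the size of one block pointer in bytes (an assumed parameter).
 */
uint64_t single_indirect_limit(uint64_t direct_ptrs, uint64_t block_size,
                               uint64_t ptr_size)
{
    uint64_t ptrs_per_block = block_size / ptr_size;
    return direct_limit(direct_ptrs, block_size) + ptrs_per_block * block_size;
}
```

direct_limit(10, 8192) gives 81920 bytes, ie the 80KB above; with 4-byte pointers, single_indirect_limit(10, 8192, 4) gives 16859136 bytes (a little over 16MB), which is where the next change of representation would occur.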

Regards,
--
-Chuck


