NetBSD-Users archive


Re: state or future of LFS?



>>>>> "cs" == Chuck Swiger <cswiger%mac.com@localhost> writes:

    cs> Yeah, well, relevant facts don't care about opinions: whether
    cs> you acknowledge them or wish to ignore them is up to you.

and kool-aid apple zealot ranting doesn't care about
decisively-demonstrated experience

    cs> That's absolutely right-- only, I mentioned the special API
    cs> (that's the F_FULLFSYNC bit you quoted above).

yes.  you mentioned it.  Its existence is the problem.  wtf does
fsync() itself do, then?  serve as a time-wasting decoy!  Enshrouded
in befuddling documentation to hide their benchmark-inflating lie.

    cs> As far as I can tell (ie, from looking at the code), by
    cs> default OSX does synchronous updates to data and async updates
    cs> to filesystem metadata because it trusts the journaling
    cs> mechanism to keep the metadata consistent

nobody does synchronous updates to data.  You are saying it has no
write cache at all.  Of course there is a write cache.

My point is that MySQL calls fsync() on Mac OS X, and Mac OS X does
not synchronously update the data before returning.  period.  I do not
care about metadata, nor does MySQL.  Are you saying I'm wrong and
MySQL is wrong?

    cs> But let's focus on just what other Unices do with fsync():

yes, this is pretty interesting.

But the point is, the MySQL guys observed corruption problems related
to broken fsync() on Mac OS X, only, not on other unixes.  I do not
doubt that other unixes have broken fsync() implementations, especially
w.r.t. sending flush commands to drives with write caches.  Then there
are supposedly lying drives, broken iSCSI stacks, VM implementations
that discard ATA sync commands, all sorts of gremlins.  and
furthermore the fsync API is limited.  And all this is worth talking
about.

However none of that overshadows the FACT that all these broken
limited unixes still permit a mostly-functional MySQL implementation.
To achieve feature parity on Mac OS X, MySQL had to use this
proprietary made-up-bullshit Apple-only API.  so do not tell me
fsync() on Mac OS X is not broken w.r.t. other Unixes.  it is.  You
have to do ``simon-sez, fsync'' or else it does nothing.  And this is
a clear win when you compare some benchmark package that doesn't know
about simon-sez on Apple yet---you'll never catch the lie because
people don't do cord-yank tests of their filebench runs.  It's a nasty
trick Apple pulled!
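
To make the ``simon-sez'' point concrete, here is a minimal sketch of
what a portable application ends up writing.  It assumes F_FULLFSYNC is
the fcntl() command exposed by <fcntl.h> on Mac OS X; durable_sync() is
a hypothetical name used only for illustration, and this is not claimed
to be how MySQL structures its own call.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * durable_sync() is a hypothetical helper, not part of any library.
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int durable_sync(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    /* Mac OS X: the extra request needed to push data to the platter. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /*
     * Assumption: the request can fail on filesystems that do not
     * support it, so fall through to plain fsync() in that case.
     */
#endif
    /* Everywhere else, plain fsync() is all the API offers. */
    return fsync(fd);
}
```

A benchmark or database that only ever calls fsync() never reaches the
F_FULLFSYNC branch, which is exactly the situation described above.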

And all the stuff about metadata is a little bit silly because the
only metadata that's relevant to fsync is the mtime.  This
create/delete/rename stuff is mostly unrelated to fsync, other than a
few corner cases: creates and size-increases do need to be pushed
(and, according to SQLite3, they generally are).  But even things like
truncates are not pushed by fsync, and this isn't a problem.  The API
does not even indicate an imaginable way of pushing a delete or a
rename so I wouldn't expect a filesystem to do that, ever, nor would a
database.

I don't know of any API for pushing create/delete/rename to the disk,
and I brought up the 'make' example to show how these things might
matter and how ZFS and LFS are better at dealing with them, but it's a
different application than a database---here you care about things
being done in order, not about things being done before you return.  I
suspect there is no API for these things at all because such things
are typically not needed to maintain consistency of a file that
represents a database or a VM emulated-disk-backing-store.  In any case,
the metadata create/rename/delete stuff you talked about, and whether
it's done ``synchronously'' or not in a variety of decade-old
filesystems, is about whether it's done before the
create/delete/rename call returns to the user and thus has nothing to
do with fsync().

For example, in the fsync() vs. ext4 discussion that blew up among
Linux folks, they settled on this as the proper way for an
unsophisticated application to update a human-readable config file:

 1. open temporary configfile.XXXXXX
 2. copy the old file into the new file, making changes
 3. fsync() it
 4. close it
 5. rename configfile.XXXXXX to configfile
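
The five steps above can be sketched in C roughly as follows.  This is
a minimal sketch under stated assumptions: replace_file_atomically() is
a hypothetical name, and instead of copying the old file and editing it
(step 2) the caller hands in the complete new contents as a buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * replace_file_atomically() is a hypothetical helper for illustration.
 * Returns 0 on success, -1 on failure; on failure `path` is untouched.
 */
int replace_file_atomically(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    size_t off = 0;
    int fd, n;

    assert(path != NULL && data != NULL);

    /* Step 1: open temporary configfile.XXXXXX in the same directory. */
    n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    /* Step 2: write the new contents, coping with short writes. */
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0)
            goto fail;
        off += (size_t)w;
    }

    /* Step 3: fsync() it, so the data is pushed before the rename. */
    if (fsync(fd) != 0)
        goto fail;

    /* Step 4: close it. */
    if (close(fd) != 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Step 5: rename configfile.XXXXXX to configfile. */
    if (rename(tmp, path) != 0)
        goto fail;
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}
```

Note that mkstemp() creates the temporary file with mode 0600, so a real
program would probably want to restore the old file's permissions before
the rename; the sketch leaves that out.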

In this case, they care that configfile contains either the old data
or the new data, but either one is fine.  They want an atomic
update: it must never end up as garbage, empty, or half-written.  To
achieve that, it makes not a bit of difference whether the step 5
rename is ``synchronous'' or not---it's a distraction from the
discussion about fsync.  One uses fsync often to synchronize with
something outside the kernel image, like an MTA handing off
responsibility for an email message---you need to wait for fsync() to
return before saying you've accepted the message.  Or a DBMS that
needs to sync the data to disk before reporting success to its
external client (as promised by the ACID model, though AIUI not all,
not even Oracle, do this full pedantry by default).
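
The MTA case can be sketched the same way.  This is only an
illustration: accept_message() and the spool-file layout are
hypothetical, not taken from any real MTA.  The one thing it is meant
to show is that the acknowledgement is gated on fsync() returning.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * accept_message() is a hypothetical MTA-style handoff.
 * Returns 1 if the caller may tell the sender "accepted", 0 otherwise.
 */
int accept_message(const char *spool_path, const char *msg, size_t len)
{
    size_t off = 0;
    int fd;

    assert(spool_path != NULL && msg != NULL);

    /* O_EXCL: refuse to overwrite a message already in the spool. */
    fd = open(spool_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return 0;

    while (off < len) {
        ssize_t w = write(fd, msg + off, len - off);
        if (w < 0) {
            close(fd);
            unlink(spool_path);
            return 0;
        }
        off += (size_t)w;
    }

    /* Responsibility is not ours until fsync() has returned. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(spool_path);
        return 0;
    }
    if (close(fd) != 0) {
        unlink(spool_path);
        return 0;
    }
    return 1;
}
```

If fsync() returns before the data is really on disk, the "accepted"
reply goes out anyway and the message can still be lost on a cord-yank.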

What you care about a lot more for this type of metadata-heavy
scenario ('make' or config file updating) is the order in which things
happen. For example, on LFS or ZFS, the write() and the rename would
be ordered with respect to each other, so you could skip the fsync
call entirely and still count on the config file containing either old
data or new.  Skipping the fsync and still getting the guarantee means
you can modify the config file faster.

But the old ``FFS+sync'' pre-softdep case you cited does not order
write/rename w.r.t. each other, either, so it's making stronger
promises in a direction of little practical value.

The Mac OS X fsync() problems OTOH are weaker promises, and much more
confusing ones, which I'm still not sure you've gotten to the bottom
of, because you keep talking about fsync() and metadata, which are
unrelated to each other aside from mtime.  And they are weaker in a
direction that's detrimental in practice, based on the experience of
MySQL with many platforms.



