>>>>> "cs" == Chuck Swiger <cswiger%mac.com@localhost> writes:

    cs> Yeah, well, relevant facts don't care about opinions: whether
    cs> you acknowledge them or wish to ignore them is up to you.

and kool-aid apple zealot ranting doesn't care about
decisively-demonstrated experience.

    cs> That's absolutely right-- only, I mentioned the special API
    cs> (that's the F_FULLFSYNC bit you quoted above).

yes.  you mentioned it.  Its existence is the problem.  wtf does
fsync() itself do, then?  serve as a time-wasting decoy!  Enshrouded
in befuddling documentation to hide their benchmark-inflating lie.

    cs> As far as I can tell (ie, from looking at the code), by
    cs> default OSX does synchronous updates to data and async updates
    cs> to filesystem metadata because it trusts the journaling
    cs> mechanism to keep the metadata consistent

nobody does synchronous updates to data.  You are saying it has no
write cache at all.  Of course there is a write cache.  My point is
that MySQL calls fsync() on Mac OS X, and Mac OS X does not
synchronously update the data before returning.  period.  I do not
care about metadata, nor does MySQL.  Are you saying I'm wrong and
MySQL is wrong?

    cs> But let's focus on just what other Unices do with fsync():

yes, this is pretty interesting.  But the point is, the MySQL guys
observed corruption problems related to broken fsync() on Mac OS X
only, not on other unixes.

I do not doubt that other unixes have broken fsync() implementations,
especially w.r.t. sending flush commands to drives with write caches.
Then there are supposedly lying drives, broken iSCSI stacks, VM
implementations that discard ATA sync commands, all sorts of gremlins.
And furthermore the fsync API is limited.  All this is worth talking
about.  However none of that overshadows the FACT that all these
broken limited unixes still permit a mostly-functional MySQL
implementation.  To achieve feature parity on Mac OS X, MySQL had to
use this proprietary made-up-bullshit Apple-only API.

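For readers who have not seen the Apple-only call, here is a rough
sketch of what an application ends up doing.  It is not MySQL's
actual code.  It assumes Python's fcntl module exposes F_FULLFSYNC on
platforms that define it (macOS) and falls back to plain fsync()
everywhere else; the helper name full_sync is made up for this
example.

```python
import fcntl
import os
import tempfile

# F_FULLFSYNC is only defined on platforms that have it (macOS);
# on other systems this is None and we use plain fsync().
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def full_sync(fd):
    """Hypothetical helper: flush fd as durably as the platform allows.

    Returns the name of the mechanism that was used.
    """
    if F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            # Assumption: some filesystems reject F_FULLFSYNC,
            # so fall back to fsync() rather than failing.
            pass
    os.fsync(fd)
    return "fsync"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"committed record\n")
        f.flush()  # drain Python's userspace buffer into the kernel
        print(full_sync(f.fileno()))
```

The point of the sketch is the platform check: portable code that
only calls fsync() never reaches the F_FULLFSYNC branch.
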
so do not tell me fsync() on Mac OS X is not broken w.r.t. other
Unixes.  it is.  You have to do ``simon-sez, fsync'' or else it does
nothing.  And this is a clear win on any benchmark package that
doesn't know about simon-sez on Apple yet: you'll never catch the lie
because people don't do cord-yank tests of their filebench runs.  It's
a nasty trick Apple pulled!

And all the stuff about metadata is a little bit silly because the
only metadata that's relevant to fsync is the mtime.  This
create/delete/rename stuff is mostly unrelated to fsync, other than a
few corner cases: creates and size-increases need to be pushed (and,
according to SQLite3, they generally are).  But even things like
truncates are not pushed by fsync, and this isn't a problem.  The API
does not even offer an imaginable way of pushing a delete or a rename,
so I wouldn't expect a filesystem to do that, ever, nor would a
database.

I don't know of any API for pushing create/delete/rename to the disk.
I brought up the 'make' example to show how these things might matter
and how ZFS and LFS are better at dealing with them, but it's a
different application than a database: with 'make' you care about
things being done in order, not about things being done before you
return.  I suspect there is no API for these things at all because
such things are typically not needed to maintain consistency of a file
that represents a database or a VM emulated-disk-backing-store.

In any case, the metadata create/rename/delete stuff you talked about,
and whether it's done ``synchronously'' or not in a variety of
decade-old filesystems, is about whether it's done before the
create/delete/rename call returns to the user, and thus has nothing to
do with fsync().

For example, in the fsync() vs. ext4 discussion that blew up among
Linux folks, they settled on this as the proper way for an
unsophisticated application to update a human-readable config file:

 1. open temporary configfile.XXXXXX
 2. copy the old file into the new file, making changes
 3. fsync() it
 4. close it
 5. rename configfile.XXXXXX to configfile

In this case, they care that configfile contains either the old data
or the new data, but either one is fine.  They want an atomic update:
it must not contain garbage, no data, or half-written data.  To
achieve that, it makes not a bit of difference whether the step 5
rename is ``synchronous'' or not.  It's a distraction from the
discussion about fsync.

One uses fsync often to synchronize with something outside the kernel
image, like an MTA handing off responsibility for an email message:
you need to wait for fsync() to return before saying you've accepted
the message.  Or a DBMS that needs to sync the data to disk before
reporting success to its external client (as promised by the ACID
model, though AIUI not all, not even Oracle, do this full pedantry by
default).

What you care about a lot more for this type of metadata-heavy
scenario ('make' or config file updating) is the order in which things
happen.  For example, on LFS or ZFS, the write() and the rename would
be ordered with respect to each other, so you could skip the fsync
call entirely and still count on the config file containing either old
data or new.  Skipping the fsync and still getting the guarantee means
you can modify the config file faster.  But the old ``FFS+sync''
pre-softdep case you cited does not order write/rename w.r.t. each
other, either, so it's making stronger promises in a direction of
little practical value.

The Mac OS X fsync() problems OTOH are weaker promises, in a direction
that's relevantly detrimental based on the experience of MySQL with
many platforms.  They are also much more confusing promises, which I'm
still not sure you've gotten to the bottom of, because you are talking
about fsync() and metadata, which are unrelated to each other aside
from mtime.
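
The five-step config file update described above can be sketched
roughly like this.  It is only an illustration under assumptions: the
helper name update_config and the transform callback are invented for
the example, and the temporary file is created in the same directory
as the target so that the final rename stays on one filesystem.

```python
import os
import tempfile


def update_config(path, transform):
    """Hypothetical helper: replace path with transform(old bytes).

    A reader of path sees either the old contents or the new ones.
    Note: the new file gets mkstemp's default permissions, not the
    original file's mode.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as old:
        new_data = transform(old.read())

    # 1. open temporary configfile.XXXXXX next to the target
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", dir=directory
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_data)     # 2. copy old into new, with changes
            tmp.flush()             # drain Python's buffer to the kernel
            os.fsync(tmp.fileno())  # 3. fsync() it
        # 4. the with-block has closed the file
        os.replace(tmp_path, path)  # 5. rename over the original
    except BaseException:
        # Clean up the temporary file if anything failed before step 5.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo = os.path.join(demo_dir, "configfile")
    with open(demo, "wb") as f:
        f.write(b"key=old\n")
    update_config(demo, lambda data: data.replace(b"old", b"new"))
    with open(demo, "rb") as f:
        print(f.read())
```

Nothing in the sketch waits for the rename itself to reach the disk,
which matches the argument above: the guarantee comes from the data
being synced before the rename, not from the rename being
``synchronous''.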