NetBSD-Users archive
Re: state or future of LFS?
On Apr 13, 2009, at 12:21 PM, Miles Nordin wrote:
"cs" == Chuck Swiger <cswiger%mac.com@localhost> writes:
cs> Yeah, well, relevant facts don't care about opinions: whether
cs> you acknowledge them or wish to ignore them is up to you.
and kool-aid apple zealot ranting doesn't care about
decisively-demonstrated experience
I'm actually a tequila-drinking BSD fan who likes FreeBSD for general
server purposes, NetBSD for embedded/flash-driven appliance roles, and
OSX for a desktop. As for "ranting", I'll leave that for those better
suited to it-- just as I'll abandon this thread as a waste of my time
if you continue the ad hominem attacks.
cs> That's absolutely right-- only, I mentioned the special API
cs> (that's the F_FULLFSYNC bit you quoted above).
yes. you mentioned it. Its existence is the problem. wtf does
fsync() itself do, then? serve as a time-wasting decoy! Enshrouded
in befuddling documentation to hide their benchmark-inflating lie.
Next time, when you don't know what something does, UTSL & RTFM:
"DESCRIPTION
Fsync() causes all modified data and attributes of fildes to be moved to
a permanent storage device. This normally results in all in-core
modified copies of buffers for the associated file to be written to a
disk.

Note that while fsync() will flush all data from the host to the drive
(i.e. the "permanent storage device"), the drive itself may not
physically write the data to the platters for quite some time and it may
be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the
application may find that only some or none of their data was written.
The disk drive may also re-order the data so that later writes may be
present, while earlier writes are not.

This is not a theoretical edge case. This scenario is easily reproduced
with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of
their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC
fcntl asks the drive to flush all buffered data to permanent storage.
Applications, such as databases, that require a strict ordering of
writes should use F_FULLFSYNC to ensure that their data is written in
the order they expect. Please see fcntl(2) for more detail."
Does that answer your question clearly enough? If you still find this
documentation "befuddling", I suppose I might see whether I can file a
radar with the appropriate parties to clarify, but IMO the above
documentation is more complete than, say:
http://netbsd.gw.com/cgi-bin/man-cgi?fsync+2+NetBSD-current
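
For illustration, here is a minimal sketch of how an application might
ask for the stronger guarantee described in the man page above and fall
back to plain fsync() elsewhere. The helper name durable_sync() is
hypothetical (it is not part of any system API), and the fallback
policy is an assumption, not something the man page prescribes.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush fd as durably as the platform allows.
 * Where F_FULLFSYNC is defined (Mac OS X), ask the drive to flush its
 * own cache; otherwise, or if that request fails, fall back to fsync().
 * Returns 0 on success, -1 on failure (errno set by the last call).
 */
int durable_sync(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Some filesystems reject F_FULLFSYNC; try the weaker call. */
#endif
    return fsync(fd);
}
```

On platforms without F_FULLFSYNC this compiles down to a plain fsync()
call, so whatever the local fsync() does or does not guarantee about
the drive's write cache still applies.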
cs> As far as I can tell (ie, from looking at the code), by
cs> default OSX does synchronous updates to data and async updates
cs> to filesystem metadata because it trusts the journaling
cs> mechanism to keep the metadata consistent
nobody does synchronous updates to data. You are saying it has no
write cache at all. Of course there is a write cache.
My point is that MySQL calls fsync() on Mac OS X, and Mac OS X does
not synchronously update the data before returning. period. I do not
care about metadata, nor does MySQL. Are you saying I'm wrong and
MySQL is wrong?
Evidently: yes.
I've seen plenty of these debates about what fsync() does on various
platforms before, as well as the corresponding debate on the
interaction with the write-cache of the drives, and the main
difference between OSX and other platforms isn't whether fsync()
returns before the data is written out to the drives, but whether the
platform lets fsync() return without confirming whether the disk
itself has flushed the data to the platters or not.
If write-caching is enabled and the disks are not mounted using -o
sync or equivalent, *all* of the platforms I am familiar with let
fsync() cheat in the sense that they do not wait until the disk cache
itself has been flushed before returning.
What's interesting about OSX isn't that it does the same thing here
that other platforms do, but that it has the optional capability to
ensure transaction ordering and writes completing through the write-
cache on a per-descriptor or system-wide basis using fsync(), without
needing to disable write-caching entirely for other processes.
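
To make the ordering point concrete, here is a rough sketch of a
two-step commit on a single descriptor: the payload is flushed before a
commit marker is written, so a reader that finds the marker can assume
the payload preceded it. The function names and the one-byte 'C' marker
are made up for this example; this is not how MySQL or any particular
database does it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical barrier: strongest flush available for this descriptor. */
static int barrier(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
#endif
    return fsync(fd);
}

/* Write all of buf, retrying on short writes and EINTR. 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;      /* no progress; give up rather than spin */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Append payload, flush, then append a one-byte commit marker and
 * flush again. The first barrier keeps the marker from reaching
 * stable storage ahead of the payload it vouches for.
 */
int commit_record(int fd, const void *payload, size_t len)
{
    static const char marker = 'C';

    assert(payload != NULL || len == 0);
    if (write_all(fd, payload, len) != 0 || barrier(fd) != 0)
        return -1;
    if (write_all(fd, &marker, 1) != 0 || barrier(fd) != 0)
        return -1;
    return 0;
}
```

Only this descriptor pays for the flushes; other processes keep the
benefit of the drive's write cache.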
cs> But let's focus on just what other Unices do with fsync():
yes, this is pretty interesting.
But the point is, the MySQL guys observed corruption problems related
to broken fsync() on Mac OS X, only, not on other unixes. I do not
doubt that other unixes have broken fsync() commands, especially
w.r.t. sending flush commands to drives with write caches. Then there
are supposedly lying drives, broken iSCSI stacks, VM implementations
that discard ATA sync commands, all sorts of gremlins. and
furthermore the fsync API is limited. And all this is worth talking
about.
Fine. However, I've seen no evidence from you or from the MySQL link
provided earlier which suggests that OSX is doing anything different
with fsync() than the other platforms: without specifics, I would be
willing to conclude that the data corruption being reported was due
either to a bug with MySQL expecting Linux kernel semantics, or to the
drive re-ordering cached writes when something went wrong (ie, OS bug
or power yanked).
I suppose it's also possible that OSX itself has changed significantly
since the MySQL issue you reported, but my recollection is that these
fsync() changes and a related integration of lockf() vs. flock()
handling were all made way back around the 2000-2002 timeframe (ie,
Mac OS X 10.0 / 10.1). If the issue is still present in 10.3 or later,
it should be obvious and easy to reproduce with current OS versions,
no...?
However none of that overshadows the FACT that all these broken
limited unixes still permit a mostly-functional MySQL implementation.
To achieve feature parity on Mac OS X, MySQL had to use this
proprietary made-up-bullshit Apple-only API. so do not tell me
fsync() on Mac OS X is not broken w.r.t. other Unixes. it is. You
have to do ``simon-sez, fsync'' or else it does nothing.
You're either mistaken or are deliberately wrong. I've provided
pointers to the exact place in the XNU source code where fsync(2) is
implemented, and noted that it waits until the data blocks are written
out by buf_flushdirtyblks() because MNT_WAIT is being set. Feel free
to read the sources yourself.
The effect of the "proprietary made-up-bullshit Apple-only API" is
that it provides tighter semantics for fsync() than what POSIX
requires (not that that is hard), with the intention of enforcing
write ordering and guaranteeing that the data really is written all
the way to disk platters and not just to the write cache of the
drive-- this is nice for databases which ought to require ACID
semantics.
And this is a clear win when you compare some benchmark package that
doesn't know about simon-sez on Apple yet---you'll never catch the lie
because people don't do cord-yank tests of their filebench runs. It's a
nasty trick Apple pulled!
You're welcome to believe that if you like, but I can't see which
facts are being used to support such a conclusion. Frankly, going by
the MySQL benchmarks I've seen, having gettimeofday() handled by a fast
commpage mechanism rather than via a system call makes so much more of
a difference to MySQL that the performance difference with fsync() is
down in the noise.
(To my mind, that says more about MySQL's code and the focus of Linux
upon system-call micro-benchmarks than it proves about anything else,
but YMMV....)
And all the stuff about metadata is a little bit silly because the
only metadata that's relevant to fsync is the mtime. This
create/delete/rename stuff is mostly unrelated to fsync, other than a
few corner cases like creates and size-increases need to be pushed
(and, according to SQLite3, they generally are). [ ... ]
You're ignoring edge cases which are vital to consider: for example,
when you write enough data that the file needs to change
representation (ie, from fitting in the direct inode list to indirect
in FFS, or indirect to double-indirect, etc, or the equivalent with
HFS+ B-trees), the file metadata also has to change, and an indirect
pointer block has to be allocated, before the newly written data
contents become visible.
In the case of FFS, these changes happen whenever you grow a file past
80KB in size (assuming 10 direct links per inode with 8K block size),
which happens plenty often.
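
The arithmetic behind that figure can be sketched as follows. The
function names are hypothetical, and the 10-pointer / 8K parameters are
simply the assumptions stated above; the FFS headers I am aware of
define 12 direct pointers, which would move the boundary to 96KB, so
treat the numbers as illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Largest file size reachable through direct block pointers alone. */
uint64_t direct_limit_bytes(unsigned direct_ptrs, uint64_t block_size)
{
    assert(block_size > 0);
    return (uint64_t)direct_ptrs * block_size;
}

/*
 * Nonzero if growing a file from old_size to new_size crosses the
 * direct-pointer limit, i.e. the first indirect pointer block (a
 * metadata change) must be allocated for the new data to be reachable.
 */
int needs_first_indirect(uint64_t old_size, uint64_t new_size,
                         unsigned direct_ptrs, uint64_t block_size)
{
    uint64_t limit = direct_limit_bytes(direct_ptrs, block_size);

    return old_size <= limit && new_size > limit;
}
```

With 10 direct pointers and 8192-byte blocks the limit is 81920 bytes
(80KB), so a write that takes a file from 80KB to 80KB plus one byte is
the kind of update where data and metadata both have to reach the disk.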
Regards,
--
-Chuck