tech-kern archive


Re: Lost file-system story



On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods <woods%planix.ca@localhost> wrote:
> At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen 
> <donaldcallen%gmail.com@localhost> wrote:
> Subject: Re: Lost file-system story
>>
>> "does not guarantee to keep a consistent file system structure on the
>> disk" is what I expected from NetBSD. From what I've been told in this
>> discussion, NetBSD pretty much guarantees that if you use async and
>> the system crashes, you *will* lose the filesystem if there's been any
>> writing to it for an arbitrarily long period of time, since apparently
>> meta-data for async filesystems doesn't get written as a matter of
>> course.
>
> I'm not sure what the difference is.

You would be sure if you'd read my posts carefully. The difference is
whether the probability of an async-mounted filesystem surviving a
crash is near zero or near one.

> You seem to be quibbling over
> minor differences and perhaps one-off experiences.

Having a crash almost certainly destroy your filesystem vs. having the
filesystem almost certainly survive a crash is not a minor difference.

> Both OpenBSD and
> NetBSD also say that you should not use the "async" flag unless you are
> prepared to recreate the file system from scratch if your system
> crashes.  That means use newfs(8) [and, by implication, something like
> restore(8)], not fsck(8), to recover after a crash.  You got lucky with
> your test on OpenBSD.

>
>
>> And then there's the matter of NetBSD fsck apparently not
>> really being designed to cope with the mess left on the disk after
>> such a crash. Please correct me if I've misinterpreted what's been
>> said here (there have been a few different stories told, so I'm trying
>> to compute the mean).
>
> That's been true of Unix (and many unix-like) filesystems and their
> fsck(8) commands since the beginning of Unix.
>
> fsck(8) is designed to rely on the possible states of on-disk filesystem
> metadata because that's how Unix-based filesystems have been guaranteed
> to work (barring use of MNT_ASYNC, obviously).
>
> And that's why by default, and by very strong recommendation, filesystem
> metadata for Unix-based filesystems (sans WAPBL) should always be
> written synchronously to the disk if you ever hope to even try to use
> fsck(8).

That's simply not true. Have you ever used Linux in all the years that
ext2 was the predominant filesystem? ext2 filesystems were routinely
mounted async for many years; everything -- data, meta-data -- was
written asynchronously with no regard to ordering. And yet, when those
systems crashed, fsck generally, not always, but usually, restored the
filesystem to working order. Of course, some data could be lost and
was, but you rarely suffered the loss of an entire filesystem. That's
a fact.

>
>
>> I am not telling the OpenBSD story to rub NetBSD peoples' noses in it.
>> I'm simply pointing out that that system appears to be an example of
>> ffs doing what I thought it did and what I know ext2 and journal-less
>> ext4 do -- do a very good job of putting the world into operating
>> order (without offering an impossible guarantee to do so) after a
>> crash when async is used, after having been told that ffs and its fsck
>> were not designed to do this.
>
> You seem to be very confused about what MNT_ASYNC is and is not.  :-)

No, you don't understand what I've said.

>
> Unix filesystems, including the Berkeley Fast File System variant, have
> never made any guarantees about the recoverability of an async-mounted
> filesystem after a crash.

I never thought or asserted otherwise.

>
> You seem to have inferred some impossible capability based on your
> experience with other non-Unix filesystems that have a completely
> different internal structure and implementation from the Unix-based
> filesystems in NetBSD.

Nonsense -- I have inferred no such thing. Instead of referring you to
previous posts for a re-read, I'll give you a little summary. I am
speaking about probabilities. I completely understand that no
filesystem mounted async (or any other way, for that matter), whether
Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash. The
probability of surviving a crash for any of them is < 1. But my
experience with Linux ext2 over many years has been that the
probability of survival is quite high, near 1. When I reported my
experience with NetBSD ffs in this thread, I expressed surprise that
the filesystem was a total loss, based on what preceded the crash. My
surprise was a result of years of Linux experience. I then got some
responses -- see the one from Thor Lancelot Simon, for example. In
that message, he asserts that, in NetBSD, *nothing* pushes meta-data
to the disk for a filesystem mounted async. Others said some
contradictory things about that and I'm not sure what the truth is,
but if Simon is right, then the probability of crash survival in
NetBSD is indeed near zero. Another point that was made was that
NetBSD ffs fsck was not designed to put a damaged filesystem back
together, at least the kind of damage one might encounter with async
mounting. The probability of an async filesystem surviving a crash is
directly related to

- how often meta-data is written to the disk from the buffer cache
- how smart fsck is

Linux systems do periodically write ext2 meta-data to the disk. And
ext2 fsck has always been very good, and has gotten better over the
years, due to the efforts of Ted Ts'o. I first installed Linux in
1993, almost 20 years ago, and have been using it continuously ever
since. I have *never* lost an ext2 filesystem and I've never mounted
one sync.

Apparently the same cannot be said for NetBSD's ffs implementation or
its fsck; async in NetBSD seemingly has much narrower applicability
than has been the case for years in Linux. And my point in telling you
about the OpenBSD experiment was simply to demonstrate that their
probability of survival is > 0, that their ffs+fsck implementation
*can* survive a crash, even one during a lot of write activity. That's
all that can be concluded from my one data point. I am making no
assertions about how close to 1 their probability is because I don't
have the data to do so. However, you made a statement -- "You got
lucky with your test on OpenBSD" -- that you didn't support and might
not be true. To find out whether it's true, we'd have to repeat my
experiment a large number of times. If the system recovered in a large
fraction of those experiments, then you're wrong. A small fraction,
and you're right. This is just another way of saying we don't know how
close to 1 OpenBSD's probability of recovery is; we just know it's not
0.
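
The repeat-the-experiment test described above can be sketched as a
small calculation. This is only an illustration under stated
assumptions: the Wilson score interval is one reasonable estimator, not
something proposed in this thread, and every trial count other than the
single observed recovery (1 of 1) is a made-up placeholder, not a
measurement.

```python
import math


def wilson_interval(survived, trials, z=1.96):
    """Approximate 95% Wilson score interval for a survival probability.

    survived: number of crash trials after which fsck restored the filesystem
    trials:   total number of crash trials run
    z:        normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if trials <= 0:
        raise ValueError("need at least one trial")
    if not 0 <= survived <= trials:
        raise ValueError("survived must be between 0 and trials")

    p_hat = survived / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # (1, 1) is the single data point from the OpenBSD experiment.
    # (18, 20) and (2, 20) are HYPOTHETICAL outcomes, included only to show
    # what "a large fraction" and "a small fraction" would look like.
    for survived, trials in [(1, 1), (18, 20), (2, 20)]:
        low, high = wilson_interval(survived, trials)
        print(f"{survived}/{trials} recovered: "
              f"survival probability in [{low:.2f}, {high:.2f}]")
```

With a single successful trial the interval is very wide, which matches
the point above: one data point cannot say how close to 1 the
probability is. Only with many trials does the interval narrow enough
to settle whether the recovery was luck.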

/Don Allen

>
> Perhaps the BSD manuals have assumed some knowledge of Unix history, but
> even the NetBSD-1.6 mount(8) manual, from 2002, is _extremely_ clear
> about the dangers of the "async" flag, with strong emphasis in the
> formatted text on the relevant warning:
>
>     async       All I/O to the file system should be done asyn-
>                 chronously.  In the event of a crash, _it_is_
>                 _impossible_for_the_system_to_verify_the_integrity_of_
>                 _data_on_a_file_system_mounted_with_this_option._  You
>                 should only use this option if you have an applica-
>                 tion-specific data recovery mechanism, or are willing
>                 to recreate the file system from scratch.
>
> According to CVS that wording has not changed since October 1, 2002, and
> the emphasised text has been there unchanged since September 16, 1998.
>
>> So I'd love it if my experience encourages someone to improve NetBSD
>> ffs and fsck to make use of async practical
>
> As others have already said, this has already been done.  It's called
> WAPBL.  See wapbl(4) for more information.  Use "mount -o log" to enable
> it.
>

No. If I understand correctly what WAPBL is, it's journaling added to
ffs, just as ext3 is ext2 + journaling. ext3 is slower than ext2 (why
did Google use ext2 for so long? And why, now that they've converted
to ext4, have they chosen to run it with the journal turned off?
Because journals exact a performance cost, that's why) and unless the
laws of physics have been repealed ffs+journaling is going to be
slower than async ffs without journaling. It's true that I haven't
investigated the performance differential for my application, but
given that I have a choice between NetBSD and Linux for this,
ffs+WAPBL's competition is ext4 without journaling. I don't know this
for a fact (yet), but my guess is that the latter will be enough
faster to matter.

> (BTW, I personally don't think you would want to use softdep -- it can
> suffer almost as badly as async after a crash, though perhaps without
> totally invalidating fsck(8)'s ability to at least recover files and
> directories which were static since mount; and it does also offer vastly
> improved performance in many use cases, but as the manual says, it
> should still be used with care (i.e. recognition of the risks of
> less-tested, much more complex code, and vastly changed internal
> implementation semantics implying radically different recovery modes.)
>
> --
>                                                Greg A. Woods
>                                                Planix, Inc.
>
> <woods%planix.com@localhost>       +1 250 762-7675       
>  http://www.planix.com/

