Re: Lost file-system story

To: Donald Allen <donaldcallen%gmail.com@localhost>
Subject: Re: Lost file-system story
From: "Greg A. Woods" <woods%planix.ca@localhost>
Date: Sun, 11 Dec 2011 18:53:26 -0800

At Fri, 9 Dec 2011 22:12:25 -0500, Donald Allen 
<donaldcallen%gmail.com@localhost> wrote:
Subject: Re: Lost file-system story
> 
> On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods <woods%planix.ca@localhost> 
> wrote:
> > At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen 
> > <donaldcallen%gmail.com@localhost> wrote:
> > Subject: Re: Lost file-system story
> > >
> > > "does not guarantee to keep a consistent file system structure on the
> > > disk" is what I expected from NetBSD. From what I've been told in this
> > > discussion, NetBSD pretty much guarantees that if you use async and
> > > the system crashes, you *will* lose the filesystem if there's been any
> > > writing to it for an arbitrarily long period of time, since apparently
> > > meta-data for async filesystems doesn't get written as a matter of
> > > course.
> >
> > I'm not sure what the difference is.
> 
> You would be sure if you'd read my posts carefully. The difference is
> whether the probability of an async-mounted filesystem is near zero or
> near one.

I think perhaps the misunderstanding between you and everyone else is
because you haven't fully appreciated what everyone has been trying to
tell you about the true meaning of "async" in Unix-based filesystems,
and in particular about NetBSD's current implementation of Unix-based
filesystems, and what that all means for implementing algorithms that can
reliably repair the on-disk image of a filesystem after a crash.

I would have thought the warning given in the description of "async" in
mount(8) would be sufficient, but apparently you haven't read it that
way.

Perhaps the last occurrence of the word "or" in the last sentence of
that warning should be changed to "and".  To me that would at least make
the warning a bit stronger.

> > And that's why by default, and by very strong recommendation, filesystem
> > metadata for Unix-based filesystems (sans WAPBL) should always be
> > written synchronously to the disk if you ever hope to even try to use
> > fsck(8).
> 
> That's simply not true. Have you ever used Linux in all the years that
>  ext2 was the predominant filesystem? ext2 filesystems were routinely
> mounted async for many years; everything -- data, meta-data -- was
> written asynchronously with no regard to ordering. 

DO NOT confuse any Linux-based filesystem with any Unix-based
filesystem.  They may have nearly identical semantics from the user
programming perspective (i.e. POSIX), but they're all entirely different
under the hood.

Unix-based filesystems (sans WAPBL, and ignoring the BSD-only LFS) have
never ever Ever EVER given any guarantee about the repairability of the
filesystem after a crash if it has been mounted with MNT_ASYNC.

Indeed it is more or less _impossible_ by design for the system to make
any such guarantee given what MNT_ASYNC actually means for Unix-based
filesystems, and especially what it means in the NetBSD implementation.

> > Unix filesystems, including Berkeley Fast File System variant, have
> > never made any guarantees about the recoverability of an async-mounted
> > filesystem after a crash.
> 
> I never thought or asserted otherwise.

Well, from my perspective, especially after carefully reading your
posts, you do indeed seem to think that async-mounted Unix-based
filesystems should be able to be repaired, at least some of the time,
despite the documentation, and all the collected wisdom of those who've
replied to your posts so far, saying otherwise.

> > You seem to have inferred some impossible capability based on your
> > experience with other non-Unix filesystems that have a completely
> > different internal structure and implementation from the Unix-based
> > filesystems in NetBSD.
> 
> Nonsense -- I have inferred no such thing. Instead of referring you to
> previous posts for a re-read, I'll give you a little summary. I am
> speaking about probabilities. I completely understand that no
> filesystem mounted async (or any other way, for that matter), whether
> Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash.

OK, let's try stating this once more in what I hope are the same terms
you're trying to use:  The probability of any Unix-based filesystem
being repairable after a crash is, for all practical purposes, zero (0)
if it has been mounted with MNT_ASYNC, and if there was _any_ activity
that affected its structure since mount time up to the time of the
crash.  It still might survive after some types of changes, but it
_probably_ won't.  There are no guarantees.  Use "newfs" and "restore"
to recover.

Linux ext2 is not a Unix-based filesystem and Linux itself is not a
Unix-based kernel.  The meaning of "async" to ext2 is apparently very
different than it is to any Unix-based filesystem.  NetBSD might be free
of UNIX(tm) code, but it and its progenitors, right back to the 7th
Edition of the original Unix, were all implemented by people firmly
entrenched in the original Unix heritage from the inside out.

For Unix-based filesystems and their repair tools, any probability of
recovery less than one is as good as if it were zero.  Don't ever get
your hopes up.  Use "newfs" and "restore" to recover -- it'll be faster
on average in the long term.

Perhaps this sentence from McKusick's memo about fsck will help you to
understand:  "fsck is able to repair corrupted file systems using
procedures based upon the order in which UNIX honors these file system
update requests."  This is true for all Unix-based filesystems.

With MNT_ASYNC there is, by definition, no guarantee about the order of
metadata updates, or even that there will be _any_ metadata updates, and
so there is no possibility that _any_ algorithm can ever reliably repair
an async-mounted filesystem damaged by a crash.  Use "newfs" to recover.
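
To make the ordering argument concrete, here is a minimal sketch in
Python.  It is only a toy model, not anything taken from FFS or fsck:
the two write names ("inode" and "dirent") and all of the function names
are assumptions invented for illustration.  It enumerates which sets of
metadata writes could be on the disk at the moment of a crash, first
when writes reach the disk strictly in order, and then when any subset
may have reached it.

```python
from itertools import combinations

# Hypothetical, simplified metadata writes needed to create one file.
CREATE_FILE = ("inode", "dirent")

def ordered_crash_states(writes):
    """States reachable when writes hit the disk strictly in order."""
    return {frozenset(writes[:n]) for n in range(len(writes) + 1)}

def unordered_crash_states(writes):
    """States reachable when any subset of writes may have hit the disk."""
    return {frozenset(c)
            for n in range(len(writes) + 1)
            for c in combinations(writes, n)}

def dangling(state):
    """True if a directory entry is on disk but its inode is not."""
    return "dirent" in state and "inode" not in state

ordered = ordered_crash_states(CREATE_FILE)
unordered = unordered_crash_states(CREATE_FILE)

print(any(dangling(s) for s in ordered))    # False
print(any(dangling(s) for s in unordered))  # True
```

In this toy model the ordered case can leave at worst an inode that no
directory names, while the unordered case can also leave a directory
entry naming an inode that was never written.  With many operations in
flight the number of such unanticipated states grows quickly.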

> Another point that was made was that
> NetBSD ffs fsck was not designed to put a damaged filesystem back
> together, at least the kind of damage one might encounter with async
> mounting.

Exactly.  The on-disk integrity of a Unix-based filesystem can be
maintained to the degree necessary for its guaranteed repair if, and
only if, that filesystem is mounted in such a way that the system will
write all metadata synchronously (or some other extension such as WAPBL
is used to offer the same capability).

> The probability of an async filesystem surviving a crash is
> directly related to
> 
> - how often meta-data is written to the disk from the buffer cache
> - how smart fsck is

Nope.  This is not true for Unix-based filesystems.  "async" never
guarantees to write any metadata, ever, and certainly not in time for a
random crash, and NEVER EVER in any order that would make sense to fsck.

It is _impossible_ for fsck to be smart enough to recover from the
damage caused by an arbitrary crash of a filesystem mounted with
MNT_ASYNC.

Fsck can work only because the system is guaranteed to have updated the
FS metadata on disk in a defined and known order.  Only then is it
possible to algorithmically determine what inconsistencies remain on a
disk after a crash.
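
As a rough illustration of why a known order matters to a checker, the
following Python sketch classifies what it finds on disk.  It is a toy,
not fsck(8): KNOWN_ORDER, its three step names, and toy_check() are all
hypothetical, and the real FFS update sequence is more involved.  The
checker can name a repair only when the writes it finds form a prefix of
the order it was designed around.

```python
# Hypothetical order in which a synchronous mount writes the metadata
# for creating one file (a simplification, not the real FFS sequence).
KNOWN_ORDER = ["bitmap", "inode", "dirent"]

def toy_check(on_disk):
    """Return a verdict for the set of metadata writes found on disk."""
    for n in range(len(KNOWN_ORDER) + 1):
        if set(KNOWN_ORDER[:n]) == set(on_disk):
            if n == 0 or n == len(KNOWN_ORDER):
                return "consistent"
            # A proper prefix: the interrupted operation is identifiable,
            # so its completed steps can be undone in reverse order.
            return "repairable: undo " + ", ".join(reversed(KNOWN_ORDER[:n]))
    # Not a prefix of the known order: the checker has no rule for it.
    return "unknown damage"

print(toy_check({"bitmap", "inode"}))  # repairable: undo inode, bitmap
print(toy_check({"dirent"}))           # unknown damage
```

The second call is the kind of state an unordered mount can produce, and
the toy checker has no basis for deciding what to do with it.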

I suppose some assumptions about the possible on-disk state of a
filesystem could be changed to make it more possible to repair a damaged
MNT_ASYNC-mounted filesystem after a crash, assuming the fact it was
mounted with MNT_ASYNC was securely recorded on-disk, _and_ assuming
there is more tolerance for loss of data during the repair.  That's a
pending research project though, so far as I know, and probably not one
that will garner any attention either, unless you're up to the task
yourself.  I would guess though that some data loss is still virtually
guaranteed and that it's still going to be faster, overall, to use
"newfs" and "restore" to recover if you need to preserve as much data as
possible.

So, DO NOT use "mount -o async" on any filesystem if you're not prepared
to use "newfs" to repair that filesystem after a crash.

If you want a system which can do some filesystem operations with better
performance than is possible with a Unix-based filesystem doing its
default synchronous writes of metadata then you should consider using
WAPBL ("mount -o log") on NetBSD.  You might get away with "mount -o
softdep" _instead_, but I would very strongly recommend the former, and
never the latter (except of course on FreeBSD).

BTW, I wouldn't expect the re-implementation of ext2 for BSD to meet
your expectations of its behaviour on Linux either.  "async" is more
likely to mean what it means for NetBSD and FFS than it is to have any
relationship at all to the original Linux ext2 implementation.  This is
due to the level at which filesystems of different types are hooked into
the kernel.  If I understand correctly the implications of "async" are
created above these hooks and so equally affect all filesystems.

(the "umount" option for escaping the dangers of MNT_ASYNC is still
there for you too, though, and maybe someday someone will fix "mount -u
-o noasync" (if it is broken), and "mount -u -r", and maybe even sync(2)
as well)

-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.com@localhost>       +1 250 762-7675        http://www.planix.com/
