>>>>> "wk" == Wouter Klouwen <dublet%acm.org@localhost> writes: >>>>> "cc" == Christopher Chen <muffaleta%gmail.com@localhost> writes: wk> The only problems occur when the drive fills up, usually wk> around 80%. These problems are crashes, but I've not seen any wk> data loss. I ran it like 0.5 - 1 decade ago and found that it seemed to need some tools around it to be useful as more than /usr/obj scratchspace such as: * a plain defragmenter, or the Holy Grail: access-pattern-observing data-reorganizing cleaner * a log roll-forward agent. at some point it was truncating the log at boot instead of rolling it, meaning that the filesystem was still guaranteed to be consistent after a cord-yank, but some data which was committed to the disk and could have been rescued according to the design, was discarded instead because of missing implementation. this may affect fsync() reliability, not sure. * was NFS working? * an lfsck that can deal with all situations that plausibly arise in practice, and do it without running out of memory I think some of these were added after I stopped using it. The key advantage in practice was blazingly-fast performance for temporally-local workloads like builds or write-only workloads like untarring things. people often used such workloads for informal testing, so LFS would beat the pants off FFS. wk> IMO, there is no alternative for LFS on NetBSD. I've been using ZFS on Solaris, and it's available performantly on FreeBSD. In my experience and from the zfs-discuss list, it's not as safe/stable as UFS/FFS/ext3. The Sun people have riddled it with assertions and panics such that it'll often find something unexpected and simply refuse to let you touch your data. I think they might actually be on the right track with this approach, but it sounds like they haven't arrived yet. There are a bunch of obstructionist appologists over there making a whole littany of excuses for the lost pools casting FUD (some of it reasonable) on every part of the system except ZFS, but the bottom line is, the guys with hundreds of filesystems and formal datacenter SAN setups say they are losing ZFS pools a lot more often than they lost UFS/vxfs filesystems. ZFS is still missing the defragmenter and the lfsck, from the LFS todo list. However, it does consistently beat UFS for performance. and it has all those other fancy features they blogflog, but to me the most interesting feature came up in list discussion recently. ZFS, I think, shares LFS's property that it always recovers to some state the filesystem passed through in the minutes leading up to a crash. POSIX guarantees no such thing, nor is any such thing implied by a working version of POSIX's ``optional'' (ignored on Mac OS X) fsync() call. The ZFS/LFS guarantee makes so much intuitive sense to application developers that they often assume they have it, but they don't, and they've never been promised it, and they've never had it on Unix until now, until this latest batch of filesystems. This guarantee makes unnecessary most of the bizarre tricks sqlite3 does to survive cord-yanking: http://sqlite.org/atomiccommit.html If we can come up with a word for it, achieve it, and start bragging about it, it may become expected, and writing programs might be a lot nicer. You should care about this because BDB does *NOT* do all those sqlite3 tricks. Also, casual things, like if I yank the cord during 'make' do I need to 'make clean' or can I just restart it? 
And of course things like Postfix queues, maildirs, and NFSv3
correctness are even more worry-free than before (on ZFS where
fsync() works, but not necessarily on LFS), if you were worried about
them before, which you probably weren't.

The downside is that so far the filesystems that can keep this
promise, like ZFS and LFS (and probably WAFL?), are the filesystems
that never overwrite blocks in place, so they are prone to
fragmentation and do particularly badly with database workloads.  In
fact I expect they'd do badly with any nested-filesystem situation, a
filesystem inside a file, like vnconfig or a virtual machine disk
image or an iSCSI backing store.  This is becoming a really common
use-case, and AIUI UFS is still beating ZFS on this one.

Another cool thing about ZFS they don't discuss much is the way the
POSIX filesystem layer is separated from the ``transactional object
store'' back-end.  There are several non-filesystem things which use
the back-end directly, such as:

* uncached zvols (iSCSI backing stores),
* pNFS (unfinished),
* Lustre (unfinished).

I'm dreaming someone will one day invent a non-POSIX storage API
which databases can use to solve the performance problem while
remaining copy-on-write and snapshottable.  zvols are already working
on FreeBSD at least; in fact over there they are constantly
misunderstanding how you're supposed to use ZFS and putting UFS
filesystems inside zvols because they think ZFS is like geom.

wk> WAPBL may do what you want,

It may, but it does not share this ZFS/LFS consistency characteristic
I just described.  The filesystems that do metadata logging (ext3,
xfs, hfs+journal) don't.  With LFS's log-STRUCTURED layout, all the
data goes into the log, and with ZFS all data is copy-on-write, while
most journaled filesystems put only metadata in the log and overwrite
data in the traditional way.  Metadata-only logging also doesn't
share all the performance characteristics (neither the good ones nor
the bad ones :).

cc> this happened before I think when we went to
cc> UBC (I think so, anyway! I don't think it was UVM).

IIRC for a while it was working in general, but mmap() was not
working.  I think that was fixed, but I don't remember for sure.

cc> Konrad picked it up and got it more or less working by 2, but
cc> yeah, I imagine it's not a priority for him.

Was anyone besides Konrad ever working on it?  That seems kind of
unsustainable.

Anyway, to me it almost looks as though we're poised to skip right
over the 1-5 TB filesystems that FFS+softdep can't handle, towards
100 TB filesystems created and consumed by large clusters.  Catching
up with where Linux was a decade ago is hardly interesting.  Some
things to consider as you scale filesystems to extremely large
numbers of spindles, beyond just fsck time (some back-of-the-envelope
numbers are sketched after this list):

* There will be an O(n) ``scrub'' operation to read every block.
  This is non-optional: RAIDs need this operation to catch one bad
  disk before it turns into two bad disks.  You cannot just write
  your data, never read it, and expect to be magically interrupted
  with a hardware error report if it ever disappears.  NetApp scrubs
  every week.  Make sure your scrub is able to run to completion,
  ~weekly, without killing performance; if scrub kills performance,
  it'll have to run very, very fast.  And you cannot reset scrub and
  restart it from zero upon operations that happen frequently: ZFS,
  for example, used to restart it upon taking a snapshot, until they
  fixed that.

* When disks go bad, can you resilver them?  For example, suppose you
  come up with something having 100 disks, taking one week to
  resilver, and only one disk can resilver at a time.  SATA annual
  failure rate is ~2%/yr, and probably much higher in the first
  month.  That isn't workable.  With wide arrays you need to not only
  tolerate multiple failures but be able to run multiple concurrent
  resilvers.  And still, remember, you must be able to run scrub to
  completion: you cannot interrupt scrub and restart it from zero
  upon resilver (as ZFS currently does) if you have so many disks in
  the pool that you're always resilvering something.

* Backup and restore.  Suppose you back up to another pool of the
  same size, and you have really good incremental backups,
  ``replication'' like NetApp has.  Great, but is the pool so big
  that it DEPENDS on the incremental feature to back up in a
  human-scale amount of time, so big that it'll take a month to
  restore it completely from backup?  In that case you will have to
  run two pools, and upon failure be prepared to invert their roles
  and replicate in the other direction.  With such a long restore
  window maybe two pools isn't enough.  In any case you can't just do
  a plain old restore like in the old days with such large pools.
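To put rough numbers on those three bullets, here is a
back-of-the-envelope sketch.  The pool size, rebuild time, failure
rates, and restore bandwidth are all assumptions I picked for
illustration, not measurements of any real array:

#include <stdio.h>

int
main(void)
{
	double pool_tb = 100.0;			/* assumed pool size */
	double week_s = 7.0 * 24.0 * 3600.0;

	/* Scrub: reading every block once a week needs this much
	 * sustained bandwidth on top of the normal workload. */
	double scrub_mbs = pool_tb * 1e6 / week_s;
	printf("weekly scrub of %.0f TB: ~%.0f MB/s sustained\n",
	    pool_tb, scrub_mbs);

	/* Resilver: 100 disks, one-week rebuilds, one at a time,
	 * at a steady ~2%/yr failure rate and an assumed higher
	 * early-life rate. */
	int ndisks = 100;
	double rebuild_weeks = 1.0;
	double afr_steady = 0.02;
	double afr_infant = 0.20;	/* assumed, for illustration */
	double busy_steady = ndisks * afr_steady * rebuild_weeks / 52.0;
	double busy_infant = ndisks * afr_infant * rebuild_weeks / 52.0;
	printf("resilvering ~%.0f%% of the time at 2%%/yr, "
	    "~%.0f%% early in life\n",
	    busy_steady * 100.0, busy_infant * 100.0);

	/* Restore: a full restore at an assumed effective 40 MB/s. */
	double restore_days = pool_tb * 1e6 / 40.0 / 86400.0;
	printf("full restore at 40 MB/s: ~%.0f days\n", restore_days);
	return 0;
}

With those assumed inputs you get roughly 165 MB/s of continuous
scrub traffic, a pool that spends a large fraction of its early life
resilvering, and a restore window of about a month.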
Ultimately I think something like ZFS will not scale, because it
bottlenecks all data through a single kernel image.  It will have to
be something using SCSI multiple-initiator features like OCFS, or
else a big filesystem above many small filesystems like Lustre, pNFS,
or GlusterFS.  A better design might be to duplicate the ZFS split
between object store and POSIX filesystem, but crack it across a
network layer, and build something Lustre-like to begin with that
operates in a degraded localhost mode when you want a plain
filesystem.

The other thing that's needed is a native NAND flash filesystem with
really good cord-yank characteristics (Linux's JFFS2 does no write
buffering at all), but JFFS2 isn't working well with current designs
where FLASH size >> RAM size.
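As I understand the JFFS2 design, it scans the whole medium at mount
and keeps an in-core record for every node it finds, so both mount
time and RAM use grow with the size of the flash.  A rough sketch of
why that stops scaling; the node size, per-node overhead, and scan
bandwidth below are assumptions for illustration, not JFFS2's real
constants:

#include <stdio.h>

int
main(void)
{
	double flash_gb = 32.0;		/* assumed flash size */
	double node_kb = 4.0;		/* assumed average node size */
	double percore_b = 48.0;	/* assumed in-core bytes per node */
	double scan_mbs = 20.0;		/* assumed raw scan bandwidth */

	double nodes = flash_gb * 1e6 / node_kb;
	double ram_mb = nodes * percore_b / 1e6;
	double mount_s = flash_gb * 1e3 / scan_mbs;

	printf("%.0f GB flash in %.0f KB nodes: ~%.0f MB of index RAM\n",
	    flash_gb, node_kb, ram_mb);
	printf("and a ~%.0f second scan at mount (%.0f MB/s)\n",
	    mount_s, scan_mbs);
	return 0;
}

With those made-up numbers a 32 GB part wants hundreds of megabytes
of index RAM and a mount-time scan measured in tens of minutes, which
is the FLASH >> RAM problem in a nutshell.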