>>>>> "wk" == Wouter Klouwen <dublet%acm.org@localhost> writes: >>>>> "cc" == Christopher Chen <muffaleta%gmail.com@localhost> writes: wk> The only problems occur when the drive fills up, usually wk> around 80%. These problems are crashes, but I've not seen any wk> data loss. I ran it like 0.5 - 1 decade ago and found that it seemed to need some tools around it to be useful as more than /usr/obj scratchspace such as: * a plain defragmenter, or the Holy Grail: access-pattern-observing data-reorganizing cleaner * a log roll-forward agent. at some point it was truncating the log at boot instead of rolling it, meaning that the filesystem was still guaranteed to be consistent after a cord-yank, but some data which was committed to the disk and could have been rescued according to the design, was discarded instead because of missing implementation. this may affect fsync() reliability, not sure. * was NFS working? * an lfsck that can deal with all situations that plausibly arise in practice, and do it without running out of memory I think some of these were added after I stopped using it. The key advantage in practice was blazingly-fast performance for temporally-local workloads like builds or write-only workloads like untarring things. people often used such workloads for informal testing, so LFS would beat the pants off FFS. wk> IMO, there is no alternative for LFS on NetBSD. I've been using ZFS on Solaris, and it's available performantly on FreeBSD. In my experience and from the zfs-discuss list, it's not as safe/stable as UFS/FFS/ext3. The Sun people have riddled it with assertions and panics such that it'll often find something unexpected and simply refuse to let you touch your data. I think they might actually be on the right track with this approach, but it sounds like they haven't arrived yet. There are a bunch of obstructionist appologists over there making a whole littany of excuses for the lost pools casting FUD (some of it reasonable) on every part of the system except ZFS, but the bottom line is, the guys with hundreds of filesystems and formal datacenter SAN setups say they are losing ZFS pools a lot more often than they lost UFS/vxfs filesystems. ZFS is still missing the defragmenter and the lfsck, from the LFS todo list. However, it does consistently beat UFS for performance. and it has all those other fancy features they blogflog, but to me the most interesting feature came up in list discussion recently. ZFS, I think, shares LFS's property that it always recovers to some state the filesystem passed through in the minutes leading up to a crash. POSIX guarantees no such thing, nor is any such thing implied by a working version of POSIX's ``optional'' (ignored on Mac OS X) fsync() call. The ZFS/LFS guarantee makes so much intuitive sense to application developers that they often assume they have it, but they don't, and they've never been promised it, and they've never had it on Unix until now, until this latest batch of filesystems. This guarantee makes unnecessary most of the bizarre tricks sqlite3 does to survive cord-yanking: http://sqlite.org/atomiccommit.html If we can come up with a word for it, achieve it, and start bragging about it, it may become expected, and writing programs might be a lot nicer. You should care about this because BDB does *NOT* do all those sqlite3 tricks. Also, casual things, like if I yank the cord during 'make' do I need to 'make clean' or can I just restart it? 
And of course things like Postfix queues, maildirs, and NFSv3
correctness are even more worry-free than before (on ZFS where
fsync() works, but not necessarily on LFS), if you were worried about
them before, which you probably weren't.

The downside is that so far the filesystems that can keep this
promise, like ZFS and LFS (and probably WAFL?), are the filesystems
that never overwrite blocks in place, so they are prone to
fragmentation and do particularly badly with database workloads.  In
fact I expect they'd do badly with any nested-filesystem situation, a
filesystem inside a file, like vnconfig or a virtual machine disk
image or an iSCSI backing store.  This is becoming a really common
use-case, and AIUI UFS is still beating ZFS on this one.

Another cool thing about ZFS they don't discuss much is the way the
POSIX filesystem layer is separated from the ``transactional object
store'' back-end.  There are several non-filesystem things which use
the back-end directly, such as:

* uncached zvols (iSCSI backing stores),
* pNFS (unfinished),
* Lustre (unfinished).

I'm dreaming someone will one day invent a non-POSIX storage API
which databases can use to solve the performance problem while
remaining copy-on-write and snapshottable.  zvols are already working
on FreeBSD at least; in fact over there they are constantly
misunderstanding how you're supposed to use ZFS and putting UFS
filesystems inside zvols because they think ZFS is like geom.

wk> WAPBL may do what you want,

It may, but it does not share this ZFS/LFS consistency characteristic
I just described.  The filesystems that do metadata logging (ext3,
xfs, hfs+journal) don't.  With LFS's log-STRUCTURED layout, all the
data goes into the log, and with ZFS all data is copy-on-write, while
most journaled filesystems put only metadata in the log and overwrite
data in the traditional way.  Metadata-only logging also doesn't
share all the performance characteristics (neither the good ones nor
the bad ones :).

cc> this happened before I think when we went to
cc> UBC (I think so, anyway! I don't think it was UVM).

IIRC for a while it was working in general, but mmap() was not
working.  I think that was fixed, but I don't remember for sure.

cc> Konrad picked it up and got it more or less working by 2, but
cc> yeah, I imagine it's not a priority for him.

Was anyone besides Konrad ever working on it?  That seems kind of
unsustainable.

Anyway, to me it almost looks as though we're poised to skip right
over the 1-5 TB filesystems that FFS+softdep can't handle, towards
100 TB filesystems created and consumed by large clusters.  Catching
up with where Linux was a decade ago is hardly interesting.  Some
things to consider as you scale filesystems to extremely large
numbers of spindles, beyond just fsck time (some back-of-the-envelope
numbers are sketched after this list):

* There will be an O(n) ``scrub'' operation to read every block.
  This is non-optional: RAIDs need this operation to catch one bad
  disk before it turns into two bad disks.  You cannot just write
  your data, never read it, and expect to be magically interrupted
  with a hardware error report if it ever disappears.  NetApp scrubs
  every week.  Make sure your scrub is able to run to completion,
  ~weekly, without killing performance; if scrub kills performance,
  it'll have to run very, very fast.  And you cannot reset scrub and
  restart it from zero upon operations that happen frequently: ZFS,
  for example, used to restart it upon taking a snapshot, until they
  fixed that.

* When disks go bad, can you resilver them?  For example, suppose you
  come up with something having 100 disks, taking one week to
  resilver, and only one disk can resilver at a time.  SATA annual
  failure rate is ~2%/yr, and probably much higher in the first
  month.  That isn't workable.  With wide arrays you need to not only
  tolerate multiple failures but be able to run multiple concurrent
  resilvers.  And still, remember, you must be able to run scrub to
  completion: you cannot interrupt scrub and restart it from zero
  upon resilver (as ZFS currently does) if you have so many disks in
  the pool that you're always resilvering something.

* Backup and restore.  Suppose you back up to another pool of the
  same size, and you have really good incremental backups,
  ``replication'' like NetApp has.  Great, but is the pool so big
  that it DEPENDS on the incremental feature to back up in a
  human-scale amount of time, so big that it'll take a month to
  restore it completely from backup?  In that case you will have to
  run two pools, and upon failure be prepared to invert their roles
  and replicate in the other direction.  With such a long restore
  window maybe two pools isn't enough.  In any case you can't just do
  a plain old restore like in the old days with such large pools.
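To put rough numbers on those three bullets, here is a
back-of-the-envelope sketch.  The pool size, rebuild time, failure
rates, and restore bandwidth are all assumptions I picked for
illustration, not measurements of any real array:

#include <stdio.h>

int
main(void)
{
	double pool_tb = 100.0;			/* assumed pool size */
	double week_s = 7.0 * 24.0 * 3600.0;

	/* Scrub: reading every block once a week needs this much
	 * sustained bandwidth on top of the normal workload. */
	double scrub_mbs = pool_tb * 1e6 / week_s;
	printf("weekly scrub of %.0f TB: ~%.0f MB/s sustained\n",
	    pool_tb, scrub_mbs);

	/* Resilver: 100 disks, one-week rebuilds, one at a time,
	 * at a steady ~2%/yr failure rate and an assumed higher
	 * early-life rate. */
	int ndisks = 100;
	double rebuild_weeks = 1.0;
	double afr_steady = 0.02;
	double afr_infant = 0.20;	/* assumed, for illustration */
	double busy_steady = ndisks * afr_steady * rebuild_weeks / 52.0;
	double busy_infant = ndisks * afr_infant * rebuild_weeks / 52.0;
	printf("resilvering ~%.0f%% of the time at 2%%/yr, "
	    "~%.0f%% early in life\n",
	    busy_steady * 100.0, busy_infant * 100.0);

	/* Restore: a full restore at an assumed effective 40 MB/s. */
	double restore_days = pool_tb * 1e6 / 40.0 / 86400.0;
	printf("full restore at 40 MB/s: ~%.0f days\n", restore_days);
	return 0;
}

With those assumed inputs you get roughly 165 MB/s of continuous
scrub traffic, a pool that spends a large fraction of its early life
resilvering, and a restore window of about a month.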
Ultimately I think something like ZFS will not scale, because it
bottlenecks all data through a single kernel image.  It will have to
be something using SCSI multiple-initiator features like OCFS, or
else a big filesystem above many small filesystems like Lustre, pNFS,
or GlusterFS.  A better design might be to duplicate the ZFS split
between object store and POSIX filesystem, but crack it across a
network layer, and build something Lustre-like to begin with that
operates in a degraded localhost mode when you want a plain
filesystem.

The other thing that's needed is a native NAND flash filesystem with
really good cord-yank characteristics (Linux's JFFS2 does no write
buffering at all), but JFFS2 isn't working well with current designs
where FLASH size >> RAM size.
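As I understand the JFFS2 design, it scans the whole medium at mount
and keeps an in-core record for every node it finds, so both mount
time and RAM use grow with the size of the flash.  A rough sketch of
why that stops scaling; the node size, per-node overhead, and scan
bandwidth below are assumptions for illustration, not JFFS2's real
constants:

#include <stdio.h>

int
main(void)
{
	double flash_gb = 32.0;		/* assumed flash size */
	double node_kb = 4.0;		/* assumed average node size */
	double percore_b = 48.0;	/* assumed in-core bytes per node */
	double scan_mbs = 20.0;		/* assumed raw scan bandwidth */

	double nodes = flash_gb * 1e6 / node_kb;
	double ram_mb = nodes * percore_b / 1e6;
	double mount_s = flash_gb * 1e3 / scan_mbs;

	printf("%.0f GB flash in %.0f KB nodes: ~%.0f MB of index RAM\n",
	    flash_gb, node_kb, ram_mb);
	printf("and a ~%.0f second scan at mount (%.0f MB/s)\n",
	    mount_s, scan_mbs);
	return 0;
}

With those made-up numbers a 32 GB part wants hundreds of megabytes
of index RAM and a mount-time scan measured in tens of minutes, which
is the FLASH >> RAM problem in a nutshell.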