tech-kern archive


Re: Proposal to enable WAPBL by default for 10.0



> Date: Thu, 23 Jul 2020 07:45:08 +0200
> From: Michał Górny <mgorny%gentoo.org@localhost>
> 
> On Thu, 2020-07-23 at 05:17 +0000, David Holland wrote:
> > The problem is that because it still doesn't do anything about
> > journaling or preserving file contents, but runs a lot faster, it
> > loses more data when interrupted.
> 
> How does that compare to the level of damage non-journaled FFS takes?
>  My VM was just bricked a second time because /etc/passwd was turned to
> junk.  I dare say that a proper metadata journaling + proper writes
> (i.e. using rename() -- haven't verified whether that's done correctly)
> should prevent that from happening again.

Metadata journaling doesn't do anything about that, and it never has.

It is a common misconception that metadata journaling has anything to
do with making a system more robust against data corruption.

Metadata journaling is primarily about making it _faster_ to pick up
after an interruption such as a crash or power failure, and faster to
issue writes in the first place, at the cost of doubling the number of
metadata writes.

- In traditional ffs, every operation issues metadata writes
  synchronously in a particular order.

  This way, if an operation is interrupted, then on reboot, `fsck -p'
  can reliably identify what state the file system was in, and either
  roll back to undo the operation or roll forward to complete it.

  Of course, identifying that state requires doing a global analysis
  of the file system structure, so it's slow, and the larger the file
  system is the slower it gets.

  (Note: `fsck -p' is part of the file system design; fsck _without_
  `-p' is pray-to-recover from `unexpected inconsistencies' arising
  either from bugs or from hardware failures.)

- With wapbl, every operation issues metadata writes in order _twice_:
  first to a sequential log and then -- once all the writes to the log
  for the operation have been committed to disk -- to the locations
  where the metadata blocks actually live.

  This way, if an operation is interrupted, then on reboot, log replay
  can reliably roll forward operations whose metadata writes were
  committed in the log, and discard the rest to roll back operations
  whose metadata writes were not committed.

  Log replay takes time roughly proportional to the number of
  in-flight operations rather than to the size of the file system, so
  it's much cheaper than the global analysis of `fsck -p' for large
  disks.

  wapbl only requires the metadata writes to be serialized -- not
  synchronous -- so even though it issues every metadata write twice,
  it tends to have much higher write throughput (especially on
  spinning rust) since metadata writes don't happen in lock-step with
  the disk write latency.
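
To make the ordering in the second case concrete, here is a toy
user-level sketch in C.  This is not wapbl's actual code; the file
names and the log record format are made up, with two ordinary files
standing in for the on-disk journal and for the place where the
metadata block normally lives.

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	/* A made-up record describing one metadata update. */
	const char record[] = "inode 42: length 0 -> 4096\n";
	int logfd, metafd;

	if ((logfd = open("journal", O_WRONLY|O_CREAT|O_APPEND, 0644)) == -1)
		err(1, "open journal");
	if ((metafd = open("metadata", O_WRONLY|O_CREAT|O_TRUNC, 0644)) == -1)
		err(1, "open metadata");

	/* 1. Append the record to the sequential log. */
	if (write(logfd, record, sizeof(record) - 1) == -1)
		err(1, "write journal");

	/* 2. Commit the log to stable storage before anything else. */
	if (fsync(logfd) == -1)
		err(1, "fsync journal");

	/*
	 * 3. Only now issue the in-place metadata write.  Crash after
	 *    step 2 and replay of the committed record redoes this
	 *    write; crash before step 2 and the record is discarded,
	 *    so the operation simply never happened.
	 */
	if (write(metafd, record, sizeof(record) - 1) == -1)
		err(1, "write metadata");

	return 0;
}

The point is only the order of steps 1-3: the in-place write never
goes out before the log record covering it is known to be on disk.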

Of course, the devil is in the details: wapbl is actually more
complicated than that, and we screwed up the on-disk format ages ago.
So wapbl has various shortcomings.  For example, it crashes when the
number of metadata writes needed to atomically truncate a large file
exceeds the free space left in the log on disk, because we failed to
guarantee that every operation runs in (small) constant log space and
to preallocate enough space up front.

ffs also has a long-standing bug I call the `garbage data appended
after crash' bug: when you append data to a file, ffs will
_synchronously_ allocate data blocks and update the inode length, and
_asynchronously_ write the data to the new blocks.  If interrupted,
the new blocks may be allocated and the inode length updated, but the
new blocks may contain garbage because the asynchronous data writes
haven't completed yet.  The result is that it's as if you appended
garbage data to the end of the file.  You can work around it by
writing to a temporary file, fsyncing the temporary file, and renaming
it to the permanent location, but it's a bug nevertheless.
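
For illustration, here is a minimal C sketch of that workaround; the
file names and contents are made up.  (Depending on how durable the
rename itself needs to be, fsyncing the containing directory may also
be wanted.)

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char new_contents[] = "example contents\n";
	int fd;

	if ((fd = open("file.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0644)) == -1)
		err(1, "open");
	if (write(fd, new_contents, sizeof(new_contents) - 1) == -1)
		err(1, "write");

	/* Force the data blocks to disk before exposing the new name. */
	if (fsync(fd) == -1)
		err(1, "fsync");
	if (close(fd) == -1)
		err(1, "close");

	/*
	 * rename(2) is atomic: after a crash, "file" names either the
	 * old contents or the fully written new ones, never freshly
	 * allocated blocks full of garbage.
	 */
	if (rename("file.tmp", "file") == -1)
		err(1, "rename");

	return 0;
}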

wapbl makes this bug _worse_ by issuing the metadata writes much
faster -- since they only need to be serialized, not synchronous -- so
the bug can apply to many more files and much more data.

All of this is to say: wapbl -- and journaling generally -- doesn't do
anything more than ffs to change the `level of damage' in any
qualitative way; but both traditional ffs and ffs+wapbl have something
that you might call a `data loss' bug (more accurately, file
corruption), and it's quantitatively _worse_ for wapbl.

So I'm not clear on where kamil gets the idea that wapbl is less prone
to data loss, and the symptom you (mgorny) described is consistent
with the bug that wapbl makes worse.


(There are various ways we _could_ approach the shortcomings of ffs
and wapbl: impose ordering constraints on data writes to fix the
garbage data appended after crash bug (`soft updates'), for example;
create new types of logical log entries to atomically truncate inodes
so that truncation can run in constant log space; do bookkeeping for
wapbl transactions better so we never run out of space.  But some of
these require changes to the on-disk format, and overall it's a lot of
work...which is why I used to use ffs+sync on my laptop, and these
days I avoid ffs altogether in favour of zfs and lfs, except on
install images written to USB media.)

