tech-kern archive

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

On Thu, Mar 02, 2017 at 09:11:17PM +0100, Jaromír Doleček wrote:
 > > Some quick thoughts, though:
 > >
 > > (1) ultimately it's necessary to patch each driver to crosscheck the
 > > flag, because otherwise eventually there'll be silent problems.
 > Maybe. I think I like having this as the caller's responsibility for
 > now; it avoids overly broad tree changes. Ultimately it might indeed be
 > necessary, if we find out that it can't reasonably be handled by
 > the caller. Like maybe raidframe kicking a spare disk without FUA
 > into a set with FUA.

It's more like "in the long run it is hazardous to assume the
upper-level code is perfectly correct".
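A minimal sketch of what such a driver-side crosscheck might look like, assuming a flag along the lines of the B_MEDIASYNC name floated below. Everything here (TOY_B_MEDIASYNC, struct toy_dev, toy_check_write_flags) is hypothetical and is not an existing NetBSD interface; the point is only that a request carrying the flag fails loudly on a device that cannot honor it, instead of silently degrading to a cached write.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: "this write must be synchronous at the media level". */
#define TOY_B_MEDIASYNC 0x01

/* Hypothetical per-device capability record. */
struct toy_dev {
	bool has_fua;	/* device can honor a FUA-style write */
};

/*
 * Crosscheck the request flags against the device capabilities.
 * Returns 0 if the request may proceed, or EOPNOTSUPP if the caller
 * asked for media-synchronous semantics the device cannot provide.
 */
int
toy_check_write_flags(const struct toy_dev *dev, int flags)
{
	assert(dev != NULL);

	if ((flags & TOY_B_MEDIASYNC) != 0 && !dev->has_fua)
		return EOPNOTSUPP;
	return 0;
}
```

With a check like this, the raidframe case mentioned above (a spare without FUA joining a set with FUA) would surface as an error on the first flagged write rather than as silent data loss.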

 > > (2) it would be better not to expose hardware-specific flags in the
 > > buffercache, so it would be better to come up with a name that
 > > reflects the semantics, and a semantic guarantee that's at least
 > > notionally not hardware-specific.
 > I want to avoid unnecessary private NetBSD nomenclature. If storage
 > industry calls it FUA, it's probably good to just call it FUA.

Yes and no; there's also standard terminology for talking about
caches, so my inclination would be to call it something like
B_MEDIASYNC: synchronous at the media level.

 > For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe
 > for the same functionality, but not sure if it matches with  what swap
 > is using this flag for. DPO is ideal for journal writes however,
 > that's why I want to add the support for it now.

What does DPO do?

 > > (3) as I recall (can you remind those of us not currently embedded in
 > > this stuff what the semantics of FUA actually are?) FUA is *not* a
 > > write barrier (as in, all writes before happen before all writes
 > > after) and since write barriers are a natural expression of the
 > > requirements for many fses, it would be well to make sure the
 > > implementation of this doesn't conflict with that.
 > FUA doesn't enforce any barriers. It merely changes the semantics of
 > the write request - the hardware will return success response only
 > after the data is written to non-volatile media.
 > Any barriers required by filesystem semantics need to be handled by the
 > fs code, same as now with DIOCCACHESYNC.
 > I've talked about adding some kind of generic barrier support in the
 > previous thread. After thinking about it, and reading more, I'm not
 > convinced it's necessary. Incidentally, Linux has moved away from the
 > generic barriers and pushed the logic into their fs code, which can
 > DTRT with e.g. journal transactions, too.

The reason barriers keep coming up is that barriers express the
requirements of filesystems reasonably well; e.g. for a journaling
filesystem, the requirement is that when you write a bunch of journal
blocks, they must all become permanent before any following blocks.
Similarly, for a snapshot/shadow-paging based fs like zfs, you write a
whole bunch of stuff, then a new superblock (which must come strictly
after) and then you go on. And for log-structured fses, generally you
want all the blocks from one segment to be written before any of the

Perfect abstractions for any of these would be more complex, but
barriers serve pretty well.
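To make the journaling case concrete, here is a toy model (not kernel code) of a disk with a volatile write cache. All names (toy_disk, toy_write_fua, toy_commit, and so on) are made up for illustration, and the crash model is the worst case: everything still in the cache is lost. It shows that a FUA-style write of the commit record alone does not order the earlier journal blocks, while a full cache flush before the commit record behaves like a barrier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NBLOCKS 8

/* Toy disk: a volatile write cache in front of stable media. */
struct toy_disk {
	int  media[TOY_NBLOCKS];	/* contents that survive power loss */
	int  cache[TOY_NBLOCKS];	/* contents as seen through the cache */
	bool dirty[TOY_NBLOCKS];	/* cache[i] has not reached media[i] */
};

void
toy_init(struct toy_disk *d)
{
	memset(d, 0, sizeof(*d));
}

/* Ordinary write: completes once the data is in the volatile cache. */
void
toy_write(struct toy_disk *d, size_t blk, int val)
{
	assert(blk < TOY_NBLOCKS);
	d->cache[blk] = val;
	d->dirty[blk] = true;
}

/* FUA-style write: this one block is on media at completion; others are untouched. */
void
toy_write_fua(struct toy_disk *d, size_t blk, int val)
{
	assert(blk < TOY_NBLOCKS);
	d->cache[blk] = val;
	d->media[blk] = val;
	d->dirty[blk] = false;
}

/* Whole-cache flush, in the spirit of DIOCCACHESYNC. */
void
toy_flush(struct toy_disk *d)
{
	for (size_t i = 0; i < TOY_NBLOCKS; i++) {
		if (d->dirty[i]) {
			d->media[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
}

/* Power loss, worst case: every block still dirty in the cache is lost. */
void
toy_crash(struct toy_disk *d)
{
	for (size_t i = 0; i < TOY_NBLOCKS; i++) {
		d->cache[i] = d->media[i];
		d->dirty[i] = false;
	}
}

/*
 * Write journal blocks 1..njournal, then a commit record in block 0.
 * With flush_first the flush acts as the barrier; without it, only the
 * commit record itself is forced to media.
 */
void
toy_commit(struct toy_disk *d, size_t njournal, int txid, bool flush_first)
{
	assert(njournal < TOY_NBLOCKS);
	for (size_t i = 1; i <= njournal; i++)
		toy_write(d, i, txid);
	if (flush_first)
		toy_flush(d);
	toy_write_fua(d, 0, txid);
}

/* True if every journal block on media matches the commit record on media. */
bool
toy_journal_consistent(const struct toy_disk *d, size_t njournal)
{
	assert(njournal < TOY_NBLOCKS);
	for (size_t i = 1; i <= njournal; i++) {
		if (d->media[i] != d->media[0])
			return false;
	}
	return true;
}
```

Calling toy_commit() with flush_first set and then toy_crash() leaves a consistent journal; with flush_first clear, the commit record survives the crash but the journal blocks it refers to do not.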

Single synchronous block writes are a bad way to implement barriers
and it maybe makes sense to have two models and force every fs to be
able to do things two different ways; but single synchronous block
writes are also a bad way to implement any of the above invariants.
E.g. I'm not convinced that writing out journal blocks synchronously
one at a time will be faster than flushing the cache at the end of a
journal write, even though the latter inflicts collateral damage in
the sense of waiting for perhaps many blocks that don't need to be
waited for.

I guess it would help if I knew what you were intending to do with
wapbl in this regard; have you posted that? (I've been at best
skimming tech-kern the past few months...)

 > > (3a) Also, past discussion of this stuff has centered around trying to
 > > identify a single coherent interface for fs code to use, with the
 > > expansion into whatever hardware semantics are available happening in
 > > the bufferio layer. This would prevent needing conditional logic on
 > > device features in every fs. However, AFAICR these discussions have
 > > never reached any clear conclusion. Do you have any opinion on that?
 > I think that I'd like to have at least two different places in kernel
 > needing particular interface before generalizing this into a bufferio
 > level. Or at minimum, I'd like to have it working correctly in one
 > place, and then it can be generalized before using it in a second
 > place. It would be awesome to use FUA e.g. for fsync(2), but let's not
 > get too ahead of ourselves.

Well, if I counted correctly we have seventeen on-disk filesystems (if
you count wapbl separately) and while one of them's read-only, all the
others need to manipulate the disk cache. Most of them currently don't
and are thus just wrong. Some of them are probably unfixable in that
they're implementations of historic things that don't have a recovery
model, but there are at least seven that are supposed to be
recoverable (traditional ffs, lfs, nilfs, ntfs, udf, wapbl, zfs; plus
also any of the new things on the project list) so there's no shortage
of potential clients.

Getting it working first is great but I'm not sure a broadly exposed
piece of infrastructure should be committed in a preliminary design
state... especially in a place (the legacy buffer cache) that's
already a big ol' mess.

 > We don't commit too much right now besides a B_* flag. I'd rather
 > keep this raw and lean for now, and concentrate on fixing the device
 > drivers to work with the flags correctly. Only then maybe come up with
 > an interface to make it easier for general use.
 > I want to avoid broadening the scope too much. Especially since I want
 > to introduce SATA NCQ support within the next few months, which might
 > need some tweaks to the semantics again.

I guess what worries me is the possibility of getting interrupted by
real life and then all this remaining in a half-done state in the long
term; there are few things worse for maintainability in general than
half-finished reorgs that end up getting left to bitrot. :-/

Is there something more generic / less hardware-specific that we can
put in the fs in the near term?

Keep in mind that whatever it is might end up in -8 and needing to be
maintained for years...

 > > We don't want to block improvements to wapbl while we figure out the
 > > one true device interface, but on the other hand I'd rather not
 > > acquire a new set of long-term hacks. Stuff like the "logic" wapbl
 > > uses to intercept the synchronous writes issued by the FFS code is
 > > very expensive to get rid of later.
 > Yes, that funny bwrite() not being a real bwrite() until issued for
 > the second time from WAPBL :) Quite ugly. It's a shame the B_LOCKED
 > hack is not really extensible to also cover data in the journal, as it
 > holds all transaction data in memory.

B_LOCKED doesn't seem to be documented :(

David A. Holland
