tech-kern archive

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-03 18:11 GMT+01:00 David Holland <>:
> Yes and no; there's also standard terminology for talking about
> caches, so my inclination would be to call it something like
> B_MEDIASYNC: synchronous at the media level.

Okay, this might be good. Words are better than acronyms :)

>  > For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe
>  > for the same functionality, but not sure if it matches with  what swap
>  > is using this flag for. DPO is ideal for journal writes however,
>  > that's why I want to add the support for it now.
> What does DPO do?

It tells the hardware not to store the data in its cache. Or more
precisely, not to put it into the cache if that would mean evicting
something else from it. It should improve the general performance
of the disk - the journal writes will not trash the device cache.
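
To make the eviction rule concrete, here is a toy user-space model. It is
only a sketch: the slot count, the oldest-first eviction and all names
(toy_cache, cache_write) are assumptions made for illustration, and real
drives are free to implement DPO however they like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 4   /* assumed toy cache size */

struct toy_cache {
    int blk[CACHE_SLOTS];   /* cached block numbers, oldest first */
    size_t used;            /* number of occupied slots */
    int evictions;          /* how many blocks were pushed out */
};

/* Return true if blkno currently sits in the cache. */
static bool cache_has(const struct toy_cache *c, int blkno)
{
    for (size_t i = 0; i < c->used; i++)
        if (c->blk[i] == blkno)
            return true;
    return false;
}

/*
 * Model one write. With dpo set, the block is cached only when a free
 * slot exists; it never evicts anything. Without dpo, a full cache
 * evicts its oldest entry to make room.
 */
static void cache_write(struct toy_cache *c, int blkno, bool dpo)
{
    assert(c->used <= CACHE_SLOTS);
    if (cache_has(c, blkno))
        return;
    if (c->used == CACHE_SLOTS) {
        if (dpo)
            return;             /* would evict: bypass the cache */
        for (size_t i = 1; i < c->used; i++)
            c->blk[i - 1] = c->blk[i];
        c->used--;
        c->evictions++;
    }
    c->blk[c->used++] = blkno;
}
```

With the cache full of file data, any number of journal writes issued with
dpo set leaves that data in place, while a single ordinary write pushes
the oldest block out.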


> Perfect abstractions for any of these would be more complex, but
> barriers serve pretty well.

Perhaps we could start with reworking DIOCCACHESYNC into a barrier :)
Currently it is not actually guaranteed to be executed after the
already queued writes - the ioctl is executed out of band, bypassing
the bufq queue. Hence it doesn't quite work if there are any
in-flight async writes, as queued e.g. by the bawrite() calls in

In Linux the block write interface accounts for that: there are flags
to ask for a sync to be done before or after the I/O, and it is also
possible to send an empty I/O request carrying only the sync flags.
Thus the sync is always queued along with the writes. It would be good
to adopt something like this, but that would require bufq interface
changes and possibly device driver changes, with much broader tree
disturbance.
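
A rough sketch of what "the sync is queued along with the writes" means,
as a toy user-space FIFO. Nothing here is the actual Linux or bufq
interface: the flag names (RQ_WRITE, RQ_PREFLUSH, RQ_POSTFLUSH), the
structures and the fixed queue length are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request flags, not real kernel names. */
#define RQ_WRITE     0x1    /* request carries a data block */
#define RQ_PREFLUSH  0x2    /* flush device cache before this request */
#define RQ_POSTFLUSH 0x4    /* flush device cache after this request */

#define QLEN 16

struct request {
    int flags;
    int blkno;
};

struct queue {
    struct request rq[QLEN];
    size_t head, tail;      /* dequeue at head, enqueue at tail */
};

/* Toy device with a volatile write cache. */
struct device {
    int cached;             /* blocks sitting in the volatile cache */
    int on_media;           /* blocks that reached stable storage */
};

/* Enqueue a request; returns -1 if the queue is full. */
static int q_put(struct queue *q, int flags, int blkno)
{
    if (q->tail - q->head == QLEN)
        return -1;
    q->rq[q->tail % QLEN] = (struct request){ flags, blkno };
    q->tail++;
    return 0;
}

static void dev_flush(struct device *d)
{
    d->on_media += d->cached;
    d->cached = 0;
}

/* Process requests strictly in queue order, flushes included. */
static void q_run(struct queue *q, struct device *d)
{
    while (q->head != q->tail) {
        struct request *r = &q->rq[q->head % QLEN];
        q->head++;
        if (r->flags & RQ_PREFLUSH)
            dev_flush(d);
        if (r->flags & RQ_WRITE)
            d->cached++;
        if (r->flags & RQ_POSTFLUSH)
            dev_flush(d);
    }
}
```

Because the flush is just another queue entry, it cannot overtake the
writes queued before it, which is exactly the guarantee an ioctl that
bypasses the queue cannot give.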

At least with FUA the caller can ensure that all its writes are safely
on the media, without depending on an out-of-band ioctl.
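
As a toy illustration of the difference, assuming a per-buffer flag along
the lines of the B_MEDIASYNC name suggested above (the flag value, the
toy_disk structure and both functions are hypothetical, not the proposed
patch):

```c
#include <assert.h>

/* Hypothetical flag value; only the name comes from the thread. */
#define B_MEDIASYNC_SKETCH 0x1

struct toy_disk {
    int in_cache;   /* writes acknowledged but only in volatile cache */
    int on_media;   /* writes that reached stable storage */
    int flushes;    /* number of full cache flushes performed */
};

/*
 * Model one completed write. With the flag set the write behaves like
 * a FUA write: completion implies the data is on the media. Otherwise
 * completion only means the data reached the device cache.
 */
static void disk_write(struct toy_disk *d, int bflags)
{
    if (bflags & B_MEDIASYNC_SKETCH)
        d->on_media++;
    else
        d->in_cache++;
}

/* Model a full cache flush: everything cached gets written out. */
static void disk_cache_sync(struct toy_disk *d)
{
    assert(d->in_cache >= 0);
    d->on_media += d->in_cache;
    d->in_cache = 0;
    d->flushes++;
}
```

In this model, journal writes issued with the flag are stable once they
complete, and unrelated cached data is not forced out along with them,
whereas disk_cache_sync() waits for everything in the cache.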

> Single synchronous block writes are a bad way to implement barriers
> and it maybe makes sense to have two models and force every fs to be
> able to do things two different ways; but single synchronous block
> writes are also a bad way to implement any of the above invariants.
> E.g. I'm not convinced that writing out journal blocks synchronously
> one at a time will be faster than flushing the cache at the end of a
> journal write, even though the latter inflicts collateral damage in
> the sense of waiting for perhaps many blocks that don't need to be
> waited for.

Indeed - writing journal blocks synchronously one by one is unlikely to
be faster than sending them all async and doing a cache flush at the
end; that wouldn't make sense.

I plan to change WAPBL to do the journal writes partially async. It
will use several bufs, issue the I/O asynchronously and only wait if
it runs out of buffers, or if it needs to do the commit. It seems there
are usually three or four block writes done as part of the transaction
commit, so there is a decent opportunity for parallelism.
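
The control flow would be roughly as follows. This is a user-space sketch
of the idea only, not the WAPBL patch: the pool size of four, the jpool
structure and the function names are assumptions, and "waiting" is
modelled by simply retiring the oldest outstanding write.

```c
#include <assert.h>

#define NBUFS 4     /* assumed number of journal bufs */

struct jpool {
    int inflight;   /* bufs with I/O still outstanding */
    int stalls;     /* times the writer blocked for a free buf */
    int completed;  /* journal writes known to have finished */
};

/* Stand-in for waiting on the oldest outstanding journal write. */
static void jpool_wait_one(struct jpool *p)
{
    assert(p->inflight > 0);
    p->inflight--;
    p->completed++;
}

/* Issue one journal write; block only when every buf is busy. */
static void jpool_write_async(struct jpool *p)
{
    if (p->inflight == NBUFS) {
        p->stalls++;
        jpool_wait_one(p);
    }
    p->inflight++;
}

/* Commit: all journal writes issued so far must have completed. */
static void jpool_commit(struct jpool *p)
{
    while (p->inflight > 0)
        jpool_wait_one(p);
}
```

With three or four writes per commit and four bufs, the writer in this
model never stalls before the commit itself; it only blocks when a
transaction issues more writes than there are bufs.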

> I guess it would help if I knew what you were intending to do with
> wapbl in this regard; have you posted that? (I've been at best
> skimming tech-kern the past few months...)

I haven't posted details on the WAPBL part of the changes. I'll put
together a patch over the weekend and send it over. It will be useful
to show my thinking on how the proposed interface could be used.

> Getting it working first is great but I'm not sure a broadly exposed
> piece of infrastructure should be committed in a preliminary design
> state... especially in a place (the legacy buffer cache) that's
> already a big ol' mess.

That's one of the reasons I want to keep the current changes minimal :)

The proposed patch doesn't actually touch the legacy buffer cache code
at all. It only adds another B_* flag, and changes hardware device
drivers to react upon it. The flag is supposed to be set by the
caller, for example by WAPBL itself. Nothing in e.g. ffs would set the

> I guess what worries me is the possibility of getting interrupted by
> real life and then all this remaining in a half-done state in the long
> term; there are few things worse for maintainability in general than
> half-finished reorgs that end up getting left to bitrot. :-/

There is a fairly good chance this will be finished into a workable
state soon - I picked journaling improvements as my Bachelor thesis
material, so it either gets done or I will fail :)

> Is there something more generic / less hardware-specific that we can
> put in the fs in the near term?

Well, FUA support looks like a good candidate: it is useful and could
have a direct positive performance impact, so I picked that. If we have
code taking advantage of FUA, it adds (another) incentive to actually
integrate AHCI NCQ support, as that is the only way to get FUA support
on more contemporary hardware. Also, it's my understanding that using
FUA instead of a full cache sync should be a huge win for RAID too, so
it's worth pursuing for that reason as well.

> keep in mind that whatever it is might end up in -8 and needing to be
> maintained for years...

I know.

>  > Yes, that funny bwrite() not being real bwrite() until issued for
>  > second time from WAPBL :) Quite ugly. It's shame the B_LOCKED hack is
>  > not really extensible to cover also data in journal, as it holds all
>  > transaction data in memory.
> B_LOCKED doesn't seem to be documented :(

It used to be used for something for LFS, then got repurposed for WAPBL.

