tech-kern archive


Re: Proposal: B_ARRIER (addresses wapbl performance?)

On Tue, Dec 09, 2008 at 12:12:32PM -0800, Jason Thorpe wrote:
> On Dec 9, 2008, at 11:00 AM, Manuel Bouyer wrote:
>> But I don't get why we want to have the journal on stable storage
>> *now*. From what I understand, what we want is to have the journal
>> entry for this transaction to stable storage *before* the metadata
>> changes start to hit stable storage, and the journal cleanup hits
>> stable storage *after* all metadata changes are on stable storage.
>> I can't see what FUA brings us here, as long as writes to stable
>> storage are properly ordered.
> You want your journal to remain self-consistent, otherwise you can't  
> trust it to replay it.

The performance part of the argument is separate from that, and is the
reason that ordering constraints alone are not "enough", in the sense
Manuel was after.

Ordering constraints would produce a self-consistent journal. However,
using ordered journal writes would also force other (unrelated)
pending unordered writes out as well, adding latency to the
journalling.  The journalling is already an overhead penalty (more
total writes than would otherwise be required if dependencies could be
fully ordered) and needs to jump the queue if it is to offer a latency
reduction (reduced number of writes needed for *this* transaction to
be acknowledged to upper layers).
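The queue-jumping point can be sketched with a toy model (Python, purely
illustrative; the classes and flags are hypothetical, not any real driver
API): a drive with a volatile write cache, where a full cache flush forces
out every pending write, while an FUA-style tagged write commits only the
block it is attached to.

```python
# Toy model of a drive with a volatile write cache (illustrative only).
class Drive:
    def __init__(self):
        self.cache = []    # writes acknowledged but not yet stable
        self.stable = []   # writes on stable storage

    def write(self, block, fua=False):
        if fua:
            # FUA-style tagged write: committed directly, jumping the queue.
            self.stable.append(block)
        else:
            self.cache.append(block)

    def flush(self):
        # A cache flush forces out *all* pending writes, related or not.
        forced = len(self.cache)
        self.stable.extend(self.cache)
        self.cache.clear()
        return forced

d = Drive()
for i in range(100):                  # unrelated dirty data queued in cache
    d.write(("data", i))
d.write(("journal", "commit-1"))
forced = d.flush()                    # ordering via flush drags all 101 out

d2 = Drive()
for i in range(100):
    d2.write(("data", i))
d2.write(("journal", "commit-2"), fua=True)  # only this write is committed
```

With the flush, acknowledging one transaction waits on 101 writes; with the
tagged write, on exactly one, and the unrelated data stays cached.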

> Also, without explicit cache management, you can't be sure when the data 
> gets written out of the drive's cache.  Again, command completion has 
> nothing to do with how the writes are ordered to the oxide.

And getting to oxide specifically, versus other suitable stable
storage like NVRAM, has nothing to do with journalling until you reach
the point of caring about visibility differences between the cache and
the oxide -- as you clearly do in the cluster case. Even in the
cluster case, you're less concerned about oxide (if there even is any)
than about shared stable storage.

With a volatile cache, you know you need to get past it. With a fast
non-volatile cache in the right place, you can get great performance
for fully ordered writes and forgo journalling entirely.  Trying to
design something generic for the mix of conditions in between gets us
into discussions like this :-)
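The volatile/non-volatile distinction above can be made concrete with a
small crash model (illustrative only; not any real interface): command
completion means "reached the cache", and only a non-volatile cache's
contents survive power loss.

```python
# Crash model (illustrative): a volatile write cache loses its contents
# on power loss; a non-volatile (NVRAM) cache keeps them.
class Cache:
    def __init__(self, volatile):
        self.volatile = volatile
        self.blocks = []           # writes acknowledged at this layer

    def write(self, block):
        self.blocks.append(block)  # completion == "reached the cache"

    def power_loss(self):
        # Only a non-volatile cache's contents survive a crash.
        return [] if self.volatile else list(self.blocks)

vol = Cache(volatile=True)
nv = Cache(volatile=False)
for cache in (vol, nv):
    cache.write("journal-entry")
    cache.write("metadata")

lost = vol.power_loss()   # nothing survives: you must get past this cache
kept = nv.power_loss()    # everything survives: reaching it was "enough"
```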

The trouble is really knowing enough about the IO topology (of which
there may be several layers, each with its own cache - host
controller, SAN RAID controller, external JBOD disk, etc.) to be sure
what you're getting on completion: are write caches volatile? Are
there differing visibility or fault domains involved at each layer?
How far do you need to get through these layers before you have
"enough" commitment for your specific needs, and at what cost?
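One way to frame that question: "enough" commitment is reaching the first
layer in the stack that survives a fault. A sketch (Python; the example
stack is hypothetical):

```python
# Layered IO topology (hypothetical stack): a write is durable once it
# reaches the first layer whose contents survive power loss.
layers = [
    ("host controller cache", True),       # volatile
    ("SAN RAID controller NVRAM", False),  # non-volatile
    ("disk write cache", True),
    ("oxide", False),
]

def layers_to_durability(stack):
    """Return how deep a write must travel before it is 'enough',
    and the name of the layer that makes it so."""
    for depth, (name, volatile) in enumerate(stack, start=1):
        if not volatile:
            return depth, name
    raise RuntimeError("no stable layer in stack")

depth, name = layers_to_durability(layers)
# With NVRAM in the RAID controller, two layers suffice: there is no
# need to push the write all the way down to oxide.
```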

ZFS had a lot of performance trouble with some SAN and external RAID
systems at one stage, because it treated them like local disks whose
cache contents it fully controlled: the syncs issued to complete and
close one transaction could force cache flushes of many other
unrelated writes -- writes for future transactions, writes from
previous transactions already committed to NVRAM, or even totally
unrelated activity in other filesystems. That has been fixed through a
mix of clearing up interpretation differences about what stable
storage requires and what cache flushing does, and most importantly by
letting the admin and system designer be explicit about IO topology,
using dedicated read and write cache devices that ZFS can manage
directly.


