tech-kern archive


Re: What is the best layer/device for a write-back cache based in nvram?



On Fri, Sep 09, 2016 at 11:09:49PM +0200, Jose Luis Rodriguez Garcia wrote:
 > This is a continuation of the thread "Is factible to implement full
 > writes of stripes to raid using NVRAM memory in LFS":
 > http://mail-index.netbsd.org/tech-kern/2016/08/18/msg020982.html
 > 
 > I want to discuss in what layer a write-back cache should be
 > located. It will usually be used for raid configurations, as a
 > general-purpose device: any type of filesystem or raw.

It sounds like you've already decided what layer it appears in, if
it's going to be used as a block device. I guess your question is
whether it should be integrated into raidframe or sit on top of it?
My recommendation would be to make a separate entity that's the cache,
and then add a small amount of code to raidframe to call sideways into
it when needed. Then you can also add similar code to non-raidframe ld
or wd/sd devices and you don't end up exposing raidframe internals.
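
To illustrate the shape of that, here is a minimal sketch of the kind
of sideways-call interface I have in mind. All of the nvcache_* names
and raid_nvcache_hook are made up for the example; only struct buf and
B_READ are existing kernel pieces.

/*
 * Hypothetical sketch only: a standalone write-back cache entity that
 * raidframe (or ld, wd, sd) calls sideways into from its strategy
 * path.  None of the nvcache_* names exist today.
 */
#include <sys/buf.h>

struct nvcache;                         /* opaque cache instance */

struct nvcache_ops {
        /* Try to absorb a write; returns 0 if the cache took it. */
        int     (*nc_write)(struct nvcache *, struct buf *);
        /* Try to satisfy a read from cached data; returns 0 on a hit. */
        int     (*nc_read)(struct nvcache *, struct buf *);
        /* Push dirty blocks back to the underlying device. */
        int     (*nc_flush)(struct nvcache *);
};

/*
 * The small amount of code added to raidframe would sit in its
 * strategy path, roughly like this; nonzero means "not handled, do
 * the normal RAID I/O".
 */
static int
raid_nvcache_hook(struct nvcache *nc, const struct nvcache_ops *ops,
    struct buf *bp)
{
        if ((bp->b_flags & B_READ) != 0)
                return ops->nc_read(nc, bp);
        return ops->nc_write(nc, bp);
}

Keeping the cache behind a small ops table like this is what would let
the same code be bolted under ld or wd/sd later without exposing
raidframe internals.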

Some other thoughts:

 - What you're talking about seems like it only really makes sense if
you have "fast" NVRAM, like PCM (when/if it ever appears in the market
for real) or battery-backed DRAM. Or maybe if you're using flash in
front of a small number of spinny disks. Otherwise the cost of writing
to the cache, then reading from the cache and writing back to the
underlying device, is likely to outweigh whatever you save by avoiding
excess parity I/O.

 - If you want your NVRAM cache to be recoverable (which sounds like
it's the point) you need to write enough logging data to it to be able
to recover it. This effectively means doing two writes for every write
you cache: one with the data and one to remember where the data is.
You can conceivably batch the metadata writes, but batching those
suffers from the same issues (small write chunks, frequent syncs,
etc.) that you're trying to avoid in the RAID so you can't expect it
to work very well. If both the cache and the RAID are flash, three
extra I/Os for every block means you have to save at least three I/Os
in the RAID for every block; that is not likely to be the case. Maybe
the transfer from the cache to the RAID is invisible and you only need
to save two, but that still doesn't seem that likely.

 - The chief benefit of using flash as a frontend cache for spinny
disks turns out to be that flash is much larger than main memory. This
works fine if the cache is treated as expendable in crashes (and also
can be used safely with consumer-grade SSDs that munge themselves when
the power goes off)... making the cache persistent is expensive and
only helps with write-heavy workloads that otherwise saturate the
write bandwidth or that sync a lot. I guess the latter is what you're
after... but it will still only help in front of spinny disks.

 - If you have "fast" NVRAM it won't be particularly large. (Maybe
someday PCM or memristors or something will be substantially cheaper
than DRAM and only marginally slower, but that doesn't seem too
likely, and it certainly isn't the case today or likely to happen
anytime soon.) This means that the volume of writes it can absorb will
be fairly limited. However, it'll still probably be at least somewhat
useful for workloads that sync a lot. The catch is that PCM and
memristors and whatnot don't actually exist yet in useful form, and
while battery-backed DRAM does in principle, such hardware isn't
readily available so the virtues of supporting it are limited.

 - It might also make sense for LFS to assemble segments in "fast"
NVRAM, although the cost of implementing this will be pretty high. It
should be able to make use of the same entity I described above; since
that would eliminate the need to have another one underneath it, it
won't be redundant that way.

 - If anyone ever gets around to merging the Harvard journaling FFS,
which supports external journals, it would be straightforward to put
that journal on an NVRAM device, "fast" or otherwise. WAPBL doesn't
really support this though (AFAIK), and doing it won't solve WAPBL's
other problems (it will probably exacerbate them), so it isn't all
that worthwhile.

 - I don't think there's very much to be gained by trying to integrate
nvram caching with the buffer cache, at least right now. There are
several reasons for this: (1) the gains vs. having it as a separate
caching layer aren't that great; (2) the buffer cache interface has no
notion of persistence, so while it might work for a large
non-persistent flash cache it won't do anything for the problems
you're worried about without a fairly substantial redesign; (3) the
buffer cache code is a godawful mess that needs multiple passes with
torches and pitchforks before trying to extend it; (4) right now the
UBC logic is not integrated with the buffer cache interface so one
would also need to muck with UVM in some of its most delicate and
incomprehensible parts (e.g. genfs_putpages)... and (5) none of this
is prepared to cope with buffers that can't be mapped into system
memory.

 - Note that because zfs has its own notions of disk arrays, and its
own notions of caching, you might be able to add something like this
to zfs more easily, at the cost of it working only with zfs and maybe
interacting badly with the rest of the system.

 > 1- There is no need to use a parity map for RAID 1/10/5/6. Usually
 > the impact is small, but it can be noticeable on busy servers.
 >   a) There is no parity to rebuild. The parity is always up to date.
 > Less down time in case of OS crash / power failure / hardware failure.
 >   b) Better performance for RAID 1/5/6. It isn't necessary to update
 > the parity map because it doesn't exist.

Remember that you still need to write metadata equivalent to the
parity map to the NVRAM, because you have to be able to recover the
NVRAM and figure out what's on it after crashing.
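
To make that concrete, here is roughly what such a record might look
like. The layout and the nvcache_logrec name are assumptions for
illustration, not a proposed format.

/*
 * Hypothetical on-NVRAM record, purely illustrative.  Each cached
 * write costs at least two NVRAM writes: the data payload itself plus
 * one of these records saying where the payload belongs, which is
 * what lets the cache contents be reconstructed after a crash.
 */
#include <stdint.h>

struct nvcache_logrec {
        uint64_t lr_seq;        /* sequence number, for replay ordering */
        uint64_t lr_devblk;     /* target block on the underlying device */
        uint32_t lr_len;        /* payload length in bytes */
        uint32_t lr_cacheoff;   /* offset of the payload in the data area */
        uint64_t lr_csum;       /* checksum, so torn records can be skipped */
};

/*
 * Write path, in outline:
 *   1. copy the payload into the cache data area       (NVRAM write #1)
 *   2. append an nvcache_logrec describing the payload (NVRAM write #2)
 * Recovery walks the records, skips invalid ones, and rebuilds the
 * map of which device blocks currently live in the cache.
 */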

 > 2- For scattered writes contained in the same slice, it reduces the
 > number of writes. With RAID 5/6 there is an advantage: the parity is
 > written only once for several writes in the same slice, instead of
 > once for every write in the same slice.

Also, is the cache itself a RAID? If so, you have the same problem
recursively; if not, you lose the redundant mirroring until the data
is transferred out. Maybe the cache can always be a RAID 1 though.

 > A- It must be able to obtain the raid configuration of the raid device
 > backing the writeback cache. If it is a RAID 0/1 it will cache
 > portions of the size of the interleave. If it is RAID 5/6 it will
 > cache the size of a full slice.

As I said above, I think the way this ought to work is by the raid code
calling into the nvram cache, not the other way around.

 > B- It can use the buffer cache to avoid read/write cycles, and do
 > only writes if the data to be read is already in memory.

I don't think that makes sense.

 > C- Several devices can share the same write-back cache device ->
 > optimal and easy to configure. There is no need to hard-partition an
 > NVRAM device into smaller devices, with one partition over-used and
 > another under-used.

That adds a heck of a lot of complexity. Remember you need to be able
to recover the NVRAM after crashing.
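
For illustration, recovering a shared cache might look roughly like
this, building on the record sketched above. The extra lr_backingdev
field, struct nvcache_map, nvcache_csum_ok() and nvcache_map_update()
are all hypothetical pieces assumed for the sketch.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct nvcache_map;             /* hypothetical per-device block map */

/*
 * When several devices share one cache, each record also has to say
 * which backing device it belongs to, and recovery has to rebuild one
 * map per backing device before anything can be flushed out.
 */
struct nvcache_logrec_shared {
        uint64_t lr_seq;        /* replay ordering, as before */
        uint64_t lr_devblk;     /* block number on the backing device */
        uint32_t lr_backingdev; /* which backing device this is for */
        uint32_t lr_len;        /* payload length in bytes */
        uint32_t lr_cacheoff;   /* payload offset in the data area */
        uint32_t lr_csum;       /* checksum so torn records are skipped */
};

/* Hypothetical helpers, assumed to exist for the sketch. */
bool nvcache_csum_ok(const struct nvcache_logrec_shared *);
void nvcache_map_update(struct nvcache_map *, uint64_t devblk,
    uint32_t cacheoff, uint64_t seq);

static void
nvcache_recover_shared(const struct nvcache_logrec_shared *log,
    size_t nrec, struct nvcache_map **maps /* indexed by lr_backingdev */)
{
        size_t i;

        for (i = 0; i < nrec; i++) {
                const struct nvcache_logrec_shared *lr = &log[i];

                if (!nvcache_csum_ok(lr))
                        continue;       /* torn or stale record */
                /* the newest lr_seq for a given (device, block) wins */
                nvcache_map_update(maps[lr->lr_backingdev], lr->lr_devblk,
                    lr->lr_cacheoff, lr->lr_seq);
        }
}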

 > These are the three options proposed by Thor. I would like to know
 > which option you think is best:
 > 
 > 1- Implement the write-back cache in a generic pseudodisk. This
 > pseudodisk is attached on top of a raid/disk/etc. This is also the
 > option suggested by Greg.
 > 
 > It seems to be the option most recommended in the previous thread.
 > 
 > 2- Add this to Raidframe.
 > 
 > Is it easier to implement/integrate with Raidframe? The raid
 > configurations are contained in the same driver.
 > 
 > It can be easier for a sysadmin to configure: fewer
 > devices/commands, and it is not prone to corruption errors: there
 > isn't both a device with a write-back cache and the same device
 > without the write-back cache.
 > For non-raid devices it can be used as a raid 0 of one disk.
 > 
 > 3- LVM. I don't see any special advantage in this option.

See above. I think what I suggested is a mixture of (1) and (2) and
preferable to either.

-- 
David A. Holland
dholland%netbsd.org@localhost

