tech-kern archive


Re: What is the best layer/device for a write-back cache based in nvram?



Holland, thank you for your answers.

On Sat, Sep 10, 2016 at 7:41 PM, David Holland <dholland-tech%netbsd.org@localhost> wrote:
> On Fri, Sep 09, 2016 at 11:09:49PM +0200, Jose Luis Rodriguez Garcia wrote:
>
> It sounds like you've already decided what layer it appears in, if
> it's going to be used as a block device.
Is there another option, such as a character device, for disks? Sorry,
I don't understand.

>I guess your question is
>whether it should be integrated into raidframe or sit on top of it?
>My recommendation would be to make a separate entity that's the cache,
>and then add a small amount of code to raidframe to call sideways into
>it when needed. Then you can also add similar code to non-raidframe ld
>or wd/sd devices and you don't end up exposing raidframe internals.

I was thinking of sitting on top of raidframe/others, not under them.
I think it is easier to do optimizations there, such as coalescing
writes: for example, two writes of 512 bytes in the same
chunk/interleave unit when many disks have a block size of 4 KB.
If it isn't on top, I would have to create a hook for every write/read
in every disk driver.
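
To make the coalescing idea concrete, here is a rough sketch (all
names and sizes are invented for illustration, this is not raidframe
code): two 512-byte writes that fall into the same 4 KB line of the
cached interleave unit are merged in memory and later flushed as a
single aligned write.

#include <sys/types.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_SIZE 4096            /* physical block size of the disks */
#define SECTOR_SIZE     512

/* One 4 KB line of the cached interleave unit. */
struct cache_line {
        daddr_t  base;                  /* device address, 4 KB aligned */
        uint8_t  data[CACHE_LINE_SIZE];
        uint32_t valid;                 /* bitmap of valid 512-byte sectors */
};

/*
 * Merge a 512-byte write into the line instead of sending it to the
 * RAID immediately.  When the line is flushed, the two small writes
 * become a single 4 KB write and the read-modify-write on the disk
 * is avoided.
 */
static void
cache_merge_write(struct cache_line *cl, daddr_t blkno, const void *buf)
{
        unsigned idx = (unsigned)(blkno - cl->base);

        memcpy(cl->data + idx * SECTOR_SIZE, buf, SECTOR_SIZE);
        cl->valid |= 1u << idx;
}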

Also I would like to do other optimizations for raidframe (see the
x+1 example below in this mail) for the case of RAID 5/6, to avoid
some reads.

I think that if it sits on top it would be less intrusive in the other drivers.

What are your reasons for thinking it is better for the cache to sit
under the disk devices?

Also, I have decided it will be a separate driver, not integrated in
raidframe, because that is the preferred solution at tech-kern.

> Some other thoughts:
>
>  - What you're talking about seems like it only really makes sense if
> you have "fast" NVRAM, like PCM (when/if it ever appears in the market
> for real) or battery-backed DRAM. Or maybe if you're using flash in
> front of a small number of spinny-disks. Otherwise the cost of writing
> to the cache, then reading from the cache and writing back to the
> underlying device, is likely to outweigh whatever you save by avoiding
> excess parity I/O.
The list of benefits wasn't ordered by importance. This point alone
doesn't justify creating the driver. My motivations are LFS, reducing
write latency (this may be the major benefit for most people), and
reducing the number of I/Os in RAID 5/6.

I think it is possible to achieve these goals with spinny disks. For
other types of disks, it will have to be tested whether there is an
advantage.

>
>  - If you want your NVRAM cache to be recoverable (which sounds like
> it's the point) you need to write enough logging data to it to be able
> to recover it. This effectively means doing two writes for every write
> you cache: one with the data and one to remember where the data is.
> You can conceivably batch the metadata writes, but batching those
> suffers from the same issues (small write chunks, frequent syncs,
> etc.) that you're trying to avoid in the RAID so you can't expect it
> to work very well. If both the cache and the RAID are flash, three
> extra I/Os for every block means you have to save at least three I/Os
> in the RAID for every block; that is not likely to be the case. Maybe
> the transfer from the cache to the RAID is invisible and you only need
> to save two, but that still doesn't seem that likely.
>
My principal motivation is caching writes. I think a cache of between
several megabytes and a few gigabytes will be OK.

Because not much memory is used in the NVRAM (or whatever device is
used), the same content of the NVRAM can also be kept in the RAM of
the server. The NVRAM would then only be read to do a "recover" after
a crash; in normal use there are only writes to the NVRAM.

Then instead of three I/Os it is two I/Os, and I think two I/Os to a
fast PCIe device will be faster than one I/O to a spinny disk.
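
The write path I have in mind would look roughly like this (a
hypothetical sketch; wbcache_*, nvram_write() and the descriptor
handling are invented names, not anything in the tree): each cached
write goes once to the NVRAM for the data and once for its descriptor,
plus a memcpy into a RAM mirror, so normal reads and the later
write-back never touch the NVRAM.

#include <sys/types.h>
#include <string.h>

/* Hypothetical per-buffer state: the same data lives in NVRAM (for
 * crash recovery) and in ordinary RAM (for reads and write-back). */
struct wbcache_entry {
        off_t   nvram_data_off;         /* data location in the NVRAM */
        off_t   nvram_desc_off;         /* descriptor location in the NVRAM */
        void   *ram_copy;               /* mirror in kernel RAM */
        size_t  len;
};

/* Provided by the (hypothetical) NVRAM driver. */
void nvram_write(off_t off, const void *buf, size_t len);

static void
wbcache_write(struct wbcache_entry *e, const void *data, size_t len,
    const void *desc, size_t desclen)
{
        nvram_write(e->nvram_data_off, data, len);      /* NVRAM I/O #1 */
        nvram_write(e->nvram_desc_off, desc, desclen);  /* NVRAM I/O #2 */
        memcpy(e->ram_copy, data, len);                 /* RAM mirror, no I/O */
        e->len = len;
        /* The original write can be completed here; the flush to the
         * RAID is done later from ram_copy, so the NVRAM is only ever
         * read by the recovery code after a crash. */
}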

I was thinking of using PCIe cards: they can be NVRAM memory/NVMe
(4)... There are several types of devices, but I would like to discuss
this in another thread, once I have resolved my doubts in this one.

>  - The chief benefit of using flash as a frontend cache for spinny
> disks turns out to be that flash is much larger than main memory. This
> works fine if the cache is treated as expendable in crashes (and also
> can be used safely with consumer-grade SSDs that munge themselves when
> the power goes off)... making the cache persistent is expensive and
> only helps with write-heavy workloads that otherwise saturate the
> write bandwidth or that sync a lot. I guess the latter is what you're
> after... but it will still only help in front of spinny disks.
>
Yes, that is the main benefit. I would like NetBSD to be able to
handle big RAIDs of 12 disks in RAID 6. I know that RAID 6 isn't
stable and there is a GSoC proposal to test it.


>  - It might also make sense for LFS to assemble segments in "fast"
> NVRAM, although the cost of implementing this will be pretty high. It
> should be able to make use of the same entity I described above -- as
> this should eliminate the need to have another one underneath it, it
> won't be redundant that way.
As I understand it, LFS writes partial segments because of fsync, and
at the end writes out the space of a full segment. The full slice of a
RAID 5/6 could be cached.
>
>  - If anyone ever gets around to merging the Harvard journaling FFS,
> which supports external journals, it would be straightforward to put
> that journal on an NVRAM device, "fast" or otherwise. WAPBL doesn't
> really support this though (AFAIK) and doing it won't solve WAPBL's
> other problems (will probably exacerbate them) so isn't all that
> worthwhile.
I hadn't heard about it. Could you provide some links about it, out of
curiosity?
>
>  - I don't think there's very much to be gained by trying to integrate
> nvram caching with the buffer cache, at least right now. There are
> several reasons for this: (1) the gains vs. having it as a separate
> caching layer aren't that great; (2) the buffer cache interface has no
> notion of persistence, so while it might work for a large
> non-persistent flash cache it won't do anything for the problems
> you're worried about without a fairly substantial redesign; (3) the
> buffer cache code is a godawful mess that needs multiple passes with
> torches and pitchforks before trying to extend it; (4) right now the
> UBC logic is not integrated with the buffer cache interface so one
> would also need to muck with UVM in some of its most delicate and
> incomprehensible parts (e.g. genfs_putpages)... and (5) none of this
> is prepared to cope with buffers that can't be mapped into system
> memory.
>
This wasn't in "my list" of main priorities for this driver. If it is
difficult to do, I can do a "read cache" for the blocks of the chunks
stored in the NVRAM. The drawback is that a chunk could already be in
the buffer cache before the write and be missed.

I repeat, this is for the case of RAID 5 with an interleave size of 4
blocks: for a write of 512 bytes to block x+0 of a column, if I have
cached the reads of blocks x+1, x+2 and x+3, I will only have to do
1 read of parity + 1 write of data + 1 write of parity, instead of
1 read of data for x+1, x+2, x+3 + 1 read of parity + 1 write of data
+ 1 write of parity.
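
In code form the accounting above looks like this (purely illustrative,
it just restates the counts and is not raidframe's actual write path):

/* I/O count for updating block x+0 of a RAID 5 column with a 4-block
 * interleave unit, following the accounting above. */
static int
raid5_small_write_ios(int peers_cached)
{
        int ios = 0;

        if (!peers_cached)
                ios += 1;       /* read x+1, x+2, x+3 to recompute parity */
        ios += 1;               /* read old parity */
        ios += 1;               /* write data block x+0 */
        ios += 1;               /* write new parity */

        return ios;             /* 4 without the cache, 3 with it */
}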


Anyway, I think that it will be for phase 2.


>
>  > A- It must be able to obtain the raid configuration of the raid device
>  > backing the writeback cache. If it is a RAID 0/1 it will cache
>  > portions of the size of the interleave. If it is RAID 5/6 it will
>  > cache the size of a full slice.
>
> As I said above I think the way this ought to work is by the raid code
> calling into the nvram cache, not the other way around.
>

>  > B- It can use the buffer cache for avoid read/write cycles, and do
>  > only writes if the data to be read is in memory.
>
> I don't think that makes sense.
It is the case that I describe above for RAID 5/6, to avoid the reads
of x+1, x+2, x+3.

>
>  > C- Several devices can share the same write back-cache device ->
>  > optimal and easy to configure. There is not need to hard partitioning
>  > a NVRAM device in smaller devices with one partition over-used and
>  > other infra-used.
>
> That adds a heck of a lot of complexity. Remember you need to be able
> to recover the NVRAM after crashing.
>
The schema of the buffers stored in NVRAM could be something like this:

struct buffer_descriptor {
        dev_t    device;        /* device the cached slice belongs to */
        daddr_t  address;       /* address of the slice/interleave unit */
        bitmap_t cached_blocks; /* indicates which blocks have cached writes */
};

The NVRAM always stores the full slice/interleave unit; the bitmap
indicates which blocks are valid. For example, there are 1000
buffer_descriptors and 1000 buffers.

I don't see it as difficult to recover, even if the cache is shared
between several devices. When a device is attached to the cache, it
does the recovery as a first step and recovers its cached buffers.
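
A recovery pass along those lines could look like this (a hypothetical
sketch using the descriptor layout above; nvram_read_descriptor(),
nvram_read_buffer(), bitmap_isset() and write_block() are invented
helpers): when a device attaches, walk the descriptor array, pick the
entries that belong to that device, and replay the blocks marked valid
in the bitmap.

#define NDESC            1000           /* descriptors (and buffers) in NVRAM */
#define BLOCKS_PER_SLICE 128            /* illustrative slice size in blocks */
#define BLOCK_SIZE       512

/* Hypothetical helpers provided by the cache/NVRAM code. */
void nvram_read_descriptor(int idx, struct buffer_descriptor *d);
void nvram_read_buffer(int idx, void *buf);
int  bitmap_isset(bitmap_t bm, int bit);
void write_block(dev_t dev, daddr_t blkno, const void *buf);

static void
wbcache_recover(dev_t dev)
{
        struct buffer_descriptor desc;
        char slice[BLOCKS_PER_SLICE * BLOCK_SIZE];
        int i, b;

        for (i = 0; i < NDESC; i++) {
                nvram_read_descriptor(i, &desc);
                if (desc.device != dev)
                        continue;               /* belongs to another device */

                nvram_read_buffer(i, slice);    /* the cached slice data */

                /* Replay only the blocks marked valid in the bitmap. */
                for (b = 0; b < BLOCKS_PER_SLICE; b++) {
                        if (bitmap_isset(desc.cached_blocks, b))
                                write_block(dev, desc.address + b,
                                    slice + b * BLOCK_SIZE);
                }
        }
}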

Is there some big complexity in sharing the cache between several
devices, in autoconf, etc.?

