tech-kern archive


Re: WAPBL/cache flush and mfi(4)



On Fri, Aug 24, 2012 at 10:38:50PM +0200, Manuel Bouyer wrote:
> On Fri, Aug 24, 2012 at 04:26:07PM -0400, Thor Lancelot Simon wrote:
> > > I think in this case you have to flush both: if you flush only the
> > > disks, the data you want to be on stable storage may still be in the
> > > controller's cache.
> > 
> > That doesn't make sense to me.  If you consider the controller cache
> > to be stable storage, then you clearly need to flush only the disks'
> > caches for all the data expected to be in stable storage to actually
> > be in stable storage.
> 
> Imagine the following scenario:
> - wapbl writes to its journal.
> - mfi(4) sends the write to the controller, which keeps it in its
>   (battery-backed) cache and returns completion of the command
> - wapbl requests a cache flush
> - mfi(4) translates this to a disk cache flush (but not controller cache
>   flush).
> - the controller sends a cache flush to disk. At this time, the data wapbl
>   cares about is still in the controller's cache
> - some time later, the controller flushes its data to disks. Now the
>   data from wapbl is in the unsafe disk caches, and not in the controller
>   cache any more.
> 
> So you still need to flush the controller's cache before the disks' caches,
> otherwise data can migrate from safe storage to unsafe one.

Will a controller really empty its cache into the attached disks'
caches, or will it issue the disk writes, wait for the disks to
acknowledge that the data is on the platter, and then empty the cache?
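The two possibilities can be compared with a toy model.  This is only a
sketch under invented assumptions: Disk, Controller, destage(), and the
wait_for_platter switch are hypothetical names, and nothing here
describes how any real controller or mfi(4) behaves.  It replays the
quoted scenario and asks whether the journal block is still somewhere
safe after the controller empties its cache.

```python
class Disk:
    """A disk with a volatile write cache and non-volatile platters."""
    def __init__(self):
        self.cache = set()    # lost on power failure
        self.platter = set()  # survives power failure

    def write(self, block):
        self.cache.add(block)  # write-back: completes before the platter has it

    def flush(self):
        self.platter |= self.cache
        self.cache.clear()


class Controller:
    """A controller with a battery-backed (safe) cache in front of one disk."""
    def __init__(self, disk, wait_for_platter):
        self.disk = disk
        self.cache = set()
        self.wait_for_platter = wait_for_platter  # hypothetical policy switch

    def write(self, block):
        self.cache.add(block)  # completion is reported to the host here

    def flush_disk_only(self):
        self.disk.flush()  # a disk cache flush that bypasses the controller cache

    def destage(self):
        for block in sorted(self.cache):
            self.disk.write(block)
        if self.wait_for_platter:
            self.disk.flush()  # keep the safe copy until the platter has the data
        self.cache.clear()


def safe_blocks(ctrl):
    """Blocks that would survive a power failure right now."""
    return ctrl.cache | ctrl.disk.platter


def run_scenario(wait_for_platter):
    ctrl = Controller(Disk(), wait_for_platter)
    ctrl.write("journal")    # wapbl writes to its journal
    ctrl.flush_disk_only()   # the flush request reaches only the disk
    ctrl.destage()           # some time later, the controller empties its cache
    return "journal" in safe_blocks(ctrl)
```

In this model run_scenario(False) leaves the journal block only in the
disk's volatile cache, which is the migration from safe to unsafe
storage described in the quoted text, while run_scenario(True) keeps it
safe at every step.
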

I have the following vague idea in mind for how an operating system
should treat disk writes: it seems to me that our disk subsystem(s)
should treat streams of disk writes kind of like TCP sessions in
that the "receiver", which is either an instance of some disk driver
(e.g., sd(4)) or a non-volatile cache, tells the "sender" (some user
process that write(2)s, a filesystem, or the pager) that it is open to
receive up to X megabytes.  The sender sends the receiver X megabytes'
worth of bufs, but holds onto a copy of the bufs itself until each is
acknowledged.  Ordinarily an acknowledgement will come back saying "you
may go ahead and send me Y more kilobytes, sender".  A sender may also
get a NACK ("sorry, the backup disk was unplugged before it acknowledged
that buffers P, Q, and R hit the media"); then it has to indicate the
exception or else retransmit the buffers.
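To make the idea a little more concrete, here is a rough sketch of the
sender's side of such a session.  Everything in it is an assumption
made for illustration: the Sender class, its method names, and the use
of byte counts for the window are all hypothetical, and no such
interface exists in the kernel today.

```python
from collections import OrderedDict


class Sender:
    """Holds a copy of each buf until the receiver acknowledges it."""

    def __init__(self, window_bytes):
        self.window = window_bytes    # bytes the receiver is open to receive
        self.unacked = OrderedDict()  # buf id -> data, retained for retransmit
        self.next_id = 0

    def send(self, data):
        """Send a buf; return its id, or None if the window is too small."""
        if len(data) > self.window:
            return None
        buf_id = self.next_id
        self.next_id += 1
        self.unacked[buf_id] = data   # keep our copy until it is acknowledged
        self.window -= len(data)
        return buf_id

    def ack(self, buf_id, more_bytes):
        """The receiver has the buf on the media and opens the window further."""
        self.unacked.pop(buf_id)
        self.window += more_bytes

    def nack(self, buf_ids):
        """The receiver lost these bufs; return the retained copies."""
        return [(i, self.unacked[i]) for i in buf_ids]
```

A caller that gets copies back from nack() can either retransmit them
or report the exception upward.  The sketch leaves that choice to the
caller because the right answer differs between, say, a pager and a
filesystem.
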

Here and there in the system you will have software (a filesystem) or
hardware (a battery-backed cache) that "proxies" disk-write streams.  A
filesystem will "proxy" because it's probably going to either serialize
writes (say to write them to a journal) or to augment them (say to
update corresponding metadata).  Typically a filesystem will proxy, too,
because we don't expect a user process to block in write(2) until
all the bytes written have landed on the platter.  A battery-backed
cache will proxy because it's going to guarantee disk-write completion
to the sender.
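A proxy in this sense might look something like the following sketch,
which models a cache that acknowledges a buf to the sender as soon as
it holds a copy, and drops the copy only once the downstream receiver
acknowledges it in turn.  CacheProxy and its methods are hypothetical
names, and the capacity check stands in for whatever window the proxy
would advertise.

```python
class CacheProxy:
    """Acknowledges bufs upstream while still owing them downstream."""

    def __init__(self, capacity):
        self.capacity = capacity  # bytes of non-volatile cache
        self.held = {}            # buf id -> data not yet acknowledged downstream
        self.forwarded = []       # stand-in for bufs passed on to the disk

    def receive(self, buf_id, data):
        """Accept a buf from the sender; True means it is acknowledged."""
        used = sum(len(d) for d in self.held.values())
        if used + len(data) > self.capacity:
            return False          # window closed: the sender keeps its copy
        self.held[buf_id] = data
        self.forwarded.append(buf_id)
        return True               # the proxy now owns the completion guarantee

    def downstream_ack(self, buf_id):
        """The disk reports the buf is on the platter; the copy can go."""
        del self.held[buf_id]
```
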

I have the following doubt about a battery-backed cache: what if I
yank the disk?  I have never met a controller with battery-backed
cache where I could not pull some of the disks right out of the front
of the chassis.  I guess that usually those disks were redundant,
too.  So, what if I yank two disks? :-) It seems like receivers and
proxy receivers ought to advertise the guarantees that they do and do
not make (e.g., "I guarantee that barring disk-yankage, I will put
your bytes on the platter" OR "barring power failure or disk-yankage
and non-replacement, I will put your bytes on the platter"), and
senders' requirements ought to be matched to receivers' guarantees when a
disk-write session is established.
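One way to picture the matching is as a set of failure modes that each
receiver promises to survive, checked against the set a sender
requires.  The sketch below is illustrative only: the Survives flags
and can_establish() are invented names, and a real scheme would need
finer distinctions (how many disks yanked, whether they are replaced,
and so on).

```python
from enum import Flag, auto


class Survives(Flag):
    """Failure modes a receiver promises will not lose acknowledged bytes."""
    POWER_FAILURE = auto()
    DISK_YANK = auto()


def can_establish(required, offered):
    """A session is acceptable if every required guarantee is offered."""
    return (required & offered) == required


# Hypothetical receivers, for illustration only.
volatile_disk_cache = Survives(0)                 # guarantees nothing
battery_backed_cache = Survives.POWER_FAILURE     # "barring disk-yankage"
redundant_battery_backed = Survives.POWER_FAILURE | Survives.DISK_YANK
```

A journaling filesystem that requires Survives.POWER_FAILURE would then
be refused a session with volatile_disk_cache but accepted by either of
the other two.
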

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981

