tech-kern archive


Re: Proposal: B_ARRIER (addresses wapbl performance?)



On Tue, Dec 09, 2008 at 04:01:56PM -0500, Thor Lancelot Simon wrote:
> On Tue, Dec 09, 2008 at 12:56:27PM -0800, Jason Thorpe wrote:
> >
> > On Dec 9, 2008, at 12:23 PM, Manuel Bouyer wrote:
> >
> >> Sure. And you don't care much about command completion either, as
> >> long as the write to disk happen in order.
> >
> > And the only way to ensure that is using FUA (or explicit cache  
> > flushing).
> 
> But to enforce that using FUA, you have to set FUA on *every* write.
> 
> And I do not understand the "only way" in your remark above.  Under
> the constraint "if WCE is set in the cache control page" I agree that
> it is correct.  But as far as I can tell, if WCE is *not* set, it is
> not in fact the case that using FUA is the "only way" to ensure that
> writes are committed to stable storage in order -- because the tag
> ordering rules require ordered tags to complete, um, well, in-order,
> and if WCE is not set, commands are not supposed to complete until
> the bits are on oxide.
> 
> I am sure I am misunderstanding something.  What?

An implicit desire to have it all be performant and an implicit 
expectation that file data won't be journaled. And how clients will 
behave.

Writing client (user) code that keeps a lot of data in flight isn't easy. 
It of course can be (and often is) done, but it's not trivial. There are 
times when other issues make it very hard to up concurrency. Real life has 
shown that an effective solution is to turn on the disk cache. This 
fibbing lets the client run faster and it also lets the disk write faster 
(since it can re-order based on where the head is over the platter). And 
for issues other than metadata, this fibbing doesn't seem to hurt much.

So turning the write cache off is painful. Thus we will be best-served to 
figure out how to live with it on. Windows can, so we should be able to 
too. :-)

As to why tagging is way too painful, the problem is that tagged queuing 
only permits one ordering. This isn't too bad with a single process 
generating metadata. But it becomes VERY painful if you have concurrent 
allocations going on in different files & parts of the file system.

Let's consider allocating blocks to a file that's gotten into the single
indirect block. Adding a block means updating the free bit map, the inode 
itself, and the indirect block. That means one write to the journal, then 
three writes to non-journal.

Let's also say we're using FUA on the metadata writes not to the journal 
so we can easily know when it's safe to overwrite the journal. I don't 
think we're really doing this yet....

So adding a block means we have one FUA write, wait for it to finish, then 
issue three FUA writes and track when they finish.
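
A rough sketch of that sequence, in Python rather than kernel code. The
function name and the write labels are made up for illustration; only the
"one journal write, wait, then three metadata writes" shape comes from the
description above.

```python
def add_block_schedule():
    """Return the write phases for adding one block to a file that is
    already using its single indirect block.

    Each phase is a list of writes that may be in flight together.
    A phase has to finish before the next one is issued.
    """
    # Phase 1: the journal record describing the update.
    journal_phase = ["journal record (FUA)"]
    # Phase 2: the three in-place metadata updates named in the text.
    metadata_phase = [
        "free block bitmap (FUA)",
        "inode (FUA)",
        "indirect block (FUA)",
    ]
    return [journal_phase, metadata_phase]


if __name__ == "__main__":
    for number, phase in enumerate(add_block_schedule(), start=1):
        print(f"phase {number}: {len(phase)} write(s), any completion order")
        for write in phase:
            print("    " + write)
```

The only ordering the host cares about here is between the two phases.
Inside the second phase the three writes carry no ordering requirement.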

Without tagged queuing, those three writes can complete in any order. The 
disk can do whatever it feels is best.

To be honest, we still haven't won much w/ FUA vs tagged queuing.

So now let's make it twenty processes, each adding a block to its own file.

Now we have 20 FUA journal writes (many of which will get burst out in the 
same update), then 60 FUA writes all over the disk, then 20 more FUA 
journal writes (for the next batch of writes) and 60 more FUA metadata 
writes.

Each batch of 20 journal writes and the preceding 60 metadata writes can 
complete in whatever order works best for the disk. If we use untagged 
commands + FUA, we can do this quite well. If we do tagging, we come up 
with our own order and suffer if it's not what's best for the disk.
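
To put a rough number on "suffer", here is a toy model, assuming seek cost
is just the distance the head travels between block addresses. It ignores
rotational latency, and a real drive uses its own scheduling rather than a
plain sort. The names head_travel and compare, the disk size, and the seed
are all made up for illustration.

```python
import random


def head_travel(addresses, start=0):
    """Total distance the head moves when servicing the addresses
    in exactly the order given."""
    position, total = start, 0
    for address in addresses:
        total += abs(address - position)
        position = address
    return total


def compare(n_writes=60, disk_size=1_000_000, seed=1):
    """Return (travel in host order, travel in disk-chosen order)
    for n_writes writes scattered at random over the disk."""
    rng = random.Random(seed)
    submitted = [rng.randrange(disk_size) for _ in range(n_writes)]
    # Ordered tags: the disk must follow the order the host picked.
    in_host_order = head_travel(submitted)
    # No ordering constraint: the disk can service them in one sweep.
    reordered = head_travel(sorted(submitted))
    return in_host_order, reordered


if __name__ == "__main__":
    host, disk = compare()
    print("host-imposed order:", host)
    print("disk-chosen order: ", disk)
```

In this model the single sweep can never travel further than the
host-imposed order, since any order has to reach the farthest address
anyway. How much worse the host order is depends on how scattered the
writes are.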

To make it even more fun, let's say we have 60 processes, each allocating 
to its own file, and not all of them end up writing at each instant. Say 20 of 
them write at about the same time. So we have 20 journal writes, then 60 
around-the-disk writes. Then we have 20 journal writes and 60 
most-likely-other disk writes around the disk. Then we have 20 more 
journal writes and another 60 writes around the disk. It really doesn't 
matter what order the now-180 writes complete in, just that they complete. 
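
For the record, the arithmetic behind those counts, as a small sketch. It
assumes every allocation costs one journal write plus the three metadata
writes described earlier; batch_counts is a made-up helper name.

```python
# Bitmap, inode, and indirect block for each allocated block.
WRITES_PER_ALLOCATION = 3


def batch_counts(writers_per_batch, batches):
    """Return (journal writes, metadata writes) after the given number
    of batches, with one allocation per writer per batch."""
    journal = writers_per_batch * batches
    metadata = writers_per_batch * WRITES_PER_ALLOCATION * batches
    return journal, metadata


if __name__ == "__main__":
    for batches in (1, 2, 3):
        journal, metadata = batch_counts(20, batches)
        print(f"{batches} batch(es): {journal} journal, {metadata} metadata")
```

Three batches of 20 writers give 60 journal writes and 180 metadata
writes, which is where the 180 above comes from.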

That's the problem with using tagged commands. We will suck when we try to
get lots of concurrency out of the disks. And then users will use
something other than NetBSD.

Take care,

Bill



