On Tue, Dec 09, 2008 at 04:01:56PM -0500, Thor Lancelot Simon wrote:
> On Tue, Dec 09, 2008 at 12:56:27PM -0800, Jason Thorpe wrote:
> >
> > On Dec 9, 2008, at 12:23 PM, Manuel Bouyer wrote:
> >
> >> Sure. And you don't care much about command completion either, as
> >> long as the write to disk happen in order.
> >
> > And the only way to ensure that is using FUA (or explicit cache
> > flushing).
>
> But to enforce that using FUA, you have to set FUA on *every* write.
>
> And I do not understand the "only way" in your remark above. Under
> the constraint "if WCE is set in the cache control page" I agree that
> it is correct. But as far as I can tell, if WCE is *not* set, it is
> not in fact the case that using FUA is the "only way" to ensure that
> writes are committed to stable storage in order -- because the tag
> ordering rules require ordered tags to complete, um, well, in-order,
> and if WCE is not set, commands are not supposed to complete until
> the bits are on oxide.
>
> I am sure I am misunderstanding something. What?

An implicit desire to have it all be performant and an implicit
expectation that file data won't be journaled. And how clients will
behave.

Writing client (user) code that keeps a lot of data in flight isn't
easy. It of course can be (and often is) done, but it's not trivial.
There are times when other issues make it very hard to up concurrency.

Real life has shown that an effective solution is to turn on the disk
cache. This fibbing lets the client run faster and it also lets the
disk write faster (since it can re-order based on where the head is
over the platter). And for issues other than metadata, this fibbing
doesn't seem to hurt much.

So turning the write cache off is painful. Thus we will be best-served
to figure out how to live with it on. Windows can, so we should be able
to too. :-)

As to why tagging is way too painful, the problem is that tagged
queuing only permits one ordering.
This isn't too bad with a single process generating metadata. But it
becomes VERY painful if you have concurrent allocations going on in
different files & parts of the file system.

Let's consider allocating blocks to a file that's gotten into the
single indirect block. Adding a block means updating the free bitmap,
the inode itself, and the indirect block. That means one write to the
journal, then three writes to non-journal. Let's also say we're using
FUA on the metadata writes that don't go to the journal, so we can
easily know when it's safe to overwrite the journal. I don't think
we're really doing this yet....

So adding a block means we have one FUA write, wait for it to finish,
then issue three FUA writes and track when they finish. Without tagged
queuing, those three writes can complete in any order. The disk can do
whatever it feels is best. To be honest, with a single process we
still haven't won much w/ FUA vs tagged queuing.

So now let's make it twenty processes adding blocks to 20 files, one
file each. Now we have 20 FUA journal writes (many of which will get
burst out in the same update), then 60 FUA writes all over the disk,
then 20 more FUA journal writes (for the next batch of writes) and 60
more FUA metadata writes. Each batch of 20 journal writes and the
preceding 60 metadata writes can complete in whatever order works best
for the disk. If we use untagged commands + FUA, we can do this quite
well. If we do tagging, we come up with our own order and suffer if
it's not what's best for the disk.

To make it even more fun, let's say we have 60 processes allocating to
60 files, and not all of them end up writing each instant. Say 20 of
them write at about the same time. So we have 20 journal writes, then
60 around-the-disk writes. Then we have 20 journal writes and 60 writes
that most likely land elsewhere on the disk. Then we have 20 more
journal writes and another 60 writes around the disk. It really doesn't
matter what order the now-180 writes complete in, just that they
complete.
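To put a rough number on the "suffer if it's not what's best for the
disk" part, here is a toy sketch of one batch of 60 metadata writes
from the 20-process example. Everything in it is an assumption made for
illustration: block addresses are linear, head travel is the only
cost, the head starts at block 0, and "what the disk would pick" is
modelled as a single ascending sweep. The names (head_travel,
simulate_batch) are made up and nothing here is NetBSD code.

```python
import random

# Per-allocation metadata writes from the example above:
# free bitmap, inode, indirect block.
WRITES_PER_ALLOCATION = 3


def head_travel(start, blocks):
    """Total distance the head moves visiting blocks in the given order."""
    travel = 0
    pos = start
    for block in blocks:
        travel += abs(block - pos)
        pos = block
    return travel


def simulate_batch(processes=20, disk_blocks=1_000_000, seed=0):
    """Compare head travel for one batch of scattered metadata writes.

    Returns (number of writes, travel in issue order, travel in swept order).
    """
    rng = random.Random(seed)
    # Each process contributes three writes at arbitrary places on the disk.
    batch = [rng.randrange(disk_blocks)
             for _ in range(processes * WRITES_PER_ALLOCATION)]
    # Ordered tags: the disk has to complete writes in the order we chose.
    issue_order = head_travel(0, batch)
    # FUA without ordering: the disk is free to pick its own order;
    # here that is assumed to be one ascending sweep.
    swept_order = head_travel(0, sorted(batch))
    return len(batch), issue_order, swept_order


if __name__ == "__main__":
    count, issue_order, swept_order = simulate_batch()
    print(count, "writes; issue order:", issue_order, "swept:", swept_order)
```

In this model the swept order can never be worse than the issue order,
since the head has to reach the farthest block either way. Real drives
have rotational latency and their own scheduling, so treat the gap as
an illustration of the argument and not as a measurement.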

That's the problem with using tagged commands. We will suck when we try
to get lots of concurrency out of the disks. And then users will use
something other than NetBSD.

Take care,

Bill