tech-kern: Unified Buffer Cache 1st snapshot

Subject: Unified Buffer Cache 1st snapshot
To: None <tech-kern@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 09/20/1998 21:07:12
hi folks,

herein are contained some items concerning the unified buffer cache
("UBC") code that I've been working on for a while now:
a disclaimer, a description, and a diff.

first the disclaimer:
this is probably about alpha-test quality code.
it boots multiuser on my sparc1 and basic stuff works.
YOU DON'T WANT TO TRUST ANY REAL DATA TO THIS CODE,
IT HASN'T BEEN TESTED TERRIBLY THOROUGHLY.

several important features are missing, and lots of performance
improvements are needed before this will be useful.  we need to change
a uvm interface or two before this will work at all on a non-sparc
(specifically, there needs to be a way to specify an alignment
to uvm_map()).


next the description:
this unified cache design is structured similarly to solaris:
regular file data lives only in the page cache, everything else
(indirect blocks, directories, global meta-data) lives only in the
buffer cache.  (for those of you who remember my pagecache-uber-alles
design, I gave up on that as being too much work for too little gain.
plus, it'd be nice to actually have this be finished someday.)

the interfaces for the kernel to access the page cache are ubc_alloc()
and ubc_release() (a la segmap_getmap() and segmap_release()).
you get a mapping onto the part of the file the user wants to change,
do a uiomove() to copy the user's buffer in, and then release the mapping.
mappings are cached in an LRU fashion.
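
to make the alloc / copy / release pattern concrete, here is a rough
userspace toy of it.  this is not the kernel code and not the real
ubc_alloc()/ubc_release() signatures: every name in it (win_alloc,
win_release, toy_write, WIN_SIZE, NWIN) is made up for illustration,
a plain array stands in for the file's pages, and memcpy() stands in
for uiomove().  it only shows the idea of a small set of mapping
windows that get reused in LRU order.

```c
/* userspace toy only; all names here are hypothetical, not kernel APIs */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WIN_SIZE  8     /* bytes of the file covered by one window */
#define NWIN      2     /* number of windows kept in the cache */
#define FILE_SIZE 64    /* a multiple of WIN_SIZE */

static char backing[FILE_SIZE];     /* stands in for the file's pages */

struct win {
    int valid;              /* nonzero once the window maps something */
    int busy;               /* nonzero while a caller holds the window */
    long base;              /* file offset where the window starts */
    unsigned long stamp;    /* time of last release, used for LRU */
};

static struct win wins[NWIN];       /* zero-initialized: all unused */
static unsigned long clock_tick;
static int remaps;                  /* times a window was re-pointed */

/* get a window covering "off"; *lenp is trimmed to the window's end */
static char *win_alloc(long off, size_t *lenp)
{
    struct win *w = NULL;
    long base;
    int i;

    if (off < 0 || off >= FILE_SIZE)
        return NULL;
    base = off - off % WIN_SIZE;

    /* first look for an idle cached window that already covers "off" */
    for (i = 0; i < NWIN; i++)
        if (wins[i].valid && !wins[i].busy && wins[i].base == base)
            w = &wins[i];

    if (w == NULL) {
        /* otherwise recycle the idle window that was released longest ago */
        for (i = 0; i < NWIN; i++) {
            if (wins[i].busy)
                continue;
            if (w == NULL || wins[i].stamp < w->stamp)
                w = &wins[i];
        }
        if (w == NULL)
            return NULL;    /* every window is in use */
        w->valid = 1;
        w->base = base;
        remaps++;
    }
    w->busy = 1;
    if (*lenp > (size_t)(base + WIN_SIZE - off))
        *lenp = (size_t)(base + WIN_SIZE - off);
    return backing + off;
}

/* give the window back; it stays cached until it is recycled */
static void win_release(char *p)
{
    long off = (long)(p - backing);
    long base = off - off % WIN_SIZE;
    int i;

    for (i = 0; i < NWIN; i++)
        if (wins[i].valid && wins[i].busy && wins[i].base == base)
            break;
    assert(i < NWIN);       /* releasing something never allocated */
    wins[i].busy = 0;
    wins[i].stamp = ++clock_tick;
}

/* write n bytes at "off", one window at a time; returns bytes written */
static size_t toy_write(long off, const char *src, size_t n)
{
    size_t done = 0;

    while (done < n) {
        size_t chunk = n - done;
        char *p = win_alloc(off + (long)done, &chunk);

        if (p == NULL)
            break;
        memcpy(p, src + done, chunk);   /* the kernel would uiomove() here */
        win_release(p);
        done += chunk;
    }
    return done;
}
```

a write that straddles a window boundary just goes around the loop
twice, and a second write to an offset whose window is still cached
reuses that window without bumping the remaps counter.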

there are new VOP interfaces, VOP_GETPAGES() and VOP_PUTPAGES(),
with the definitions mostly lifted from FreeBSD.  these take an array
of pages and perform the requested i/o on them.  pretty straightforward.
I currently have simplistic implementations of these for nfs and ffs.
the other disk-based filesystems should be trivial, the pass-thru
filesystems will need some thought.  (and actually it occurred to me
just yesterday that it would be better to move page allocation from
uvn_get() into VOP_GETPAGES() to make things much easier for nullfs
and similar filesystems.  these will no doubt evolve some more.)
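
as a rough picture of the "array of pages in, i/o done on them" shape,
here is a small userspace sketch.  it is not the VOP_GETPAGES() or
VOP_PUTPAGES() interface from the diff: the names (toy_page, toy_file,
toy_getpages, toy_putpages), the tiny page size, and the choice to
zero-fill the part of a page past EOF are all assumptions made for the
example, and a char array stands in for the disk or server.

```c
/* userspace sketch only; all names here are hypothetical, not the VOP API */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PG_SIZE 4   /* toy page size in bytes */

struct toy_page {
    long offset;            /* byte offset of the page within the file */
    char data[PG_SIZE];
    int dirty;              /* nonzero if data is newer than the store */
};

struct toy_file {
    char *store;            /* backing store: stands in for disk or server */
    size_t size;            /* file size in bytes */
};

/* a page must start on a page boundary at a non-negative offset */
static int page_ok(const struct toy_page *pg)
{
    return pg->offset >= 0 && pg->offset % PG_SIZE == 0;
}

/* number of bytes of this page that lie before EOF */
static size_t valid_bytes(const struct toy_file *f, const struct toy_page *pg)
{
    size_t off = (size_t)pg->offset;
    size_t n;

    if (off >= f->size)
        return 0;
    n = f->size - off;
    return n > PG_SIZE ? PG_SIZE : n;
}

/* fill every page in the array from the store; zero-fill past EOF */
static int toy_getpages(struct toy_file *f, struct toy_page **pgs, int npages)
{
    for (int i = 0; i < npages; i++) {
        struct toy_page *pg = pgs[i];
        size_t n;

        if (!page_ok(pg))
            return -1;
        n = valid_bytes(f, pg);
        if (n > 0)
            memcpy(pg->data, f->store + pg->offset, n);
        memset(pg->data + n, 0, PG_SIZE - n);
        pg->dirty = 0;
    }
    return 0;
}

/* write every dirty page in the array back to the store, up to EOF */
static int toy_putpages(struct toy_file *f, struct toy_page **pgs, int npages)
{
    for (int i = 0; i < npages; i++) {
        struct toy_page *pg = pgs[i];
        size_t n;

        if (!page_ok(pg))
            return -1;
        if (!pg->dirty)
            continue;
        n = valid_bytes(f, pg);
        if (n > 0)
            memcpy(f->store + pg->offset, pg->data, n);
        pg->dirty = 0;
    }
    return 0;
}
```

the caller owns the page array in this sketch, which is roughly the
arrangement where page allocation happens outside the filesystem;
moving allocation into the getpages side would change who builds
the array but not the loop.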

I tried to ifdef everything so that I could still build non-UBC kernels
from this code but I got kinda slack about that after a while so there
are no doubt places that I've missed.  I didn't know whether eventually
this would end up as an option in the main tree or as the default on a
branch, so the result of this is that the code is mostly ifdef'd and
pretty messy.  the reason I'm distributing this now in this half-assed
form is to acknowledge that if I try to finish it all myself it'll
take forever, and we'd like it to be done sooner than forever.
this diff also contains my changes for "swapctl -d" and some pagedaemon
improvements that have been awaiting review for a long time.

oh yea, the way to turn on this code is to put "options UBC" in
your kernel config file.


here's the list of stuff that remains to be done (and there's
probably plenty more that I'm forgetting too):

async i/o, readahead, clustering, partial-page stuff, and dynamic
	buffer-cache resizing.  I have ideas, but haven't had time to
	do anything about these yet.  most of this should be fairly
	straightforward (except maybe partial pages).
more work needed for nfs.
	we should probably change processing of attr updates that accompany
	write rpc replies to be deferred until the entire write operation
	is completed.  this will enable us to add multiple pages (or whatever
	unit) to the file in one shot without invalidating the ones that
	weren't part of the first write rpc.
	this would probably also allow us to do locking on nfs nodes.
	also, dirty-region stuff should be put back in, probably
	one region per file.
merge struct uvm_vnode into struct vnode inline, eliminating uvm_vnode
	this will make the code much cleaner and make it easier to clean up
	some stuff related to this that I've been lazy about:
	UVM_VNODE_IOSYNC vs. VBWAIT
	getnewvnode() vs. vget() vs. uvn_attach()
fix handling of partial page after EOF
	in uvn or in ubc?   this stuff mostly depends on making nfs
	play better.
VSIZENOTSET stuff is wrong, it should be removed in favor of restructuring
	vnode creation to do whatever is needed to learn the size
	before returning the new vnode to the caller.
	alternatively, change the uvm code so we don't need to know the size.
UVM_VNODE_RELKILL should probably be put back in somehow
	this is probably another bit of fallout from vnodes and uvns
	not being merged.  or perhaps we should put in some way to
	disassociate busy vnode pages from dying objects so that we can
	kill the object right away.
remove recursive locking in uvn_io()
	should be replaced with separate locks for read()/write() vs getpages().
add support for the other filesystems
	currently only ffs and nfs supported.
add limits on pagecache size
	some people want to limit the amount of memory that the pagecache
	will use.  (I personally think that letting everything compete equally
	is fine, but putting in sysctl hooks is fine as long as I can
	adjust it to make it do what I want.)  I think that doing what
	Digital Unix does would be fine: hi- and lo-water marks for
	pagecache usage (or really, combined pagecache and buffer cache).
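
for the hi- and lo-water mark idea in the last item, one common way
to read such a pair of marks is as hysteresis: do nothing until usage
goes above the hi mark, then reclaim down to the lo mark.  the sketch
below shows just that arithmetic.  it is an assumption about how the
marks could be used, not a description of what Digital Unix actually
does, and the names (toy_cache_limits, toy_pages_to_reclaim) are
invented for the example.

```c
/* sketch only; names and policy are hypothetical */
#include <assert.h>

struct toy_cache_limits {
    unsigned long lowat;    /* reclaim down to this many pages */
    unsigned long hiwat;    /* start reclaiming above this many pages */
};

/*
 * how many cache pages to reclaim given the current usage.
 * nothing happens until usage exceeds hiwat; once it does, enough
 * pages are reclaimed to bring usage back down to lowat.
 * assumes lowat <= hiwat.
 */
static unsigned long
toy_pages_to_reclaim(const struct toy_cache_limits *l, unsigned long inuse)
{
    assert(l->lowat <= l->hiwat);
    if (inuse <= l->hiwat)
        return 0;
    return inuse - l->lowat;
}
```

the gap between the two marks keeps the reclaim from kicking in again
on every single page allocated once the cache is near its limit, and
setting both marks very high effectively lets everything compete
equally.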




finally, the diff (against -current of this morning):

http://www.chuq.com/netbsd/diff.ubc.980920




questions and comments are welcome, either on this list or privately.
if people would like to write some portion of the code that's left to
be written, that'd be great too.  I'm probably going to continue
doing cleanup-type work on this for a while, so good things for others
to work on would be supporting more architectures and filesystems,
and if someone would like to take on readahead, that'd be fabulous.

-Chuck