Subject: Limitations of current buffer cache on 32-bit ports
To: None <tech-kern@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 07/16/2002 12:13:53

While building the new AnonCVS server, we noticed a few minor nits that
affect 32-bit machines with very large memories (e.g. our 3.5GB x86 anoncvs
box).  Probably the most significant is that, even with a very large page
cache and a sufficient number of vnode cache entries, we still do some
entirely unnecessary disk I/O because of limitations of the traditional
buffer cache, which is used for filesystem metadata (e.g. directories).

In particular, the AnonCVS repository contains about 30,000 directories.
Thus, a traversal of the entire tree will involve several thousand I/O
operations unless NBUF is at least 30,000.  Worse, /anon-root/tmp (which
is used for temporary files for CVS checkouts) will contain at least a
significant subset of the directories in /anon-root/cvsroot; this filesystem
is mounted "async", but does a significant amount of I/O nonetheless as
directories must be flushed to free buffers in the buffer cache.
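
As a rough illustration of why a tree larger than the cache hurts so much,
here is a toy simulation.  It assumes one buffer per directory and strict
LRU replacement, which is a simplification and not a model of the real
buffer cache; the function name and parameters are made up for this sketch.

```python
from collections import OrderedDict


def count_misses(num_dirs, nbuf, passes):
    """Count cache misses for repeated in-order traversals of num_dirs
    directories through a cache of nbuf buffers.

    Assumptions (illustrative only): one buffer per directory, strict LRU.
    """
    cache = OrderedDict()  # directory id -> True, oldest entry first
    misses = 0
    for _ in range(passes):
        for d in range(num_dirs):
            if d in cache:
                cache.move_to_end(d)  # hit: mark as most recently used
            else:
                misses += 1  # miss: this would be a disk read
                if len(cache) >= nbuf:
                    cache.popitem(last=False)  # evict least recently used
                cache[d] = True
    return misses


# 30,000 directories, three full traversals.
small = count_misses(30000, 6144, passes=3)   # cache smaller than the tree
large = count_misses(30000, 30000, passes=3)  # cache holds the whole tree
print(small, large)
```

Under these assumptions, every lookup misses when the tree does not fit,
while a cache that holds the whole tree misses only on the first pass.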

The buffer cache consumes NBUF * MAXBSIZE bytes of KVA space; thus NBUF
must be limited to prevent KVA exhaustion.  By default, on the i386 port,
there's a hard limit of 6144 buffers.  Since the buffer cache is no longer
involved in clustering filesystem I/O, we can still cluster beyond MAXBSIZE
even if it's reduced.  (I *think* all the right places enforce a MAXPHYS
limit on transfers, not MAXBSIZE; this would be worth another check,
though.)  So, after thinking this over a bit, I reduced MAXBSIZE on the
machine to 32768.
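
The arithmetic behind this can be sketched as follows.  The 64K starting
value for MAXBSIZE is an inference from the numbers in this message (halving
it doubles the buffer count), and the helper names are invented for the
example.

```python
KIB = 1024
MIB = 1024 * KIB


def buffer_cache_kva(nbuf, maxbsize):
    """KVA consumed by the buffer cache: NBUF * MAXBSIZE bytes."""
    return nbuf * maxbsize


def max_nbuf(kva_budget, maxbsize):
    """Largest NBUF that fits in a fixed KVA budget."""
    return kva_budget // maxbsize


# Assumed starting point: 6144 buffers of 64K each.
budget = buffer_cache_kva(6144, 64 * KIB)
nbuf_32k = max_nbuf(budget, 32 * KIB)

print(budget // MIB, "MiB of KVA")
print(nbuf_32k, "buffers at MAXBSIZE = 32768")
```

In other words, the same KVA budget holds twice as many buffers once
MAXBSIZE is halved.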

(Some confirmation that we're still doing 64K xfers to the disks would be
good, but we appear to be.  If not, I suppose I should go through and change
MAXBSIZE to MAXPHYS wherever it still needs to be changed.)

This got us 12,288 directories cached.  Not much better.  Worse, however,
is that the geometries of many modern disks indirectly force FFS to use
very large blocksizes; unless I misread the code, it's not safe to reduce
MAXBSIZE below the filesystem block size -- which is, in this case, 32K
at a minimum.  Thus 12,288 directories cached is the best I can do; and
we will continue to do unnecessary I/O for this reason.

There are, obviously, very-large-memory workloads for which this won't
be a limitation, such as database servers or FTP or web servers serving
up large files.  However, on many other machines it's a significant
limitation.  In this case, even with a temporary filesystem mounted
"async", we still *wait* for most I/O to that filesystem, because we
have to flush buffers before we can reuse them.  This probably hurts
softdep performance in the same way.

It would seem that there are multiple avenues of attack to be pursued
here.  To begin with, the filesystem shouldn't force such a large
blocksize on us (though in this case, I'd need 8k blocks).  This would
require incompatible FFS layout changes to fix, however -- or not using
FFS.  Unfortunately, we don't have tools to build ext2 filesystems, and
LFS doesn't play nice with the merged buffer cache.  A hack is possible
involving automatically choosing "geometries" with small "cylinders"
when writing the disklabel to a new disk, but this has other problems,
including the potential for bad interactions with the BIOS in the boot
process.  Also, perhaps metadata should not be required to fit in single
buffers; then MAXBSIZE could be reduced greatly.  Finally, unless some of
the comments in the NFS code are wrong, there's still a MAXBSIZE restriction
on NFS xfers, which would mean that NFS performance would suffer greatly if
MAXBSIZE were reduced.
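
The 8k figure mentioned above can be reproduced with a back-of-the-envelope
calculation.  This assumes the KVA budget stays at 6144 * 64K (the 64K value
is inferred from the numbers in this message) and that buffer sizes are
powers of two; the helper name is made up for the sketch.

```python
KIB = 1024


def largest_power_of_two_at_most(n):
    """Largest power of two that is <= n, for n >= 1."""
    return 1 << (n.bit_length() - 1)


# Assumed fixed KVA budget: 6144 buffers of 64K each.
kva_budget = 6144 * 64 * KIB
wanted_buffers = 30000  # roughly one per directory in the repository

# Each buffer can be at most this big if 30,000 are to fit in the budget.
bytes_per_buffer = kva_budget // wanted_buffers
bufsize = largest_power_of_two_at_most(bytes_per_buffer)

print(bytes_per_buffer, bufsize, kva_budget // bufsize)
```

A 16K buffer size would only allow 24,576 buffers in the same budget, which
is why the next size down is the one that covers 30,000 directories.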

Basically, ISTM there are two problems.  1) The use of fixed-size buffers
allocated in kernel space for all metadata.  2) Filesystem limitations
that force that fixed size to be quite large.

How do other systems which retain a "traditional" buffer cache for
metadata deal with this issue?

-- 
 Thor Lancelot Simon	                                      tls@rek.tjls.com
   But as he knew no bad language, he had called him all the names of common
 objects that he could think of, and had screamed: "You lamp!  You towel!  You
 plate!" and so on.              --Sigmund Freud