Subject: Re: Disk buffers
To: None <current-users@NetBSD.ORG>
From: Christoph Badura <bad@flatlin.ka.sub.org>
List: current-users
Date: 02/25/1997 03:15:00
laine@MorningStar.Com (Laine Stump) writes:
>When I do one make at a time, the disk transfers (shown
>by systat vmstat) stay between 120-350k/sec most of the time while
>still keeping all 5 CPUs almost completely busy, but when I do 2 or more
>makes, the behavior diverges after awhile, with disk stats showing
>1500-1800k/sec while two or more of the CPUs are under 30%
>utilized.

You should also monitor the number of interrupts per second on the
disk controller, the number of transfers per second, and the average
size of each transfer.  You can get the latter two with "iostat -d -D",
but iostat won't report on the interrupts for the disk controller.
I'm also under the impression that iostat's numbers aren't entirely
accurate: I've seen it report an average transfer size larger than the
total amount of data transferred when it reported on a single transfer.
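
A minimal way to watch all three, assuming NetBSD's stock tools (the
exact output columns vary between versions; "vmstat -i" prints
cumulative per-device interrupt counts, so run it twice and take the
difference to get a rate):

	iostat -d -D	# transfers/sec and volume per disk
	vmstat -i	# interrupt counts per device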

>This leads me to believe that the disk cache is large enough
>that all the .h files remain in the cache from one cc to the next (and
>for all the simultaneous cc's on multiple machines) if there is a single
>build, but doing multiple makes puts just enough data through that stuff
>starts to get flushed just before it is needed again.

It might also be caused by internal waste in the buffer cache.  Last
week I noticed the following on a P150 with 32MB RAM and an IDE disk
while doing a "make build": vmstat reported around 60 xfers/sec for
about 150 to 200 K/s.  However, the machine was 50% idle (between 15
and 20 percent were system time).  When I started a "make depend;make"
on a kernel in parallel, the machine started to crawl as if it were
thrashing.  vmstat reported lots of free pages, but the disk was as
busy as during "make build" running alone.  I interpret that as the
disk being maxed out with I/O requests.
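
(For what it's worth, those numbers work out to an average of roughly

	150-200 KB/s / 60 xfers/s =~ 2.5-3.3 KB per transfer

i.e. the disk was doing many small transfers rather than a few large
ones, which fits the internal-waste theory below.)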

I rebuilt the kernel with "options NBUF=1024" (it was using 430
buffers before).  Note that I didn't increase the amount of memory
allocated to the buffer cache -- only the number of buffers.  After
booting the new kernel, performance improved considerably.  I can now
run both a "make build" and a "make depend;make" on a kernel in
parallel without the machine starting to crawl.  Also, the interrupt
load and the number of transfers per second were cut in half.  I get
more than 50% CPU utilisation too, sometimes even 100%, which I
didn't get before.

I suspect something's wrong with the sizing of the buffer cache.

There have been several changes since 4.3BSD that haven't been
reflected in the sizing of the buffer cache:

On the VAX you'd get one buffer per 2KB of memory allocated to the
buffer cache.  On the i386 you get one buffer per 4KB.  I.e., for the
same amount of cache memory you get half the number of buffers.
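
A back-of-the-envelope sketch of the difference (illustrative
constants only -- the real computation lives in each port's
machdep.c):

	/* illustrative only, not the actual machdep.c sizing code */
	#include <stdio.h>
	int
	main(void)
	{
		int bufmem = 2 * 1024 * 1024;	/* memory given to the cache */
		printf("VAX:  %d buffers\n", bufmem / (2 * 1024));	/* 1024 */
		printf("i386: %d buffers\n", bufmem / (4 * 1024));	/*  512 */
		return 0;
	}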

On the VAX, buffers could have their memory reassigned to other
buffers with 1KB granularity.  On the i386 the granularity is 4KB.
I.e., if you read in a 1KB file on the VAX, it could be cached in a
1KB buffer.  On the i386 it takes a 4KB buffer, because you can't
remap less than a page.  That means 75 percent of that buffer is
wasted internally.  If you're caching a large number of small files,
you can waste a large part of your buffer cache this way.
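
To put a number on it: caching, say, 500 one-kilobyte header files
ties up 500 * 4KB = 2MB of buffer memory while holding only 500KB of
useful data -- three quarters of that memory is dead weight.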

Worse, in 4.3 the buffer cache was tagged by device and physical block
number, meaning that fragments from different files could share a
buffer, while in 4.4BSD the buffer cache is tagged by vnode and
logical block number, meaning that fragments from different files
can't share a buffer.
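
Schematically, the lookup keys changed like this (illustrative
declarations only, not the actual struct buf fields):

	/* 4.3BSD: buffers keyed by device and physical block number */
	struct key43 { dev_t dev; daddr_t blkno; };

	/* 4.4BSD: buffers keyed by vnode and logical block number */
	struct key44 { struct vnode *vp; daddr_t lblkno; };

Under the 4.3 scheme, two small files whose fragments share a disk
block naturally end up in the same buffer; under the 4.4 scheme they
can't, because their vnodes differ.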

This all seems to indicate that you need a larger buffer cache on
4.4BSD, but at least for the i386, this isn't reflected in the code
that sizes the buffer cache.

Unfortunately, the buffer cache code and the code in vfs_cluster.c
don't seem to have ever been instrumented.  If someone has
suggestions on how to instrument them, I'd be interested.
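
The cheapest starting point I can think of would be a handful of
static event counters in the lookup and remap paths, dumped from ddb
or via a sysctl (hypothetical names -- these don't exist in the
kernel today):

	/* hypothetical instrumentation counters */
	long bc_lookups, bc_hits, bc_remaps, bc_pages_wasted;

Even hit/miss ratios and a running count of internally wasted pages
would tell us a lot.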

>So my question is - which knobs do I tweak to give myself more disk
>cache? And do I need to do it just on the NFS server, or on the clients
>as well?

You can frob NBUF and BUFPAGES.  As a first guess, I'd increase
BUFPAGES by 50 percent and then hardwire NBUF to twice what the
kernel would allocate by default.  You probably want to do this on
both the clients and the server.
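
For example, in the kernel config file (the numbers are hypothetical;
derive them from what your kernel currently autoconfigures at boot):

	options BUFPAGES=1200	# ~50% more cache pages than the default
	options NBUF=2048	# about twice the default buffer count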

If you can gather performance data, please send it in my direction.
I'm very interested in this data.

>On a slightly different note - during a pissing contest with a coworker
>who is using FreeBSD for similar stuff (he only has one machine though -
>I win!), I looked at build time with a single threaded make and saw
>that, although user time for the entire make was within 5%, the system
>time for mine (NetBSD 1.2 release) was nearly 40% more than his (FreeBSD
>2.1.6).

Off the top of my head, I'd blame that mainly on the VM system and
the rest on their changes to the buffer cache.
-- 
Christoph Badura	bad@flatlin.ka.sub.org

You don't need to quote my .signature.  Everyone has seen it by now.
Besides, it doesn't add anything to the current thread.