Subject: Re: Suggestions on CCD Interleave (also, large-block large-cyl UFS)
To: None <current-users@netbsd.org>
From: Alexis Rosen <alexis@panix.com>
List: current-users
Date: 01/31/1999 14:35:52
cjs@cynic.net (Curt Sampson) wrote:
> Has anyone done any benchmarking work on ccd interleaves to determine
> which ones are better and worse? I notice that the ccdconfig manual
> page gives 16 as an example interleave, but I've found that powers
> of two tend to concentrate inode activity on one drive (I've seen
> cases where I untar a file and one drive in the ccd is quite active
> while the other is virtually idle). I've been using 96, (which is
> about 20% faster than 32 when untarring a large directory tree with
> a lot of small files in it), but haven't really done a lot of
> experimentation yet to see if perhaps prime interleaves or something
> like that would be more optimal.

I did some benchmarks a couple of years ago to see what would work best
for a traditional news system (ie, lots of small-file writes and reads).

To my surprise, really big interleaves were best, at least for reads. And,
like Curt, I saw that non-power-of-two sizes were a win, though I didn't
figure out why or pursue that as strongly as I should have.

Here's a copy of a posting that I never made to news.software.nntp for
reasons not worth going into... Note that my theory about the track cache
effect seems, in retrospect, totally dumb, and if I'd had time to think
about it more I hope I'd have come up with something better. Also note that
the time difference between 126-block and 2048-block interleaves is pretty
minimal, so this suggests two things:
 - a 2046-block interleave (non-power-of-two, like 126) might have been an
   even bigger win
 - if 2046 blocks doesn't turn out to be a bigger win, it pays to stick
   with 126; even if it does, 126 may still be the better choice.
The second conclusion has to do with writes. There is a big penalty you pay
when writing larger files on CCDs with large interleaves: you don't get the
benefit of writing to multiple disks simultaneously. For that, you need an
interleave small enough that pending writes can be cached by the disk, so
the write call can return and the application can proceed to writing the
next portion of the file.
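
To make the write-side point concrete, here's a quick illustration (not
something from my tests) of which component disk each 16-block chunk of a
128-block write lands on in a 3-disk stripe. It assumes the simple uniform
mapping disk = (block / interleave) mod ndisks and ignores ccd's internal
interleave-table details:

    # illustration only; the mapping formula is a simplification
    ndisks=3
    for ileave in 2048 16; do
        echo "interleave $ileave:"
        b=0
        while [ $b -lt 128 ]; do
            end=`expr $b + 15`
            disk=`expr $b / $ileave % $ndisks`
            echo "  blocks $b-$end -> disk $disk"
            b=`expr $b + 16`
        done
    done

With the one-cylinder interleave every chunk lands on disk 0, so the write
is serialized onto a single spindle; with the 16-block interleave the same
write rotates across all three disks.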

BTW, on a related topic, has anyone tried to stress a UFS with blocksize 32k,
really large CGs, and maxcpg 100%? In theory, that should produce the best
results for extremely large files, I think. But I vaguely recall running
into serious trouble when I tried this.
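
For concreteness, the sort of newfs invocation I mean is roughly this; the
device name and the -c value are placeholders, and newfs(8) is the
authority on how many cylinders per group a given block size actually
permits:

    # hypothetical example, not a tested recipe:
    # 32k blocks, 4k frags, cylinder groups as large as newfs will allow
    newfs -b 32768 -f 4096 -c 128 /dev/rsd0e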

/a
---
Alexis Rosen   Owner/Sysadmin,
PANIX Public Access Unix & Internet, NYC.
alexis@panix.com

--------------------------------->% cut here %<-----------------------------
I've been following this thread with some interest. I spent a little time
doing some tests. I wrote some fairly trivial scripts to produce a filesystem
that is a moderately reasonable simulation of a news filesystem (~1.2 million
files in 390 directories, mixed evenly between 930, 3030, and 6102 byte
files). There are obviously ways in which this differs from a real news
spool but it seemed reasonable for testing purposes.
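
The scripts themselves were nothing special; the general idea was along
these lines (a reconstruction for illustration, not the actual scripts, and
the per-directory count is approximate):

    #!/bin/sh
    # build a spool-like test tree: 390 directories, ~3000 files each,
    # cycling through 930-, 3030-, and 6102-byte files
    d=0
    while [ $d -lt 390 ]; do
        mkdir -p spool/group$d
        f=0
        while [ $f -lt 3000 ]; do
            for s in 930 3030 6102; do
                dd if=/dev/zero of=spool/group$d/$f bs=$s count=1 2>/dev/null
                f=`expr $f + 1`
            done
        done
        d=`expr $d + 1`
    done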

I set up a P-133 with 32MB running NetBSD 1.2-current, with three wide 2GB
SCSI disks on a single controller for testing and another IDE drive just to
hold the OS and scripts. No swapping happens during these tests, so the IDE
disk shouldn't really figure in at all.

Every test was run on a newly formatted filesystem.

I used ccd to stripe the SCSI disks using a variety of interleaves. The first
thing I found out was that there was no significant difference between
filesystems with 4k and 8k block sizes. In fact, the 8k block size tested
out as marginally (~1%) faster. This surprised me a bit, since you could
argue that my test was in fact slightly skewed in favor of the 4k block size
(note the sizes chosen for the files).
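
Each configuration looked more or less like this (component devices, the
partition letter, and the flags word are placeholders rather than my actual
setup; the interleave argument is in 512-byte blocks):

    # illustrative only -- see ccdconfig(8) and newfs(8)
    ccdconfig ccd0 126 0 /dev/sd0e /dev/sd1e /dev/sd2e
    # (a disklabel step may be needed here; omitted for brevity)
    newfs -b 8192 -f 1024 /dev/rccd0c
    mount /dev/ccd0c /test

with the interleave and the newfs -b/-f values varied from run to run.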

After deciding that blocksize variants weren't worth testing, I varied the
interleave size from 8 to 32768 blocks, i.e. from half an fs block up to an
entire cylinder group (cylinders are 1MB, or 2048 blocks, and cgs are 16
cyls). The
test consisted of 30 readers simultaneously trying to read 1000 files in
sequence from the test tree. Each reader read a different set of 1000 files,
but each test chose the same 30 lists of files to read. Here are the times
to complete:
interleave  time    notes
----------  ------  ---------------------------------------------
8           (lost)  slower than everything else
32          26:17
126         20:40
128         26:06
2048        19:01
8192        19:34   filesystem cylinder group size was 4 cyls instead of 16
32768       20:20   second test: 20:09
32768       20:07   kernel profiling turned on (?!)
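
The reader harness was essentially along these lines (a sketch, not the
actual script; list.1 through list.30 are assumed to each hold one reader's
1000 filenames, one per line):

    #!/bin/sh
    # 30 concurrent readers, each reading its own list of 1000 files
    i=1
    while [ $i -le 30 ]; do
        ( while read f; do
              cat "$f" > /dev/null
          done < list.$i ) &
        i=`expr $i + 1`
    done
    wait

run under time(1) to get the completion times above.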

Several tests were repeated. In some cases, there was *no* variation between
runs. In others, the variance was a few seconds. However, with the interleave
at 32k blocks, kernel profiling caused the time to improve by 13 seconds. I
won't begin to pretend to understand that... (Note that a second test I
performed later didn't show this discrepancy, so it's probably meaningless.)

There were several surprises here. The biggest was the improvement in going
from a 128-block to a 126-block interleave. I used 126 because that was the
physical track size reported by the disk. The reason I find this so
surprising is that most modern SCSI disks are ZBRed, so the reported track
size is a fictional average over all the zones of the disk. So why this
*huge* win? I don't know, but I'll theorize that it might be related to the
amount of data that the disk reads and caches each time it reads (the
"track cache"). If the track cache is 126 or 127 blocks, a 128-block read
is severely suboptimal: 252 or 254 blocks actually get read, and there's
typically no readahead win because of the random nature of the reads. If
the disk's controller is reporting a track size of 126 blocks, perhaps it's
actually using that size for the track cache.

The other surprise was that "huge is good, but huger isn't better". In other
words, using an interleave of one cylinder (1MB, 2048 blocks) yielded
the best times of all, but using an interleave the size of a cylinder group
(either a smaller-than-normal 4-cylinder group or a regular 16-cylinder group)
was slower (although faster than any times for "small" interleaves).

This is actually a Good Thing because other sorts of operations (long reads
and writes, which this test doesn't measure, and which are a significant
minority factor in news spool performance) benefit much more from smaller
interleaves.

So...