Subject: Re: results from playing around with the new dirpref code
To: None <tech-perform@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-perform
Date: 09/03/2001 13:15:49
Looking at all of Luke's tables, I note that FFS performs *terribly* when
there are a large number of cylinder groups and when there are not enough
vnodes for it to be able to cache data for all of the newly created files/
directories.

This doesn't surprise me.  Running out of vnodes forces directory and file
data writes; having a large number of cylinder groups (because allocation
is spread ~evenly among the CGs) means that to write an arbitrary file
you are far more likely to have to seek.  The interaction between the
insane number of cylinder groups we create on modern disks and the far,
far too small default maxvnodes is truly poisonous; it makes softdep look
particularly bad because it's being forced to write out whole subtrees of
dependencies, seeking the head all over the disk, when a given vnode is
recycled.

The new dirpref code helps, but I can't help seeing it as a bit of a
band-aid.  The real problem is that we are running with an insane number
of cylinder groups, orders of magnitude greater than the designers of the
FFS ever intended.

Worse, with modern disks, because the disk addresses of all of the inodes 
for a CG must fit in the CG head block, with an 8K blocksize we *cannot*
have a reasonable (small) number of cylinder groups.

We should either tune away every single "switch cylinder groups now"
behaviour in FFS that we can find, or kick the default blocksize and cpg.

To get a reasonable number of cylinder groups we need to kick the default
blocksize to at least 16K; we also need to adjust the geometries of some
of our "logical disk" type drivers to make the cylinders themselves bigger
(for example, RAID should use a *multiple of the stripe size*, not the 
stripe size, as its cylinder size) or we can't avoid the problem there at
all.

Think about it this way: a Fujitsu Eagle is 404MB and has 842 cylinders.
At 16 cylinders per group that is 842/16, or about 52, cylinder groups, so
if you arbitrarily select a block, there's a 1/52 chance that it's in the
same CG as the last block you used.  My
IBM DTLA-305040 is 40GB and has 16383 cylinders -- that means that with our
(insane) defaults there's a 1/1023 chance that a randomly-chosen block will
be in the same group as the previous one.  A cylinder group on the Eagle 
represents 1/52 of the disk; on the IBM it represents 1/1023 of the disk.
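
For anyone who wants to redo that arithmetic, here is a rough sketch.  It
assumes 16 cylinders per group (taken from the 842/16 figure above) and
counts only whole groups; the function name is made up for illustration.

```python
def whole_cylinder_groups(cylinders, cpg):
    """Number of complete cylinder groups for a given cylinders-per-group."""
    return cylinders // cpg

# Cylinder counts as quoted in this message; cpg=16 is an assumption
# inferred from the 842/16 figure.
for name, cylinders in (("Fujitsu Eagle", 842), ("IBM DTLA-305040", 16383)):
    ncg = whole_cylinder_groups(cylinders, 16)
    print(f"{name}: {ncg} groups, so a 1/{ncg} chance of staying in one")
```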

The DTLA's average seek time is 9.5ms.  The Eagle's average seek time was
18ms (the Eagle was unusually good for a drive of its era!).  As you can
see, seek time hasn't improved *nearly* as fast as transfer rate (the Eagle
could sustain about 1MB/sec; the DTLA can sustain about 30MB/sec).  So, if
we force two orders of magnitude more seeks, we should expect the DTLA to
handle the workload *slower* than the Eagle would have -- after all, we're
looking at 100 times as many seeks, but they're only *twice* as fast.
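
A back-of-the-envelope version of that comparison, using only the drive
figures quoted above and taking the "100 times as many seeks" number at
face value.  The names are illustrative, and this ignores rotational
latency, caching, and everything else a real workload involves.

```python
# Figures quoted in this message (Eagle vs. IBM DTLA-305040).
EAGLE_SEEK_MS, DTLA_SEEK_MS = 18.0, 9.5
EAGLE_MB_S, DTLA_MB_S = 1.0, 30.0

seek_speedup = EAGLE_SEEK_MS / DTLA_SEEK_MS    # how much faster one seek got
transfer_speedup = DTLA_MB_S / EAGLE_MB_S      # how much faster transfers got

def seek_time_ratio(seek_multiplier):
    """Total seek time on the DTLA relative to the Eagle, if the DTLA
    has to perform seek_multiplier times as many seeks."""
    return seek_multiplier * DTLA_SEEK_MS / EAGLE_SEEK_MS

print(f"seek speedup:     {seek_speedup:.1f}x")
print(f"transfer speedup: {transfer_speedup:.0f}x")
print(f"100x the seeks costs {seek_time_ratio(100):.0f}x the seek time")
```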

A blocksize of 16K lets us use 328-cylinder cylinder groups, which
conveniently gets us 49 cylinder groups, or about the same number as we
would have had on the Eagle.  However, since we can see that for many
workloads seek time has come to dominate transfer rate more than ever
before, we need to do something to reduce the number of seeks further,
particularly since the kernel wastes so much of its time *synchronously
waiting* for seeks to complete.
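
As a quick check of the 328-cylinder figure, here is a sketch that again
counts only whole groups.  The cylinder count is the DTLA's from above;
the helper name is hypothetical.

```python
DTLA_CYLINDERS = 16383  # IBM DTLA-305040, as quoted in this message

def groups_at(cpg, cylinders=DTLA_CYLINDERS):
    """Whole cylinder groups on the disk at a given cylinders-per-group."""
    return cylinders // cpg

# 16 is the cpg assumed from the 842/16 figure; 328 is the value that a
# 16K blocksize is said to allow.
for cpg in (16, 328):
    print(f"cpg={cpg}: {groups_at(cpg)} cylinder groups")
```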

Track-to-track seek on the DTLA is much lower than on the Fujitsu: 1.6ms
compared to 6ms, or about 0.17 of the average seek time instead of 0.33.
This does suggest that optimizing things like ffs_dirpref to cause short
instead of long seeks will help -- but not as much as causing fewer seeks,
period.
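
The short-seek versus average-seek ratios can be recomputed like so.  This
is only a sketch using the four seek times quoted above; the function name
is made up.

```python
def short_seek_fraction(track_to_track_ms, average_ms):
    """Track-to-track seek time as a fraction of the average seek time."""
    return track_to_track_ms / average_ms

# (track-to-track ms, average seek ms), as quoted in this message.
drives = {"IBM DTLA-305040": (1.6, 9.5), "Fujitsu Eagle": (6.0, 18.0)}

for name, (t2t, avg) in drives.items():
    print(f"{name}: short seek is {short_seek_fraction(t2t, avg):.3f} of average")
```

That comes out to roughly a sixth for the DTLA against a third for the
Eagle, which is the gap that makes short seeks worth preferring.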

We should actually benchmark filesystems with many *fewer* cylinder groups
(32K filesystems) against those with about 50 cylinder groups on modern
disks to see which way we handle real workloads better.  However, I think
there's plenty of evidence to support switching to 16K blocks and as many
cpg as we can get, given the geometry (about 300 for most new disks) right
now.

-- 
Thor Lancelot Simon	                                      tls@rek.tjls.com
    And now he couldn't remember when this passion had flown, leaving him so
  foolish and bewildered and astray: can any man?
						   William Styron