Subject: Hidden dependencies on 64k MAXPHYS
To: None <tech-kern@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 05/07/2006 12:22:26
I have been experimenting with a 128K MAXPHYS.  The results have been much
less successful than what I tried about two years ago.  It seems there
are more dependencies on a 64K (or smaller) MAXPHYS than I expected, and in
a few cases more than there used to be.

Here is a short list for the convenience of anyone really working on this
(I don't have time to do much more than just play):

1) genfs_putpages now blows up if it sees more than MAXPHYS worth at once.

2) vfs_bio now knows about the minimum and maximum buffer sizes for metadata
   I/O.  The code that initializes the buffer pools needs to get smarter;
   more importantly, we probably need to cap metadata object sizes below the
   maximum data transfer size (that is, support running with MAXBSIZE <
   MAXPHYS) if we're ever going to allow large transfers for contiguous
   data.  (Imagine doing an LFS segment write of 1MB with one SCSI or LBA48
   write command.  There are very good reasons to do that, but they don't
   imply that you want buffer pools for each power of two from 512 to 1MB!)

3) All large reads and writes in the filesystem are now basically done by
   the page clustering code.  Transfers are limited to 64K by some really,
   really unobvious and ugly magic shifting of bits.  This is because...

4) For explicit I/O with read() and write(), we discard the transfer size
   requested by the user on the way down to the pager, which actually does
   the work.  Instead, we prefault *one* page from the request, which is
   turned into two pages on the way down through UVM; but for a 1MB read,
   we still rely on readahead to magically bring in the first 64K, then
   future faults to bring in each next 64K.
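
To put a number on the buffer pool remark in item 2, here is a rough sketch
of the arithmetic.  It assumes one pool per power-of-two size, as the
parenthetical describes; pool_classes() is a hypothetical helper written for
illustration and is not taken from vfs_bio.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not NetBSD code): count the power-of-two size
 * classes from min_size up to and including max_size.  min_size must
 * itself be a power of two.
 */
static unsigned
pool_classes(size_t min_size, size_t max_size)
{
	unsigned n = 0;
	size_t sz;

	assert(min_size > 0 && min_size <= max_size);
	assert((min_size & (min_size - 1)) == 0);

	for (sz = min_size; ; sz <<= 1) {
		n++;
		/* Stop before doubling would pass max_size (or overflow). */
		if (sz > max_size / 2)
			break;
	}
	return n;
}
```

Under that assumption, pool_classes(512, 64 * 1024) is 8 and
pool_classes(512, 1024 * 1024) is 12, so letting the metadata buffer size
follow a 1MB transfer size would add four more pools that metadata is
unlikely to need.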

Problems 1 and 2 are new; problems 3 and 4 are old.  Fixing problem 4 looks
ugly, but it is really just a SMOP (handing more pages down, maybe with
WILLNEED, if the user says to) once we have considered the implications of
doing so.  It would remove the need to read or write MAXPHYS all the time,
which would make problem 3 easier to solve.  We would then need an algorithm
to decide _when_ to do readahead, and how much; right now we get away with
being stupid.
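
For illustration only, one common shape for such an algorithm is a window
that grows while access stays sequential and collapses otherwise.  This is
a generic sketch, not what UVM or genfs does and not a proposal from this
message; struct ra_state, ra_update() and the window parameters are all
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file readahead state; not a NetBSD structure. */
struct ra_state {
	size_t next_off;	/* where a sequential read would start */
	size_t window;		/* current readahead size in bytes, 0 = none */
};

/*
 * Record a read of [off, off + len) and return how many bytes to read
 * ahead.  A read that starts where the previous one ended is treated as
 * sequential: the window starts at min_window and doubles up to
 * max_window.  Any other read resets the window to zero.
 */
static size_t
ra_update(struct ra_state *ra, size_t off, size_t len,
    size_t min_window, size_t max_window)
{
	assert(min_window > 0 && min_window <= max_window);

	if (off != ra->next_off)
		ra->window = 0;			/* random access */
	else if (ra->window == 0)
		ra->window = min_window;	/* first sequential hit */
	else if (ra->window > max_window / 2)
		ra->window = max_window;	/* clamp, avoid overflow */
	else
		ra->window *= 2;

	ra->next_off = off + len;
	return ra->window;
}
```

With a zeroed state, a 64K minimum and a 256K maximum, four back-to-back
64K reads from offset 0 yield windows of 64K, 128K, 256K and 256K; a read
at an unrelated offset then drops the window back to 0.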

There are probably other issues.  These are the ones I noticed in a quick
attempt to get LFS to write each segment in one transfer.

LFS is a good place to start working on this because problem 3 won't
prevent large writes from being generated; with simple changes, LFS will
generate writes of up to the maximum allowed size when writing segments.
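
As a sketch of what "up to the maximum allowed size" means for a segment
write, the loop below cuts one contiguous region into transfers no larger
than a given limit.  It is not LFS code: split_transfer(), the xfer_fn
callback and its arguments are hypothetical stand-ins for whatever actually
issues the I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for "issue one write to the disk". */
typedef void (*xfer_fn)(size_t off, size_t len, void *arg);

/*
 * Split a contiguous region of `total` bytes into transfers of at most
 * `max_xfer` bytes, calling `issue` (if non-NULL) for each one.
 * Returns the number of transfers generated.
 */
static unsigned
split_transfer(size_t total, size_t max_xfer, xfer_fn issue, void *arg)
{
	unsigned n = 0;
	size_t off = 0;

	assert(max_xfer > 0);

	while (off < total) {
		size_t len = total - off;

		if (len > max_xfer)
			len = max_xfer;
		if (issue != NULL)
			issue(off, len, arg);
		off += len;
		n++;
	}
	return n;
}
```

For a 1MB segment this gives 16 transfers with a 64K limit, 8 with a 128K
limit, and a single transfer once the limit reaches 1MB.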

-- 
  Thor Lancelot Simon	                                     tls@rek.tjls.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."      - H.L.A. Hart