Subject: Re: Performance dive
To: Toru Nishimura <nisimura@itc.aist-nara.ac.jp>
From: Chuck Silvers <chuq@chuq.com>
List: port-pmax
Date: 03/18/2001 15:37:26
hi Tohru,

I can confirm there's some degradation from 1.5, but it's not as bad
for me as you're reporting.

running bonnie on my 5000/200 with a 1995-vintage 8GB micropolis disk:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU

today's -1-5:
          256   947 99.4  3084 48.3  1024 28.7   888 99.7  2650 35.2  51.1  7.1
452.259u 277.027s 18:20.86 66.2%        0+0k 16404+86655io 7pf+0w

today's -current:
          256   793 98.5  2390 96.5   976 45.1   805 98.0  2399 65.0  55.4 14.5
459.653u 499.222s 20:26.84 78.1%        0+0k 16+218io 61pf+0w


from this I would say we've become cpu-bound for large writes,
and we're using a lot more cpu in general.

I did some profiling runs on large sequential writes, and here are
the functions using more than 2 seconds of cpu in my test:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    

-1-5:
 50.18     88.46    88.46                             memmove
 15.34    115.49    27.03                             htonl
  8.21    129.97    14.48      126 114924.36 115591.09  mips_idle
  4.64    138.14     8.17                             _mcount
  2.18    141.98     3.84   377824    10.16    10.16  __qdivrem
  1.71    145.00     3.02   128066    23.55    53.35  ffs_blkpref
  1.54    147.71     2.71  1910709     1.42     1.42  _splset
  1.28    149.96     2.25    41210    54.60   174.31  ncr53c9x_intr

-current:
 26.15     89.73    89.73                             memcpy
 21.82    164.61    74.88                             _splset_noprof
  5.60    183.81    19.20                             __mcount
  5.53    202.79    18.98                             _mcount
  5.33    221.08    18.30   481581    37.99    37.99  memset
  4.73    237.32    16.23   255215    63.61    63.61  mips1_FlushDCache
  2.31    245.26     7.94   208315    38.12   212.97  genfs_getpages
  2.20    252.80     7.55   384757    19.61    62.52  pmap_enter
  1.75    258.80     5.99                             _splraise_noprof
  1.38    263.53     4.73   128066    36.97    61.54  ffs_blkpref
  1.28    267.91     4.38   804991     5.44     5.44  lockmgr
  0.82    270.72     2.81   128602    21.87   266.03  uvm_fault
  0.69    273.08     2.36    64035    36.85   308.30  ubc_alloc
  0.68    275.41     2.33    56782    41.00   158.77  ncr53c9x_intr
  0.68    277.73     2.32   593161     3.92    35.59  pmap_remove
  0.67    280.03     2.29   192043    11.94   235.17  ubc_init
  0.61    282.11     2.08    72112    28.87    56.06  uvm_map
  0.59    284.13     2.02  1593440     1.27     1.54  siiintr


it appears that somebody has interrupts blocked during a lot of the
profiling samples, but we can't tell who.  undoing the most recent change
to arch/mips/include/profile.h gives more useful information:

 27.51     86.72    86.72                             memcpy
 20.02    149.82    63.10                             __mcount
  5.98    168.66    18.84                             _mcount
  5.67    186.54    17.88   481456    37.14    37.14  memset
  4.93    202.09    15.55   254910    61.00    61.00  mips1_FlushDCache
  4.92    217.62    15.52  4506101     3.44     3.44  _splset
  2.46    225.37     7.75   208336    37.22   228.37  genfs_getpages
  2.35    232.78     7.41   384781    19.25    67.52  pmap_enter
  1.49    237.49     4.71   128066    36.79    67.37  ffs_blkpref
  1.40    241.89     4.41   805046     5.47     5.47  lockmgr
  0.92    244.80     2.91   128613    22.63   282.16  uvm_fault
  0.85    247.48     2.67  4105781     0.65     0.65  _splraise
  0.77    249.89     2.42    56281    42.96   168.49  ncr53c9x_intr
  0.73    252.19     2.29    72122    31.79    59.70  uvm_map
  0.72    254.45     2.27   192041    11.80   249.56  ubc_init
  0.71    256.70     2.25   593612     3.78    37.01  pmap_remove
  0.68    258.84     2.14    64035    33.37   324.01  ubc_alloc

and now we can see that our mystery interrupt-blocker was mcount().


next is memset(), which is mostly being called from genfs_getpages()
to zero pages we're allocating for the file.  this is pretty silly,
but I have some ideas on how to avoid it.  I already knew about this
and it's on my list of things to fix.
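the basic idea is to skip the zeroing whenever the caller is about to
overwrite the whole page anyway.  a minimal sketch (the helper and the
"will_overwrite" flag are made up for illustration; the real plumbing
would be agreed between ubc_alloc() and genfs_getpages()):

	#include <sys/param.h>
	#include <sys/systm.h>

	/*
	 * only zero a freshly-allocated page if the caller might read it
	 * before writing it; for a full-page overwrite the memset() is
	 * pure wasted work.
	 */
	static void
	prepare_new_page(vaddr_t kva, boolean_t will_overwrite)
	{

		if (will_overwrite)
			return;
		memset((void *)kva, 0, PAGE_SIZE);
	}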


next is mips1_FlushDCache(), mostly called from pmap_remove_pv() ->
pmap_remove() -> pmap_enter().  pmap_enter() is called about 2/3
of the time from uvm_pagermapin() and the other 1/3 of the time
from ubc_fault().  now the interesting thing here is that pmap_enter()
called from uvm_pagermapin() or ubc_fault() should only ever be overwriting
an invalid mapping, never changing one valid mapping to a different one,
so it shouldn't be necessary to flush anything.  the calls to pmap_remove()
in pmap_enter() are:

	if (pmap == pmap_kernel()) {
		...
		if (mips_tlbpfn_to_paddr(pte->pt_entry) != pa) {
			pmap_remove(pmap, va, va + NBPG);
			...
		}
		...
		return 0;
	}

	if (mips_tlbpfn_to_paddr(pte->pt_entry) != pa) {
		pmap_remove(pmap, va, va + NBPG);
		...
	}

if the previous mapping was invalid, then we should skip the pmap_remove()
just like we do if the new mapping is the same as the old one.
could you try making that change and see what difference it makes
in cpu consumption?
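concretely, I mean something like this in both places (just a sketch;
I'm assuming a valid-bit test along the lines of mips_pg_v() on the old
PTE, spell it however your new pmap spells it):

	/*
	 * only tear down the old mapping if there really was a valid one
	 * pointing at a different physical page; an invalid PTE has
	 * nothing worth flushing.
	 */
	if (mips_pg_v(pte->pt_entry) &&
	    mips_tlbpfn_to_paddr(pte->pt_entry) != pa) {
		pmap_remove(pmap, va, va + NBPG);
		...
	}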


next is _splset(), which is called from

                0.36    0.00  105696/4506101     uvm_pagefree [55]
                0.44    0.00  128068/4506101     bdwrite [56]
                0.44    0.00  128174/4506101     uvm_pagealloc_strat [54]
                0.44    0.00  128181/4506101     genfs_getpages <cycle 4> [12]
                0.44    0.00  128294/4506101     malloc [88]
                0.44    0.00  128669/4506101     trap [13]
                0.44    0.00  128736/4506101     free [106]
                0.57    0.00  165242/4506101     softclock [72]
                0.71    0.00  205957/4506101     getblk [42]
                0.71    0.00  205970/4506101     bremfree [73]
                0.71    0.00  206054/4506101     brelse [69]
                0.81    0.00  234005/4506101     pmap_page_protect [49]
                0.94    0.00  271678/4506101     dec_3max_init [32]
                1.18    0.00  342811/4506101     uvm_unlock_fpageq [76]
                1.32    0.00  383520/4506101     pmap_remove_pv [22]
                1.33    0.00  384781/4506101     pmap_enter_pv [46]
                2.18    0.00  632441/4506101     uvm_pagelookup [44]
[27]     6.7   15.52    0.00 4506101         _splset [27]

uvm_pagelookup() uses splvm() to protect uvm.hashlock, but I don't think
this is necessary since we shouldn't be changing the identity of pages
in interrupt handlers.  we can try removing all the spls that are protecting
uvm.hashlock.
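for uvm_pagelookup() that amounts to just dropping the spl bracket,
something like this (reconstructed from memory, so the field and macro
names may be a bit off):

	struct vm_page *
	uvm_pagelookup(struct uvm_object *obj, voff_t off)
	{
		struct vm_page *pg;
		struct pglist *buck;

		buck = &uvm.page_hash[uvm_pagehash(obj, off)];

		/* was: s = splvm(); */
		simple_lock(&uvm.hashlock);
		for (pg = TAILQ_FIRST(buck); pg != NULL;
		     pg = TAILQ_NEXT(pg, hashq)) {
			if (pg->uobject == obj && pg->offset == off)
				break;
		}
		simple_unlock(&uvm.hashlock);
		/* was: splx(s); */
		return (pg);
	}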

pmap_enter_pv() and pmap_remove_pv() use splvm() to protect the pv list.
this wouldn't be necessary if the mips pmap actually implemented
pmap_kenter_pa() the right way.  I haven't seen your new pmap so I don't
know if you've already done something about it.
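in case it helps, here's roughly the shape I mean, a sketch only:
no pv-list bookkeeping at all, since kenter'd mappings are unmanaged
by definition.  the PTE bit names (MIPS1_PG_V etc.), kvtopte() and
mips_paddr_to_tlbpfn() are my guesses at the mips spelling, and I'm
ignoring the mips3 case and any TLB update that might be needed:

	void
	pmap_kenter_pa(vaddr_t va, paddr_t pa, vm_prot_t prot)
	{
		pt_entry_t *pte;
		u_int npte;

		/*
		 * build the PTE directly and skip pmap_enter_pv(), so
		 * pmap_remove_pv() and the cache flushes it drags in
		 * never happen for these mappings.
		 */
		npte = mips_paddr_to_tlbpfn(pa) | MIPS1_PG_V | MIPS1_PG_G;
		if (prot & VM_PROT_WRITE)
			npte |= MIPS1_PG_D;	/* "dirty" == writable */

		pte = kvtopte(va);
		pte->pt_entry = npte;
	}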

uvm_unlock_fpageq() uses splvm() to protect the uvm.fpageqlock,
which is actually necessary in the current implementation of the
page free lists since we allow allocation of pages from interrupts.
we could try to improve this but let's leave that for later.

dec_3max_init() is really dec_3max_intr(); I don't know if anything
can be improved there.

pmap_page_protect() uses splvm() to protect the pv list again,
and it's called from uvm_mk_pcluster() and uvmpd_scan_inactive().
in the test that I used for these measurements, each page should be
looked at twice after it's filled with data:  once to write it to disk
in the async flush in ffs_write(), and again to free it in the pagedaemon.
these correspond to the two callers of pmap_page_protect(), in that order.
the rewrite of the pageout code which I'm about to start on will change the
pmap_page_protect() during a pageout into a pmap_clear_modify(), so that
will get rid of those calls.  the pagedaemon will always need to remove
a page's mappings before it can safely free it, so that call will stay.
looking at the case in this particular test again, by the time the pagedaemon
considers a page for freeing, that page should already have no mappings
(since the only mapping was in pager_map, and that would already have been
recycled).  the code in pmap_page_protect() is:

	case VM_PROT_READ:
	case VM_PROT_READ|VM_PROT_EXECUTE:
		pv = pa_to_pvh(pa);
		s = splvm();
		/*
		 * Loop over all current mappings setting/clearing as appropos.
		 */
		if (pv->pv_pmap != NULL) {
			for (; pv; pv = pv->pv_next) {
				va = pv->pv_va;

				/*
				 * XXX don't write protect pager mappings
				 */
				if (va >= uvm.pager_sva && va < uvm.pager_eva)
					continue;
				pmap_protect(pv->pv_pmap, va, va + PAGE_SIZE,
					prot);
			}
		}
		splx(s);
		break;

	/* remove_all */
	default:
		pv = pa_to_pvh(pa);
		s = splvm();
		while (pv->pv_pmap != NULL) {
			pmap_remove(pv->pv_pmap, pv->pv_va,
				    pv->pv_va + PAGE_SIZE);
		}
		splx(s);
	}

in both of these cases, we block interrupts before we know whether there's
any work to do.  we should change this to avoid calling spl* if the pv list
is empty.
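for the remove_all case that would be something like this (the
read-protect case gets the same check):

	/* remove_all */
	default:
		pv = pa_to_pvh(pa);
		if (pv->pv_pmap == NULL)
			break;	/* no mappings, nothing to block interrupts for */
		s = splvm();
		while (pv->pv_pmap != NULL) {
			pmap_remove(pv->pv_pmap, pv->pv_va,
				    pv->pv_va + PAGE_SIZE);
		}
		splx(s);
	}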


the next block of callers of _splset() is the buffer cache and malloc(),
which I'm not going to worry about right now.  after that is
genfs_getpages(), which uses splbio() to protect bufpool.
it would be a lot better if the pool code didn't require the caller
to provide interrupt protection like this.  we can think about this,
but it's not that much of a problem yet.
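the pattern in question, schematically:

	struct buf *bp;
	int s;

	s = splbio();			/* caller provides the protection */
	bp = pool_get(&bufpool, PR_WAITOK);
	splx(s);

it would be nicer if pool_get() raised and dropped the ipl itself,
based on something recorded at pool_init() time.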


ok, back to the top level of the stack, the next-highest cpu pig is
genfs_getpages(), which I'll have to look at optimizing.


next is pmap_enter(), which (as I said above) is called about 2/3
of the time from uvm_pagermapin() and the other 1/3 of the time
from ubc_fault().  uvm_pagermapin() itself is pretty bogus,
since most hardware either does physically-addressed DMA or
else uses a separate iommu, so the only reason we have it
is because the device strategy interface uses a virtual address
to specify the memory it should operate on.  we should change
block-based devices to operate on physical pages rather than
virtual addresses, which is a big project.  this is also on my list
of things to fix eventually, but I wouldn't mind if someone
beat me to it.
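just to illustrate the direction (the names here are invented, this
isn't a real interface): the strategy entry point would take the pages
and a byte range directly, so uvm_pagermapin() isn't needed for
hardware that does physical DMA or has an iommu:

	#include <sys/types.h>

	struct device;
	struct vm_page;

	/* hypothetical page-based strategy interface */
	void	page_strategy(struct device *, struct vm_page **pgs,
		    int npages, off_t offset, size_t len);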


next is ffs_blkpref(), which uses 50% more cpu in -current than
it did in 1.5 even though we call it the same number of times.
I'd guess this is TLB- and cache-thrashing, which is also on my list
of things to look at.


next is lockmgr().  1/3 of the calls to lockmgr are to protect pager_map,
so those will go away if we change the device strategy interface as above.
another 1/3 of the calls are to protect the kernel_map around ubc_fault(),
which is pretty silly.  I've put a new item on my list to look into that.
(actually I just got an idea of what to do about it; it shouldn't be hard
to get rid of those calls.)  the other 1/3 protect the block map of the file,
which we need to keep.


next is uvm_fault(), which I think I can eliminate as well.
(the major usage of it in this test, not the function.  :-) )


next is _splraise(), which has identical usage to _splset() above.
the same optimizations apply.


next is ncr53c9x_intr(), which I'll leave to someone else to look at.


next is uvm_map(), which is called almost exclusively by uvm_pagermapin()
in this test.  see above.


next is ubc_init() which is really ubc_fault(), and that should go away
with the same change that eliminates uvm_fault().


next is pmap_remove(), which we mostly already talked about.


last is ubc_alloc(), which we need to keep, but I'll look into speeding it up.



so I think that about covers it.  to summarize, the things we should do are:


1.  avoid zeroing pages in genfs_getpages()			[chuq]
2.  avoid unnecessary calls to pmap_remove() in pmap_enter()	[mips]
3.  remove spl protection from uvm.hashlock			[chuq]
4.  implement pmap_kenter_pa() so that it doesn't use
    pv lists							[mips]
5.  avoid blocking interrupts in various pmap operations
    when the pv list is empty					[mips]
6.  try to speed up genfs_getpages(), etc			[chuq]
7.  change device strategy interface to eliminate need for
    pager_map in most cases					[????/chuq]
8.  reduce TLB and cache thrashing				[chuq]
9.  change the way UBC handles large writes to eliminate
    a bunch of silly work we do now.				[chuq]


so it appears that most of the work is mine, but it'd be nice if you
could take care of the mips-specific bits.  hopefully these changes
for the write case will help the read case as well.

I'd also like to revert the change to profile.h that obscured the first
set of -current results; does anyone have a problem with that?

-Chuck


On Sat, Mar 17, 2001 at 11:11:16AM +0900, Toru Nishimura wrote:
> Hi, all.
> 
> I've built -current kernel 1.5S with UBC usage balancing code and ran
> bonnie disk I/O test.  The UBC kernel used to be simply unusable with
> such heavy I/O being done.  My MAXINE got stuck waiting for
> something in uvn_xxx() _for ever_.  The 1.5S kernel behaves differently
> and system is useful this time.
> 
> However, the performance figure disappointed me.  With a 1995-vintage
> 1GB NEC SCSI drive, I got;
> 
>         771579 bytes/second for writing the file
>         1625699 bytes/second for reading the file
> 
> I well remember that write performance used to be 2 times better when
> MI 53C9x SCSI driver was under debug with the hardware. It is worse
> than my AHA1540B (of 1.3.3/i386) with the same HD model. 
> 
> Do people have affirmatives or no?
> 
> Tohru Nishimura