Subject: ioflush is taking excessive cycles on a big-memory machine
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 04/04/2005 19:50:52
Something I've done has got my big AlphaServer (with 16GB of RAM) into a
state where the [ioflush] thread is chewing up a minimum of 25% of a CPU
(according to 'top'), and at times it jumps up to 50% or higher.  This
also makes the machine very sluggish -- just typing at the command line
can mean a multi-second lag between a keystroke and its echo (with
command-line editing).

ioflush has now, after ~50hrs uptime, accumulated more CPU time than any
other long-running process (including even ntpd).

load averages:  1.65,  1.62,  1.82                                     18:54:49
83 processes:  82 sleeping, 1 on processor
CPU states:  0.0% user,  0.0% nice, 37.6% system, 10.4% interrupt, 52.0% idle
Memory: 7189M Act, 4120M Inact, 2112K Wired, 48M Exec, 11G File, 1053M Free
Swap: 4000M Total, 4000M Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
   13 root      18    0     0K 1351M syncer/2 662:43 50.20% 50.20% [ioflush]
 5202 woods      3    0  1104K  888K ttyin/3    8:30 37.30% 37.30% systat
 5212 root      48    0    42M   42M CPU/3      3:10 23.34% 23.34% top
25675 root       2    0  2584K 1784K select/1  13:31 19.09% 19.09% sshd2
  177 root      18  -12  1280K 1288K pause/1  289:17 15.33% 15.33% ntpd
  111 root       2    0   632K  496K select/3  22:43  8.01%  8.01% syslogd
  126 dns        2    4    94M   91M select/1  74:07  4.49%  4.49% named
  272 root       2    0  3016K 2728K select/2 140:24  4.10%  4.10% httpd
 9710 cyrus      2    4  4504K  664K select/2  36:43  2.78%  2.78% idled
  190 root       2    0   224K  184K netio/1   10:07  1.46%  1.46% rwhod
 7259 root       2    0    16M   14M select/3  13:41  0.78%  0.78% emacs
  317 root       2    0  2560K 1808K select/0  20:47  0.10%  0.10% sshd2


This is after rsync'ing about 750k files (1.2GB) from another machine
and then moving them around between different filesystems a couple of
times.

However, the copying was all done over 10 hours ago, and all the dirty
buffers do seem to have been flushed by now.

Typing 'sync' causes the machine to really bog down for about a minute,
but as far as I can tell from systat or vmstat not much I/O happens.  It
just seems to take that long to scan for dirty buffers.  Also, no matter
how many times I run 'sync', the ioflush thread stays just as busy.
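
(For what it's worth, one crude way to confirm where the time goes --
just a sketch, and assuming nothing else is hammering the disks at the
same moment -- is to time the sync in one window while watching the
transfer counts in another:

	# vmstat -w 1		(in one window, watch the disk columns)
	# time sync		(in another)

If 'real' comes out close to a minute while vmstat shows next to no
disk transfers, then the time really is going into the scan and not
into I/O.)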

I currently have vm.filemax set rather high in hopes of caching as
much file data as possible, since the first shuffling of files was a
trial run and I'll need to do it all over again.

# sysctl vm | sort
vm.anonmax = 60
vm.anonmin = 40
vm.execmax = 50
vm.execmin = 40
vm.filemax = 60
vm.filemin = 5
vm.loadavg: 1.39 1.59 1.78
vm.maxslp = 20
vm.nkmempages = 32768
vm.uspace = 16384

As hinted at by 'top' above, this did have the effect I was hoping for,
with 11G of file cache in use (though I don't know why that's almost an
order of magnitude higher than the on-disk size of the data....).  This
usage is confirmed by 'systat buf' (which has the patches to display all
these numbers):


                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   ||||||||

     16384 metadata buffers using              1676960 kBytes of memory (10%).
   1416345 pages for cached file data using   11330760 kBytes of memory (69%).
      6162 pages for executables using           49296 kBytes of memory ( 0%).
     48314 pages for anon (non-file) data       386512 kBytes of memory ( 2%).
    139709 free pages                          1117672 kBytes of memory ( 7%).

File System          Bufs used   %   kB in use   %  Bufsize kB   %  Util %
/home                    14756  90       29784   2     1510104  90       2
/                         1227   7       19282   1      125704   7      15
/var/spool/imap            306   1         834   0       31416   2       3
/usr/pkg                    41   0         222   0        4200   0       5
/var                        35   0         256   0        3608   0       7
/var/log                    10   0         264   0        1016   0      26
/tmp                         1   0           1   0         104   0       1

Total:                   16376  99       50643   3     1676152 100       3


(Note that all of those filesystems, except the MFS /tmp, are mounted
with soft dependencies enabled.)

However, actually using this many filecache pages seems to have
overwhelmed ioflush -- it now needs "too many" cycles just to discover
that there's nothing for it to do.

I'm going to reduce vm.filemax to maybe 25% and vm.filemin to 1%, but I
don't hold out much hope of being able to actually limit the filecache.
(It seems a reboot is the easiest way to free the filecache :-)
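
(For the record, and assuming nothing fancier than the stock sysctl
knobs, that change amounts to something like:

	# sysctl -w vm.filemax=25
	# sysctl -w vm.filemin=1

plus the equivalent "vm.filemax=25" and "vm.filemin=1" lines in
/etc/sysctl.conf if the settings are supposed to survive a reboot.)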

Some time ago a post was made suggesting a way to force ioflush to calm
down by increasing the length of time the thread sleeps:

	http://mail-index.netbsd.org/port-mac68k/2005/01/19/0002.html

I doubt that will help here though, since the "current work" seems to
be taking longer than a second to complete even when there's no real
work to do.  And of course it won't make the search algorithm any
faster either....
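
(For anyone who hasn't looked at it, my understanding is that ioflush
is essentially the classic 4.4BSD syncer: vnodes with delayed writes
sit on a wheel of work lists, and once a second the thread walks one
slot of the wheel and pushes out whatever it finds there.  The sketch
below is a paraphrase from memory of roughly what that loop does -- it
is not the literal sys/kern code (locking, credentials, and the softdep
hooks are all glossed over), so take the details with a grain of salt:

	/*
	 * Very simplified sketch of the syncer ("ioflush") main loop,
	 * from memory -- NOT the literal NetBSD source.
	 * syncer_workitem_pending[] is a ring of vnode lists, one
	 * slot per second of delay.
	 */
	struct synclist *slp;
	struct vnode *vp;
	long starttime;

	for (;;) {
		starttime = time.tv_sec;

		/* Advance to this second's slot in the wheel. */
		slp = &syncer_workitem_pending[syncer_delayno];
		if (++syncer_delayno >= syncer_maxdelay)
			syncer_delayno = 0;

		/* Push out every vnode queued in this slot. */
		while ((vp = LIST_FIRST(slp)) != NULL) {
			if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT) == 0) {
				(void) VOP_FSYNC(vp, cred, FSYNC_LAZY,
				    0, 0, p);
				vput(vp);
			}
			/* Anything still dirty gets requeued for later. */
			if (LIST_FIRST(slp) == vp)
				vn_syncer_add_to_worklist(vp, syncdelay);
		}

		/*
		 * Sleep only if this pass took less than a second,
		 * otherwise start the next pass immediately.
		 */
		if (time.tv_sec == starttime)
			(void) tsleep(&lbolt, PPAUSE, "syncer", 0);
	}

If that's still roughly what the code does, then the cost of each pass
scales with the number of vnodes hanging off each slot, and no amount
of extra sleeping between passes will make any one pass cheaper.)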

Has anyone been able to profile ioflush (with softdep) recently?
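
(The only way I know of is the traditional BSD kernel-profiling dance --
I haven't actually tried it on this machine, so treat this as a sketch
of the usual procedure rather than a recipe, with MYKERNEL standing in
for whatever config the machine runs:

	# config -p MYKERNEL		(set up a profiling kernel build)
	... build and boot the resulting kernel ...
	# kgmon -b			(start collecting)
	... let ioflush churn away for a while ...
	# kgmon -h -p			(stop and dump gmon.out)
	# gprof /netbsd gmon.out > ioflush.prof

though I have no idea how tolerable the profiling overhead would be on
a busy production box.)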

Any other ideas how I might avoid this overhead in production?

-- 
						Greg A. Woods

H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>