netbsd-help: Interpreting system activity (or troubleshooting performance)

Subject: Interpreting system activity (or troubleshooting performance)
To: None <netbsd-help@netbsd.org>
From: Jeremy C. Reed <reed@reedmedia.net>
List: netbsd-help
Date: 02/22/2001 13:16:05
The vmstat and iostat manpages say to also see the sections starting with
``Interpreting system activity'' in Installing and Operating 4.3BSD. Where
can I find this?

I do have the 4.4BSD version, but it doesn't have this section; it does
have a section (6.5) "Monitoring system performance", but it doesn't help
much.

I am trying to figure out why one of my servers which averaged a .08 load
for several months is now averaging over 1.0 for most of the day (and
maybe longer). I have heard that load averages don't mean or represent
much, but I am interested in the sudden change.

The system activity is very little[1], but the load average is now similar
to when it was slashdotted a few times. This 1.4.2 i386 system hasn't been
changed. Same binaries for almost a year. Uptime is 183 days.

To troubleshoot this, I have done:

top -- to quickly see any high CPU or memory usages
ps auxwwwww -- to further look for any abnormal processes
looked at my apache logs to see if there was any major activity (which I
would have seen via ps anyways)
vmstat[2]
systat
systat vmstat
iostat
fstat
netstat

But I don't have any numbers to compare with (other than from other
different systems).

Also, I don't know what should be considered good or bad amounts; for
example for vmstat:
procs r, b, w; page flt, re, pi, po, fr; disks; faults in sy;
cpu us, sy, id. (How would I know what is good or bad?)
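
To make the column list above concrete, here is a minimal Python sketch
that pairs each vmstat column name with the value printed beneath it. It
is only an illustration: parse_vmstat_row and the "sy_2" key are names
made up for this example, the header is the one from footnote [1], and
SAMPLE_ROW holds invented avm and fre values. MERGED_ROW is a row as my
vmstat actually printed it, with avm and fre run together.

```python
def parse_vmstat_row(header, row):
    """Pair each vmstat column name with the value beneath it."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        # Happens when two columns run together, e.g. avm and fre.
        raise ValueError(
            f"{len(names)} column names but {len(values)} values")
    seen = {}
    fields = {}
    for name, value in zip(names, values):
        # "sy" appears twice: system calls under faults, then system
        # time under cpu. The repeat is stored as "sy_2".
        seen[name] = seen.get(name, 0) + 1
        key = name if seen[name] == 1 else f"{name}_{seen[name]}"
        fields[key] = int(value)
    return fields


HEADER = "r b w avm fre flt re pi po fr sr c0 s0 s1 f0 in sy cs us sy id"

# Hypothetical row with avm and fre separated (those two values are made up).
SAMPLE_ROW = "0 0 0 1000 2000 2 0 0 0 0 0 0 0 1 0 111 24 6 0 0 100"

# Row as vmstat printed it on this machine, with avm and fre merged.
MERGED_ROW = "0 0 0 5780945072 2 0 0 0 0 0 0 0 1 0 111 24 6 0 0 100"

fields = parse_vmstat_row(HEADER, SAMPLE_ROW)
print(fields["in"], fields["cs"], fields["id"])
```

Passing MERGED_ROW instead raises ValueError, because the merged memory
columns leave one value fewer than there are column names.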

I do know with one problem system with a load average above 8, it had:
 2 processes waiting for run time
 8 blocks per second received
 328 interrupts per second
 374 context switches per second
 82 % user use of CPU time

I read this interesting article[3] about using sar, netstat, lockstat and
top to find system bottlenecks. But the article was vague when it came to
numbers to compare with. For example, it used words like: "heavy %usr and
%sys", "%wio is high", "many pgscan/s", "relatively low numbers if not
zero" and "High numbers" -- but no real numbers.

Any ideas what "heavy", "high", "many" and "relatively low" mean? Any
examples?

Or is the only way to record the vmstat (and other) results while the
system is healthy and then compare against them later?
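
The baseline idea could be sketched roughly as follows in Python. This is
an assumption-laden illustration, not an established method: the function
names build_baseline and flag_deviations are invented here, and the
cutoff of three standard deviations is an arbitrary choice. The healthy
samples are the "in" and "cs" columns from footnote [1]; the later
reading reuses the interrupt and context-switch rates from the problem
system mentioned earlier, purely to exercise the code.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise healthy-period samples as {column: (mean, stdev)}."""
    baseline = {}
    for column in samples[0]:
        values = [sample[column] for sample in samples]
        baseline[column] = (mean(values), stdev(values))
    return baseline


def flag_deviations(baseline, current, threshold=3.0):
    """Return {column: (value, baseline_mean)} for columns far from baseline.

    "Far" means more than `threshold` standard deviations from the mean;
    the default of 3.0 is an arbitrary assumption.
    """
    flagged = {}
    for column, (avg, sd) in baseline.items():
        value = current[column]
        if sd == 0:
            # Baseline never varied, so any change at all is reported.
            if value != avg:
                flagged[column] = (value, avg)
        elif abs(value - avg) / sd > threshold:
            flagged[column] = (value, avg)
    return flagged


# Interrupts (in) and context switches (cs) from the five idle samples.
healthy = [
    {"in": 111, "cs": 6},
    {"in": 117, "cs": 3},
    {"in": 105, "cs": 3},
    {"in": 107, "cs": 4},
    {"in": 105, "cs": 3},
]
baseline = build_baseline(healthy)

# Hypothetical later reading, borrowed from the loaded system's rates.
later = {"in": 328, "cs": 374}
print(sorted(flag_deviations(baseline, later)))
```

Five samples is far too few for a trustworthy baseline; in practice the
healthy period would need to cover at least a full daily cycle.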

Are any numbers or percentages from an entirely different system's vmstat
results useful for comparison with this one? (If so, which ones?)

Can anyone teach me what tools I should use and specifically what I
should look for? (What is useful? What is important to notice?)

Also, can anyone explain "context switches" in relation to performance?

   Jeremy C. Reed
   http://www.reedmedia.net/

[1]
bsdtoday:~$ vmstat -c 5 -w 5
 procs   memory     page                       disks         faults   cpu
 r b w   avm   fre  flt  re  pi   po   fr   sr c0 s0 s1 f0   in   sy  cs us sy id
 0 0 0  5780945072    2   0   0    0    0    0  0  0  1  0  111   24   6  0  0 100
 0 0 0  5780945072    1   0   0    0    0    0  0  0  0  0  117   12   3  0  0 100
 0 0 0  5780945072    1   0   0    0    0    0  0  0  0  0  105   12   3  0  0 100
 0 0 0  5780945072    1   0   0    0    0    0  0  0  3  0  107   14   4  0  0 100
 0 0 0  5780945072    1   0   0    0    0    0  0  0  0  0  105   12   3  0  0 100

[2] My vmstat has some reporting problems. For example, look at the memory
columns in the example above: the avm (active virtual memory) and fre
(free) values are merged. My dmesg reports: "avail mem = 994770944".

Also, when I use vmstat with a disk name it still shows the other disks;
for example, "vmstat sd1" and "vmstat /dev/sd1" still show "sr c0 s0 s1
f0" when they should display just "s1". Plus it doesn't report an error
if I use "JUNK" as the name, so I don't even know if this
"vmstat [ disks ]" works at all.

I need to send-pr these.

[3]
http://www.sunworld.com/sunworldonline/swol-10-2000/swol-1013-bottlenecks.html