tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Wild CPU usage times on NetBSD 5



While I run NetBSD 4 (or even earlier) most places, and everything
I compile for pkgsrc is compiled for NetBSD 4, on my laptop I have a
NetBSD 5 (5.0.1 actually) kernel installed (and 4.99.38 userland -
it was the current version of -current when I acquired it, in (I
think it was) December 2007).

Everything works correctly (well, except the azalia (or hdaudio),
which I suspect probably needs the magic Dell GPIO pin hack,
which I don't have - but also don't care about much).
(The laptop is a Dell Precision M4300 - or called XPS M4300 sometimes).

However, predicting CPU usage for any of several of my fairly
common tasks is wild - sometimes things go quick, and sometimes
they're slow.   This isn't EST related, while changing the est
frequency alters the times (obviously, if the CPU is running
faster, things happen quicker) but the maximum variation there is
about a factor of 2

I first noticed this with spamassassin (I filter my mail locally
as I process it, not as it arrives at my mail server - why is not
relevant here).   Sometimes it takes a second or so (give or take)
to process each message, other times it takes 20 seconds (this is
basically just bayyesian filtering, and the built in rules, all
network based checks are disabled - I sometimes fetch mail online, then
handle the filtering, generating replies, ... without a network
connection, then connect again to allow replies to be transmitted.)
I'm also using the spamassassin script (perl code) not spamd - that's
just laziness on my part (I set it up this way, and can't be bothered
changing it).   The problem isn't confined to spamassassin or I would
alter it, just to see if it makes a difference.

At first glance you'd think the variation in processing time would relate
to the actual messages being filtered, but it doesn't.   During some
periods (and this is unpredictable) every message gets processed in a
small number of seconds (sure they vary, but usually from < 1 sec to about 2 
or 3 secs).   When things are slow the times also vary, but from
more like 10 secs to 20 secs or so (per message).   Every message,
big or small, gets either fast or slow processing - depending on whether
the system is in its "go slow" mode or not.

This is a dual core CPU, usually when spamassassin is working, nothing
else significant is happening (the other CPU is just idling, doing
nothing at all.)   That's fine, I'm not asking to spread the load...

That was OK, I could live with that (and have done for ages now), then
I noticed another process with similar bi-polar behaviour.

I have a BIG directory into which I save stuff I download from the
internet (I should be more intelligent, and impose some hierarchy,
but again I'm lazy, I would if starting again, but adding hierarchy now
would mean going and moving many thousands of files, by hand...)

Of course, I don't expect operations on that directory to be very quick,
and that's OK, it doesn't get used all that much (and when it does, it
is fast enough - usually).

To manage it, I have a script that reads it, and builds a private
HTML page that is an index (actually several indexes, including one
of the recently added/changed files, which are the ones I'm most likely
to want, another by file type (pdf, html, ...) - it also converts the
file names from my own private encoding of arbitrary (reasonable) names
into "unix safe" file names - that is, for example, the item I want to
see in the index might be

        Fred's rebuttal of "cat -v considered harmful" (version 1)

(no, there isn't such a thing, that I know of...) but in the filesystem,
while that filename would work, and the browser could handle it OK,
when I want to copy the file using regular sh utilities, I'd need
something like apb's quote command (or my more primitive sh function
that's similar) and use it - which makes life much more difficult, so
instead, the file name is (that is, it would be, if it existed)

        Fred,,s_rebuttal_of_,oQcat_-v_considered_harmful,cQ_,opversion_1,cp.pdf

which is a string I can cut/paste into any sh command to copy the
file, or whatever - my script converts that into the string I want
to see in my index (as above - aside from the char changes, also note
the .pdf is removed).   That's just for background, the conversion is
done mostly in sh funcs, using $(var%str} type operations, with "echo"
to write to the output file - ie: essentially everything is built into
the shell, there's very little use of external commands (ls to get
the file names, and sort to order things when needed, but neither of
those matters much here).   There are tens of thousands of files to
be manipulated (I've been collecting for years...) so naturally, all
this takes a while.

OK, it's my problem, I'm the idiot with the huge directory...   The issue
for here is that while the time it takes to generate the index should just
grow slowly (as new stuff gets added and so the index gets slightly
bigger) that isn't what actually happens - the time oscillates wildly
between slow, but acceptable, and glacial (say 5 minutes or so, and about
an hour) - this looks to be just about the same kind of slowdown that
spamassassin gets (somewhere around a factor of 10).

Both of those tasks essentially run in the background, and each occupies
just one CPU (this is not a fault in one core, which CPU is used varies,
it sometimes even switches during the run of a process) so I can more
or less just ignore the problem, and do other stuff while those are
processing.

But I also see the same problem running magicpoint (which I use when I'm
teaching) - most of the time switching from one page to the next goes
just as you'd expect (essentially undetectable delay - I have all the
images, when there are any, pre-cached, so magicpoint doesn't need to
go running prcoesses for scaling etc - when that happens, things do
slow down, but that's not the issue here.)  I am seeing the same behaviour
with magicpoint as with spamassassin and my indexing sh script, sometimes
it takes 15-20 seconds to change to a new page (including when the new
page is just text).   Now this is an X application, so it is interacting
with the X server (the other two aren't), but the X server itself seems
still highly responsive while this is happening (I can create new
windows to try and monitor what's going on ... while waiting for my
new magicpoint page to appear, which all works just fine).

Now I get to explain why all this is being sent to tech-kern - the only
thing that I can see in common between these 3 applications is the kernel,
and the hardware, one is perl, one is sh (it is a NetBSD /bin/sh
script, not ksh, bash, or anything else), the other is C++ - nothing
very common there (I guess they all use libc, but so does everything else
running which isn't being affected, as far as I can tell).

What they have in common is that they each are, at least for a short time,
fairly intensive CPU users - it smells as if something is detecting a
high CPU using process, and just making it go slower - somehow.

According to systat, the CPU time is all being spent in user mode, not
in the kernel (and not niced, though that wouldn't matter, there's no
competition of note, even a nice -19 process would get as much cpu time
as it requests).   There are no unusual number of interrupts (the
interrupt rate seems quite low actually), there's very little I/O
(even during spamassassin, its index files aren't tiny, but it seems
to be fairly effective in not abusing them), during the magicpoint case
(usually) there's essentially no competition at all (the network isn't
usually even configured, I kill dhclient before it gets a chance to
configure anything - not that it would usually succeed then anyway).

I can't find a particular pattern to when this likely to strike - usually
soon after a boot all is fast, but it isn't a case of "start fast, then
get slow" as if something was being consumed (what it would be that would
affect CPU times I can't guess) - after a period of slowness, for no
conceivable reason, everything can just revert to "normal" again (with
no reboots, or anything).  Sometimes the system runs ages in "fast" mode,
other times it can be slow within 15 minutes after booting (I reboot
before I teach - suspend doesn't work for me - that is, (ACPI) suspend works
fine, but the X server can never (re-)start after a suspend, it's an
nv video chip - again, I don't mind that, booting is fast enough - in any
case, the X server needs to learn the existence of the external VGA plug
for the projector, and restarting it seems to be the only way to make it
do that - rebooting is not that much slower than just restarting X.)

Since I can't find anything in common between the applications that I
can blame (rationally) what's left is the kernel and the hardware (the
cpu itself).   The hardware is certainly possible, but it is kind of
hard to tell how the hardware can tell what process is running, and slow
down just that process, while allowing others to continue seemingly unchanged.
I guess it is possible that it might detect a period of intense activity
without any context switches, and deliberately decide to go slow, but
that's kind of stretching rationality.    Other processes seem unaffected,
and which CPU (of the two cores in the one chip) is unrelated to the
problem - once it is in "slow" mode, both CPU's run those processes slowly
(even if the cpu in use switches while the process is running), while
other processes seem unaffected (X server, xterms, systat when I decide to try
and work out what's happening, ...).

If it isn't the CPU, what's left is the kernel - it certainly knows what
process is running, and how much cpy time it is consuming, and it has
control of the stats, so while everything tells me the cpu time is all
user time, if it is the kernel, it could be lying...

Does anyone have any idea what this might be, or even what else I should
try doing to debug this - I can try a current kernel, but I don't really
want to run that, the 64 bit time change is mostly invisible, but to process
accounting it isn't, and with a 32 bit sa, and 32 bit summary files,
processing 64 bit time acct files just makes a mess ...   I have a hack
in my startup scripts that leave accounting off if they detect a 5.99.*
kernel is running, so testing is fine, but I actually make use of the
accounting stuff - in fact I am going to include some "lastcomm" output
below as an example of what I see (the unpredictable nature of this makes
it difficult to know whether "ran current for 4 hours and nothing went
slow" would mean anything or not, if that happened.)

Also, if anyone else sees anything like this, that would be useful to
know (but unless we start seeing lots of "me too" messages, which I doubt,
there's no need to tell me that it does not happen to you!)

My current guess might be something along the lines of some internal state
causing very poor cache utilisation, or something like that, but I don't
know how to measure that.

Thanks,

kre

If you don't want boring details, stop reading now, the remaining half of
this message isn't interesting...   On the other hand, if there was something
you needed to know about my setup, what's below might answer that.

For info, this is what "cpuctl identify 0" tells me ...

cpu0: Intel Core 2 (Merom) (686-class), 1995.13 MHz, id 0x6fd
cpu0: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 0xbfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features 0xbfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 
0xe3bd<SSE3,DTES64,MONITOR,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
cpu0: features3 0x20100000<XD,EM64T>
cpu0: "Intel(R) Core(TM)2 Duo CPU     T7250  @ 2.00GHz"
cpu0: I-cache 32KB 64B/line 8-way, D-cache 32KB 64B/line 8-way
cpu0: L2 cache 2MB 64B/line 8-way
cpu0: ITLB 128 4KB entries 4-way
cpu0: DTLB 256 4KB entries 4-way, 32 4MB entries 4-way
cpu0: Initial APIC ID 0
cpu0: Cluster/Package ID 0
cpu0: Core ID 0
cpu0: family 06 model 0f extfamily 00 extmodel 00

(cpu 1 is just the same, of course).  I run it as i386, not amd64.

machdep.est.frequency.target = 1600
machdep.est.frequency.current = 1600
machdep.est.frequency.available = 2200 2000 1800 1600 1400 1200

I run at 1600 (set in sysctl.conf), rather than 2000, as this is a fairly
warm environment, and at high speeds, things seem to get quite hot...

The "2200" in the "available" output looks like a bug, attempting to
set that as "target" works, but "current "just goes to 2000 (this is a
2GHz processor, not 2.2GHz).   I booted a FreeBSD live CD, just for fun
one day, and its equivalent sysctl didn't mention 2200, it told me,
correctly I think, that 2000 was the fastest available speed.   But
that's a very minor bug...

uanme -a result ...

NetBSD epsilon.noi.kre.to 5.0.1 NetBSD 5.0.1 (EPSILON-1.2-20090731) #8: Sat Aug 
 1 02:22:55 ICT 2009  
kre%jade.coe.psu.ac.th@localhost:/usr/obj/5/kernels/EPSILON i386

Last, here's a lastcomm | grep perl, with some blocks of similar
lines edited out to make the result smaller, and with some annotations
added to explain what I see.   perl run on ttyp2 is spamassassin run
in my office (at work I fetch mail manually when I have time to think
about it), perl on no tty at or just after 20 mins past an hour is
mail processing - spamassassin - run from cron, that happens only when
I'm home (the script tests, and just exits otherwise) which fetches
mail when (mostly) the laptop is just sitting around doing nothing much
else...   Any other perl is most likely unrelated.

perl       -       kre      ttyp2      1.67 secs Wed Nov 25 18:54 (0:00:11.81)
perl       -       kre      ttyp2      1.61 secs Wed Nov 25 18:54 (0:00:11.67)
perl       -       kre      ttyp2      1.59 secs Wed Nov 25 18:54 (0:00:11.70)
        [ lines elided here, nothing different, this is the most recent
          incoming mail processing I did, as of when I created this output,
          all fast ... ]
perl       -       kre      ttyp2      1.58 secs Wed Nov 25 18:50 (0:00:11.67)
perl       -       kre      ttyp2      1.59 secs Wed Nov 25 18:49 (0:00:11.67)
perl       -       kre      ttyp2      1.64 secs Wed Nov 25 18:49 (0:00:12.08)
perl       -       kre      ttyp2     12.83 secs Wed Nov 25 18:06 (0:00:23.11)
perl       -       kre      ttyp2     12.92 secs Wed Nov 25 18:06 (0:00:23.12)
perl       -       kre      ttyp2     12.86 secs Wed Nov 25 18:05 (0:00:23.17)
        [ more lines deleted, this lot, around 18:00, was the previous
          time I fetched mail ]
perl       -       kre      ttyp2     13.16 secs Wed Nov 25 18:01 (0:00:23.31)
perl       -       kre      ttyp2     26.19 secs Wed Nov 25 18:00 (0:00:36.44)
perl       -       kre      ttyp2     13.28 secs Wed Nov 25 18:00 (0:00:23.42)
perl       -       kre      ttyp2     13.12 secs Wed Nov 25 17:59 (0:00:23.25)
perl       -       kre      ttyp2     26.30 secs Wed Nov 25 17:59 (0:00:36.53)
perl       -       kre      ttyp2     13.33 secs Wed Nov 25 17:58 (0:00:23.42)
perl       -       kre      ttyp2     14.05 secs Wed Nov 25 17:58 (0:00:24.41)
perl       -       kre      ttyp2     12.95 secs Wed Nov 25 17:58 (0:00:23.20)
perl       -       kre      ttyp2     13.06 secs Wed Nov 25 17:57 (0:00:23.19)
perl       -       kre      ttyp2     12.70 secs Wed Nov 25 17:57 (0:00:22.81)
perl       -       kre      ttyp2     13.95 secs Wed Nov 25 17:56 (0:00:24.08)
perl       -       kre      ttyp2     13.03 secs Wed Nov 25 17:56 (0:00:23.14)
perl       -       kre      ttyp2     12.97 secs Wed Nov 25 17:56 (0:00:23.11)
perl       -       kre      ttyp2     13.80 secs Wed Nov 25 17:55 (0:00:23.98)
perl       -       kre      ttyp2     12.94 secs Wed Nov 25 17:55 (0:00:23.09)
perl       -       kre      ttyp2     10.08 secs Wed Nov 25 17:54 (0:00:20.19)
perl       -       kre      ttyp2     13.62 secs Wed Nov 25 17:54 (0:00:23.73)
perl       -       kre      ttyp2      5.61 secs Wed Nov 25 16:52 (0:00:15.83)
perl       -       kre      ttyp2      4.66 secs Wed Nov 25 16:52 (0:00:12.11)
perl       -       kre      ttyp2      3.67 secs Wed Nov 25 16:51 (0:00:13.77)
perl       -       kre      ttyp2      3.42 secs Wed Nov 25 16:51 (0:00:13.55)
perl       -       kre      ttyp2      4.06 secs Wed Nov 25 16:51 (0:00:13.67)
        [ quite a few lines deleted, this (16:30-16:52) was soon after a reboot,
          after I returned to my office after teaching today, and fetched
          several hours worth of mail - all processed in fast mode ]
perl       -       kre      ttyp2      1.56 secs Wed Nov 25 16:32 (0:00:11.66)
perl       -       kre      ttyp2      1.70 secs Wed Nov 25 16:31 (0:00:11.92)
perl       -       kre      ttyp2      1.66 secs Wed Nov 25 16:31 (0:00:12.23)
perl       -       kre      ttyp2      1.70 secs Wed Nov 25 16:31 (0:00:11.83)
perl       -       kre      ttyp2      1.61 secs Wed Nov 25 16:31 (0:00:11.69)
perl       -       kre      ttyp2     13.33 secs Wed Nov 25 16:30 (0:00:25.41)
        [ No idea why that singleton at 13 secs, whether that was "slow mode"
          or just one weird message spamassassin had problems with, that one,
          the very first message processed in this block, is odd - but I'd
          treat it as an outlier, and ignore it...
          Between 13:00 and 14:30 and 15:00 and 16:00 I was teaching,
          the system shutdown and rebooted three times, shutdown, move,
          teach, shutdown, move, teach, shutdown, move to office, and reboot
          magicpoint was (sometimes) very slow - especially late in the
          first class ]
perl       -       kre      __         1.66 secs Wed Nov 25 12:20 (0:00:01.69)
perl       -       kre      __         1.61 secs Wed Nov 25 12:20 (0:00:01.66)
perl       -       kre      __         1.70 secs Wed Nov 25 12:20 (0:00:01.77)
perl       -       kre      __         1.23 secs Wed Nov 25 12:20 (0:00:01.25)
perl       -       kre      __         1.72 secs Wed Nov 25 12:20 (0:00:01.84)
perl       -       kre      __         1.25 secs Wed Nov 25 12:20 (0:00:01.27)
        [lots deleted during this period, this was the laptop at home,
         processing mail an hour's worth at a time - the system had been
         running for about 38 hours or so, it stayed home all Tuesday,
         without rebooting during that period - there would have been
         slow processing periods during that, but none from when the
         process accounting file was regenerated by the nightly script]
perl       -       kre      __         1.62 secs Wed Nov 25 10:20 (0:00:01.66)
perl       -       kre      __         1.61 secs Wed Nov 25 10:20 (0:00:01.64)
perl       -       kre      __         1.61 secs Wed Nov 25 10:20 (0:00:01.61)
perl       -       kre      __         1.92 secs Wed Nov 25 10:20 (0:00:01.97)
        [more deleted]
perl       -       kre      __         1.72 secs Wed Nov 25 09:20 (0:00:01.77)
perl       -       kre      __         1.62 secs Wed Nov 25 09:20 (0:00:01.67)
perl       -       kre      __         1.67 secs Wed Nov 25 09:20 (0:00:01.70)
perl       -       kre      __         1.66 secs Wed Nov 25 09:20 (0:00:01.72)
perl       -       kre      __         2.34 secs Wed Nov 25 09:20 (0:00:02.34)
perl       -       kre      __         1.23 secs Wed Nov 25 09:20 (0:00:01.25)
perl       -       kre      __         1.56 secs Wed Nov 25 09:20 (0:00:01.58)
perl       -       kre      __         1.69 secs Wed Nov 25 09:20 (0:00:01.72)
perl       -       kre      __         1.59 secs Wed Nov 25 09:20 (0:00:01.64)
perl       -       kre      __         1.61 secs Wed Nov 25 08:21 (0:00:01.66)
perl       -       kre      __         1.25 secs Wed Nov 25 08:21 (0:00:01.27)
perl       -       kre      __         2.56 secs Wed Nov 25 08:21 (0:00:02.73)
perl       -       kre      __         1.83 secs Wed Nov 25 08:21 (0:00:01.92)
perl       -       kre      __         2.34 secs Wed Nov 25 08:21 (0:00:02.36)
perl       -       kre      __         1.25 secs Wed Nov 25 08:21 (0:00:01.27)
perl       -       kre      __         1.64 secs Wed Nov 25 08:21 (0:00:01.66)
perl       -       kre      __         1.59 secs Wed Nov 25 08:21 (0:00:01.64)
perl       -       kre      __         1.64 secs Wed Nov 25 08:21 (0:00:01.66)
perl       -       kre      __         1.62 secs Wed Nov 25 08:21 (0:00:01.67)
perl       -       kre      __         1.23 secs Wed Nov 25 08:21 (0:00:01.25)
        [more deleted]
perl       -       kre      __         1.59 secs Wed Nov 25 07:20 (0:00:01.61)
perl       -       kre      __         1.61 secs Wed Nov 25 07:20 (0:00:01.64)
perl       -       kre      ttyp5      1.39 secs Wed Nov 25 07:10 (0:00:01.41)
perl       -       kre      ttyp5      1.39 secs Wed Nov 25 07:08 (0:00:01.42)
perl       -       kre      ttyp5      1.56 secs Wed Nov 25 07:08 (0:00:01.61)
        [those ttyp5 perls have to be something else, no idea what now,
         ... irrelevant]
perl       -       kre      __         2.27 secs Wed Nov 25 06:20 (0:00:03.02)
perl       -       kre      __         1.69 secs Wed Nov 25 06:20 (0:00:01.70)
perl       -       kre      __         1.59 secs Wed Nov 25 06:20 (0:00:01.62)
perl       -       kre      __         1.66 secs Wed Nov 25 06:20 (0:00:01.69)
perl       -       kre      __         1.64 secs Wed Nov 25 06:20 (0:00:01.67)
perl       -       kre      __         1.59 secs Wed Nov 25 06:20 (0:00:01.64)
        [lots more deleted]
perl       -       kre      __         1.58 secs Wed Nov 25 03:22 (0:00:01.61)
perl       -       kre      __         1.64 secs Wed Nov 25 03:22 (0:00:01.66)
perl       -       kre      __         1.64 secs Wed Nov 25 03:22 (0:00:01.77)
perl       -       kre      __         1.61 secs Wed Nov 25 03:22 (0:00:01.62)
perl       -       kre      __         1.67 secs Wed Nov 25 03:22 (0:00:02.33)
        [previous mail processing would have been 02:20 approx, in the
         previous day's accounting file]

spamassassin is run via procmail, but neither procmail, nor formail
(which procmail also runs) nor the nmh rcvstore (which procmail sometimes
runs) ever seem to run particularly slowly - not that they do very much
(but some of the messages - in particular netbsd list messages - go into
a very big mail directory, it currently has 70K messages approx, and
probably another couple of thousand ,N messages - nmh "deleted" (rmm'd)
but not yet gone from the filesys - just finding the next message number
to use, must take some cpu time (which would be rcvstore cpu time.)

Here's a complete ttyp2 process list (from lastcomm) during a part of the
"slow" period above) ...

procmail   -       kre      ttyp2      0.09 secs Wed Nov 25 17:55 (0:00:24.19)
procmail   -F      kre      ttyp2      0.00 secs Wed Nov 25 17:55 (0:00:24.03)
perl       -       kre      ttyp2     13.80 secs Wed Nov 25 17:55 (0:00:23.98)
formail    -       kre      ttyp2      0.03 secs Wed Nov 25 17:55 (0:00:00.03)
procmail   -       kre      ttyp2      0.03 secs Wed Nov 25 17:55 (0:00:00.09)
formail    -       kre      ttyp2      0.03 secs Wed Nov 25 17:55 (0:00:00.03)
procmail   -       kre      ttyp2      0.06 secs Wed Nov 25 17:55 (0:00:00.47)
rcvstore   -       kre      ttyp2      0.31 secs Wed Nov 25 17:55 (0:00:00.34)
formail    -       kre      ttyp2      0.03 secs Wed Nov 25 17:55 (0:00:00.03)
procmail   -       kre      ttyp2      0.09 secs Wed Nov 25 17:55 (0:00:00.16)
formail    -       kre      ttyp2      0.03 secs Wed Nov 25 17:55 (0:00:00.03)
procmail   -       kre      ttyp2      0.09 secs Wed Nov 25 17:55 (0:00:23.30)
procmail   -F      kre      ttyp2      0.00 secs Wed Nov 25 17:55 (0:00:23.14)
perl       -       kre      ttyp2     12.94 secs Wed Nov 25 17:55 (0:00:23.09)
formail    -       kre      ttyp2      0.03 secs Wed Nov 25 17:55 (0:00:00.03)
procmail   -       kre      ttyp2      0.12 secs Wed Nov 25 17:54 (0:00:44.31)
rcvstore   -       kre      ttyp2      0.09 secs Wed Nov 25 17:55 (0:00:00.11)
procmail   -F      kre      ttyp2      0.00 secs Wed Nov 25 17:54 (0:00:20.22)
perl       -       kre      ttyp2     10.08 secs Wed Nov 25 17:54 (0:00:20.19)
procmail   -F      kre      ttyp2      0.00 secs Wed Nov 25 17:54 (0:00:23.78)

Note that not all incoming messages are sent to spamassassin, stuff like
mailing lists (like the netbsd lists) are just filed away, I only check for
spam on mail that is likely to in fact be spam.  Note that only perl seems
affected by the slowdown - unless it is just that the others use so little time
that even at 10 times as slow they aren't detectable - that's possible
here, but if everything was 10 times as slow as "normal" in a slow period,
I think I'd have noticed more other processes going slow, than the three I do
observe - it is just those three I am aware of.





Home | Main Index | Thread Index | Old Index