tech-kern archive


Re: Wild CPU usage times on NetBSD 5



    Date:        Wed, 25 Nov 2009 10:12:54 -0800
    From:        buhrow%lothlorien.nfbcal.org@localhost (Brian Buhrow)
    Message-ID:  <200911251812.nAPICstK012464%lothlorien.nfbcal.org@localhost>

  |     Hello.  In reading this message, it looks to me like you're getting
  | stuck in the tstile bug, or, at least, something similar.

Thanks for the comment - possible - I'd seen the messages about that,
but mostly only superficially followed the discussion (on a "not relevant
here and nothing I can do about it" basis).

But ...

  | The commonality
  | between all your applications is the kernel, but it's also the filesystem.

Yes, maybe - they're all certainly accessing files on the same FFS.

  | You didn't say, but I imagine, that the directories you have Magic Point
  | look in,

Magicpoint could have, perhaps, yes - the directory containing its
image files tends to be quite big (ends up with lots of variant formats
of quite a lot of image files (that is originally diagrams from xfig in
most cases) for various reasons).   The directory with the file magicpoint
is reading is nothing special (maybe a hundred files or so), but there's
no reason it would be touching that directory at all, just a read of a KB
or so from a file it already had open from there (the page source - it
contains references to the images as separate files, but everything else
is in the one file).  I must try building an embedded magicpoint file (which
builds the images into the file itself) and see if that one ever encounters
the same kind of problem.

The problem with this theory is that I suspect that when I have seen this
happen, magicpoint has been (sometimes) accessing pages with no images
at all, which should give it no reason to be accessing the filesystem
any more than reading from a single open file (and not all that much).

  | as well as Spam Assassin contain a large number of files,

That one, no, not many files in any directory spamassassin touches while
processing incoming mail - my mail directories contain huge numbers of
files, so if rcvstore, or procmail, both of which write in those were
having problems, that would make more sense, but spamassassin is just
reading a file in /tmp or, perhaps even via stdin - the procmailrc file
says "| /usr/pkg/bin/spamassassin -L", but I don't know for sure how
procmail implements that - its database directory contains just a half
dozen files (big ones, db format I think, most of them, but not many of them).

  | just as your Internet junk directory does.

Only maybe my netbsd mail directory ("folder" if you like) contains anywhere
near that number of files... and while procmail & rcvstore would go there,
spamassassin certainly never would (it never even looks at the messages
that end up there, nor in other largish, but not that large, list
directories).

  | Are you running softdeps or filesystem logging?

No, the filesystem (my home filesys) is a 100% default FFSv1.
It is about 30GB, and currently close to 90% full.  It is a 16K/2K
filesys block size, and I probably changed the default file size to
get a more rational number of inodes than you'd normally get by default
(way too many for me on this filesys).

Looking at it now, dumpfs says "sblock  FFSv2   fslevel 4" which
doesn't make much sense to me, I doubt I would have deliberately made
an FFSv2 filesys?   On the other hand it also says "magic   11954 (UFS1)"
so now I'm just confused, I need to go do some more looking in the
filesys code - it's been a long time since I was near that stuff!
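
For what it's worth, the magic number can be checked by hand. A minimal
sketch, assuming dumpfs prints the magic in hexadecimal (which the value
11954 suggests) and using the two superblock magic constants from the FFS
headers; the function name ufs_variant is made up for illustration:

```python
# Superblock magic numbers as defined in the FFS headers.
FS_UFS1_MAGIC = 0x011954
FS_UFS2_MAGIC = 0x19540119


def ufs_variant(magic_text):
    """Map the magic value printed by dumpfs (assumed hex) to a UFS variant."""
    magic = int(magic_text, 16)
    if magic == FS_UFS1_MAGIC:
        return "UFS1"
    if magic == FS_UFS2_MAGIC:
        return "UFS2"
    return "unknown"


if __name__ == "__main__":
    # The value quoted from the dumpfs output above.
    print(ufs_variant("11954"))
```

On that reading the magic agrees with the "(UFS1)" label; it says nothing
about what the "FFSv2" on the sblock line is meant to indicate.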

  | This really sounds like a filesystem issue.

What's confusing about that is that I'd expect either just "missing" time,
(that is, real time much longer than measured cpu time, with no real
explanation what was happening while we're waiting) or high system time for
a filesystem related problem - what I'm seeing is a 100% busy CPU, all
recorded as being user time.   That doesn't seem to fit - but as I don't
know what the tstile bug symptoms really are (I didn't pay that much
attention, I'm afraid, I'll go hunt in list archives now) I'm not sure.
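
To make the distinction concrete, here is a rough sketch of how one might
split a command's elapsed time into user CPU, system CPU, and time not
charged to the process at all. It assumes Python is available; the function
names account_time and describe, and the 0.5 thresholds, are arbitrary
choices for illustration only:

```python
import os
import subprocess
import sys
import time


def account_time(argv):
    """Run argv and return (real, user, system, unaccounted) in seconds."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    real = time.monotonic() - start
    after = os.times()
    # CPU time of waited-for children is accumulated in the children_* fields.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system, real - (user + system)


def describe(real, user, system):
    """Crude label for where the time went (thresholds are arbitrary)."""
    if real <= 0:
        return "too short to tell"
    if user / real > 0.5:
        return "mostly user CPU"
    if system / real > 0.5:
        return "mostly system CPU"
    return "mostly waiting (time not charged to the process)"


if __name__ == "__main__":
    real, user, system, missing = account_time(sys.argv[1:] or ["true"])
    print(f"real {real:.2f}s user {user:.2f}s sys {system:.2f}s "
          f"unaccounted {missing:.2f}s: {describe(real, user, system)}")
```

For a filesystem problem I'd expect the last or the middle label; what I'm
seeing corresponds to the first.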

Is there some way to check (while it is happening) whether this is what
it might be?   (It hasn't happened to me again since the spamassassin
slowdown around 17:50-16:00 yesterday afternoon (Wednesday, it is early
Thurs here already) but I'm sure it will again, I see this quite a lot...)
If it happens during mail processing, or my index generation (that one
is rarer, as I don't rebuild the index all that often) I could do almost
anything to look (though my kernel has no diagnostic aids built in, no
DIAGNOSTIC or DEBUG, and no ddb).

I can always build a new kernel though if that will help (I just switched,
at least for now, to a 5.0_STABLE as of 2009-11-26) - that's what I'm
running now (but only for the past few hours, certainly not long enough
to mean anything from not having seen the slowdown in this period).

  | I realize you're not
  | getting any i/o throughput, but I think I've seen similar things under
  | NetBSD-5, and they just clear up without apparent explanation as well.

Yes, that certainly fits - there certainly is some I/O, it just isn't
anything I'd think remarkable (nothing even close to what you'd see
unpacking a tar file, or probably even doing a routine compile).
During spamassassin, while it is in its CPU loop, systat can tell me
I/O rates like 1900 bytes/sec (yes, bytes, not KB or blocks or something),
though it will occasionally peak at around 10-20 KB/sec for a very short
time (probably when the database is being searched).

Thanks for the clue, it is certainly something else to consider, I never
even considered looking in that direction.

kre


