Number: 53591
Category: kern
Synopsis: [system] process uses >400% CPU on idle machine
Confidential: no
Severity: serious
Priority: high
Responsible: kern-bug-people
State: open
Class: sw-bug
Submitter-Id: net
Arrival-Date: Tue Sep 11 08:50:00 +0000 2018
Originator: Andreas Gustafsson
Release: NetBSD 8.0
Organization:
Environment:
System: NetBSD guido
Architecture: x86_64
Machine: amd64
Description:
My 12-core HP DL360 G7 system running NetBSD/amd64 8.0 has somehow gotten
itself into a state where the [system] process is using >400% CPU even
though the system is otherwise idle. "top" shows:
load averages: 0.00, 0.00, 0.80; up 1+18:48:30
51 processes: 45 sleeping, 4 stopped, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 34.8% system, 0.0% interrupt, 65.1% idle
Memory: 20G Act, 10G Inact, 348K Wired, 33M Exec, 4875M File, 62M Free
Swap:
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
0 root 0 0 0K 133M CPU/11 507:36 0.00% 353% [system]
484 pgsql 85 0 77M 4572K select/7 2:45 0.00% 0.00% postgres
6099 gson 85 0 95M 3020K select/6 0:58 0.00% 0.00% sshd
Pressing the "t" key shows that the kernel threads eating CPU are
the pgdaemon and xcall threads:
load averages: 0.00, 0.00, 0.76; up 1+18:49:12
217 threads: 49 idle, 1 runnable, 146 sleeping, 8 stopped, 1 zombie, 12 on CPU
CPU states: 0.0% user, 0.0% nice, 35.8% system, 0.0% interrupt, 64.1% idle
Memory: 20G Act, 10G Inact, 348K Wired, 33M Exec, 4875M File, 62M Free
Swap:
PID LID USERNAME PRI STATE TIME WCPU CPU NAME COMMAND
0 7 root 127 xcall/0 43:21 61.96% 61.96% xcall/0 [system]
0 22 root 127 xcall/1 42:08 47.22% 47.22% xcall/1 [system]
0 28 root 127 xcall/2 39:35 42.97% 42.97% xcall/2 [system]
0 34 root 127 RUN/3 34:54 31.59% 31.59% xcall/3 [system]
0 52 root 127 xcall/6 29:36 30.96% 30.96% xcall/6 [system]
0 58 root 127 xcall/7 28:53 29.88% 29.88% xcall/7 [system]
0 70 root 127 xcall/9 26:41 29.69% 29.69% xcall/9 [system]
0 64 root 127 xcall/8 26:46 29.49% 29.49% xcall/8 [system]
0 156 root 126 xclocv/1 92:15 29.44% 29.44% pgdaemon [system]
0 82 root 127 xcall/11 24:05 28.47% 28.47% xcall/11 [system]
0 46 root 127 xcall/5 31:20 28.12% 28.12% xcall/5 [system]
0 40 root 127 xcall/4 30:48 25.29% 25.29% xcall/4 [system]
0 76 root 127 xcall/10 24:03 25.05% 25.05% xcall/10 [system]
0 157 root 124 syncer/4 22:45 0.00% 0.00% ioflush [system]
0 158 root 125 aiodon/9 5:12 0.00% 0.00% aiodoned [system]
0 84 root 96 ipmicm/1 5:04 0.00% 0.00% ipmi [system]
484 1 pgsql 85 select/2 2:45 0.00% 0.00% - postgres
0 9 root 125 vdrain/1 1:17 0.00% 0.00% vdrain [system]
0 159 root 123 physio/0 1:12 0.00% 0.00% physiod [system]
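For context, each xcall/N thread exists to run functions that other parts of
the kernel queue to its CPU through the xcall(9) interface, so sustained CPU
use in those threads suggests something is issuing cross-calls at a very high
rate. A minimal sketch of how such a broadcast is normally issued follows;
the example_* names are made up for illustration, only xc_broadcast() and
xc_wait() are the real xcall(9) interface:

#include <sys/param.h>
#include <sys/xcall.h>

static void
example_xcfunc(void *arg1, void *arg2)
{
	/* Runs once on every CPU, in that CPU's xcall/N thread. */
}

static void
example_broadcast(void)
{
	uint64_t where;

	/* Queue example_xcfunc() for execution on every CPU ... */
	where = xc_broadcast(0, example_xcfunc, NULL, NULL);

	/* ... and sleep until all CPUs have run it. */
	xc_wait(where);
}

The pgdaemon thread is also busy and is shown waiting on "xclocv", which
looks xcall-related, so my guess is that the pgdaemon is what keeps issuing
these broadcasts, but I have not confirmed that.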
Output from "vmstat 1":
procs memory page disks faults cpu
r b avm fre flt re pi po fr sr l0 s0 in sy cs us sy id
1 8 21024468 74920 15313 1 0 0 191 532 79 44 170 11879 38629 3 3 93
0 8 21024468 74920 1 0 0 0 0 0 0 0 8 121 960529 0 36 64
0 8 21024468 74668 613 0 0 0 0 0 0 3 27 316 951463 0 37 63
0 8 21024468 74672 0 0 0 0 0 0 0 0 3 25 958574 0 37 63
0 8 21024468 74672 0 0 0 0 0 0 0 0 2 28 962733 0 35 65
0 8 21024468 74940 0 0 0 0 0 0 0 0 2 25 957158 0 36 64
0 8 21024468 74940 0 0 0 0 0 0 0 0 4 106 953688 0 37 63
I will try to avoid rebooting for 24 hours in case someone wants me to
run other diagnostics.
How-To-Repeat:
Don't know; this has only happened once so far. I had been using dtrace
shortly before, so that may be what triggered it, but I can't say for sure.
Fix: