Number: 53591
Category: kern
Synopsis: [system] process uses >400% CPU on idle machine
Confidential: no
Severity: serious
Priority: high
Responsible: kern-bug-people
State: open
Class: sw-bug
Submitter-Id: net
Arrival-Date: Tue Sep 11 08:50:00 +0000 2018
Originator: Andreas Gustafsson
Release: NetBSD 8.0
Organization:
Environment:
System: NetBSD guido
Architecture: x86_64
Machine: amd64
Description:
My 12-core HP DL360 G7 system running NetBSD/amd64 8.0 has somehow gotten
itself into a state where the [system] process is using >400% CPU even
though the system is otherwise idle. "top" shows:
load averages: 0.00, 0.00, 0.80; up 1+18:48:30
51 processes: 45 sleeping, 4 stopped, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 34.8% system, 0.0% interrupt, 65.1% idle
Memory: 20G Act, 10G Inact, 348K Wired, 33M Exec, 4875M File, 62M Free
Swap:
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
0 root 0 0 0K 133M CPU/11 507:36 0.00% 353% [system]
484 pgsql 85 0 77M 4572K select/7 2:45 0.00% 0.00% postgres
6099 gson 85 0 95M 3020K select/6 0:58 0.00% 0.00% sshd
Pressing the "t" key shows that the kernel threads eating CPU are
the pgdaemon and xcall threads:
load averages: 0.00, 0.00, 0.76; up 1+18:49:12
217 threads: 49 idle, 1 runnable, 146 sleeping, 8 stopped, 1 zombie, 12 on CPU
CPU states: 0.0% user, 0.0% nice, 35.8% system, 0.0% interrupt, 64.1% idle
Memory: 20G Act, 10G Inact, 348K Wired, 33M Exec, 4875M File, 62M Free
Swap:
PID LID USERNAME PRI STATE TIME WCPU CPU NAME COMMAND
0 7 root 127 xcall/0 43:21 61.96% 61.96% xcall/0 [system]
0 22 root 127 xcall/1 42:08 47.22% 47.22% xcall/1 [system]
0 28 root 127 xcall/2 39:35 42.97% 42.97% xcall/2 [system]
0 34 root 127 RUN/3 34:54 31.59% 31.59% xcall/3 [system]
0 52 root 127 xcall/6 29:36 30.96% 30.96% xcall/6 [system]
0 58 root 127 xcall/7 28:53 29.88% 29.88% xcall/7 [system]
0 70 root 127 xcall/9 26:41 29.69% 29.69% xcall/9 [system]
0 64 root 127 xcall/8 26:46 29.49% 29.49% xcall/8 [system]
0 156 root 126 xclocv/1 92:15 29.44% 29.44% pgdaemon [system]
0 82 root 127 xcall/11 24:05 28.47% 28.47% xcall/11 [system]
0 46 root 127 xcall/5 31:20 28.12% 28.12% xcall/5 [system]
0 40 root 127 xcall/4 30:48 25.29% 25.29% xcall/4 [system]
0 76 root 127 xcall/10 24:03 25.05% 25.05% xcall/10 [system]
0 157 root 124 syncer/4 22:45 0.00% 0.00% ioflush [system]
0 158 root 125 aiodon/9 5:12 0.00% 0.00% aiodoned [system]
0 84 root 96 ipmicm/1 5:04 0.00% 0.00% ipmi [system]
484 1 pgsql 85 select/2 2:45 0.00% 0.00% - postgres
0 9 root 125 vdrain/1 1:17 0.00% 0.00% vdrain [system]
0 159 root 123 physio/0 1:12 0.00% 0.00% physiod [system]
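For context, each xcall/N thread exists to run functions that other parts of
the kernel queue to its CPU through the xcall(9) interface, so sustained CPU
use in those threads suggests something is issuing cross-calls at a very high
rate. A minimal sketch of how such a broadcast is normally issued follows;
the example_* names are made up for illustration, only xc_broadcast() and
xc_wait() are the real xcall(9) interface:

#include <sys/param.h>
#include <sys/xcall.h>

static void
example_xcfunc(void *arg1, void *arg2)
{
	/* Runs once on every CPU, in that CPU's xcall/N thread. */
}

static void
example_broadcast(void)
{
	uint64_t where;

	/* Queue example_xcfunc() for execution on every CPU ... */
	where = xc_broadcast(0, example_xcfunc, NULL, NULL);

	/* ... and sleep until all CPUs have run it. */
	xc_wait(where);
}

The pgdaemon thread is also busy and is shown waiting on "xclocv", which
looks xcall-related, so my guess is that the pgdaemon is what keeps issuing
these broadcasts, but I have not confirmed that.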
Output from "vmstat 1":
procs memory page disks faults cpu
r b avm fre flt re pi po fr sr l0 s0 in sy cs us sy id
1 8 21024468 74920 15313 1 0 0 191 532 79 44 170 11879 38629 3 3 93
0 8 21024468 74920 1 0 0 0 0 0 0 0 8 121 960529 0 36 64
0 8 21024468 74668 613 0 0 0 0 0 0 3 27 316 951463 0 37 63
0 8 21024468 74672 0 0 0 0 0 0 0 0 3 25 958574 0 37 63
0 8 21024468 74672 0 0 0 0 0 0 0 0 2 28 962733 0 35 65
0 8 21024468 74940 0 0 0 0 0 0 0 0 2 25 957158 0 36 64
0 8 21024468 74940 0 0 0 0 0 0 0 0 4 106 953688 0 37 63
I will try to avoid rebooting for 24 hours in case someone wants me to
run other diagnostics.
How-To-Repeat:
Don't know; this has only happened once so far. I had been using dtrace
shortly before, so that may be what triggered it, but I can't say for sure.
Fix: