tech-kern archive


Re: Random lockups on an email server - possibly kern/50168



On Sun, 3 Apr 2016 09:51:08 -0400
"D'Arcy J.M. Cain" <darcy%NetBSD.org@localhost> wrote:
> Meanwhile, my system crashed again.  I have taken to rebooting every
> morning (better a controlled five minute down time than a minimum half
> hour crash).  Here is what was on the screen when it locked up.

Based on discussions with David Maxwell I took out the daily reboot and
left crash(8) running in a screen(1) session.  The idea was that if crash
was already running when the system hung, I could still run some commands.

Today it hung again.  Here's the output of top when it hung:

load averages:  0.33,  0.31,  0.55;               up 2+21:36:26        08:11:40
494 processes: 461 sleeping, 31 zombie, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 19G Act, 9272M Inact, 11M Wired, 86M Exec, 26G File, 8584K Free
Swap: 32G Total, 32G Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    0 root       0    0     0K   45M CPU/14    27:17  0.00%  0.00% [system]
  597 root     117    0    24M 2252K tstile/1   1:12  0.00%  0.00% syslogd
29434 root     117    0    25M   14M tstile/8   1:04  0.00%  0.00% rsync
  673 root      43    0    18M 3380K CPU/15     0:39  0.00%  0.00% top
15161 root      85    0    12M 2124K kqueue/1   0:18  0.00%  0.00% log
 1713 postgrey  85    0    83M   21M select/3   0:17  0.00%  0.00% perl
  234 mailman  117    0   129M   37M tstile/1   0:16  0.00%  0.00% python2.7
 1796 mailman  117    0   122M   25M tstile/1   0:16  0.00%  0.00% python2.7
22368 druid     85    0    16M 5024K kqueue/1   0:16  0.00%  0.00% imap
 2943 mailman  117    0   124M   30M tstile/1   0:15  0.00%  0.00% python2.7
 2469 mailman  117    0   115M   17M tstile/1   0:15  0.00%  0.00% python2.7
21549 root      85    0    16M 6824K kqueue/5   0:15  0.00%  0.00% config
26849 root     117    0    89M   55M tstile/1   0:14  0.00%  0.00% auth
  235 mailman  117    0   124M   29M tstile/1   0:14  0.00%  0.00% python2.7
  233 mailman  117    0   115M   16M tstile/1   0:14  0.00%  0.00% python2.7
 3024 mailman  117    0   115M   16M tstile/2   0:14  0.00%  0.00% python2.7
16888 darcy     85    0    16M 5048K kqueue/0   0:12  0.00%  0.00% imap
16363 www       85    0   354M   38M flt_no/8   0:11  0.00%  0.00% httpd
14358 www       85    0   355M   35M kqueue/1   0:11  0.00%  0.00% httpd
 1532 root      85    0    22M   10M pause/3    0:11  0.00%  0.00% ntpd
 2245 root      85    0    48M 2472K kqueue/0   0:10  0.00%  0.00% master
25121 root      85    0    12M 1940K flt_no/1   0:10  0.00%  0.00% dovecot
21209 www       85    0   355M   34M semwai/1   0:08  0.00%  0.00% httpd
19179 root      85    0    78M 7324K select/8   0:06  0.00%  0.00% sshd
18999 gogo2    117    0    17M 5000K tstile/1   0:05  0.00%  0.00% imap
27442 www       85    0   353M   33M semwai/9   0:05  0.00%  0.00% httpd
13590 www       85    0   351M   29M semwai/5   0:05  0.00%  0.00% httpd
 2430 darcy     85    0    20M 2156K select/0   0:04  0.00%  0.00% screen-4.3.1
 2807 jbelknap 117    0    19M 7716K tstile/0   0:03  0.00%  0.00% imap
  160 root      85    0   337M   26M select/8   0:03  0.00%  0.00% httpd
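
A rough way to summarise a listing like the one above is to tally
processes by wait channel (the part of the STATE column before the
slash).  The following is a minimal Python sketch, not something that
was run on the affected machine.  The names tally_wait_channels and
LINE_RE are illustrative, and the pattern assumes the column layout
shown in the listing (PID USERNAME PRI NICE SIZE RES STATE ...).

```python
import re
from collections import Counter

# Assumed layout of a top(1) process line, as in the listing above:
#   PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
# STATE looks like "tstile/1" or "CPU/14": wait channel, slash, CPU number.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(-?\d+)\s+(-?\d+)\s+(\S+)\s+(\S+)"
    r"\s+([^\s/]+)(?:/(\d+))?\s+"
)


def tally_wait_channels(top_output):
    """Count process lines per wait channel, ignoring the CPU suffix."""
    counts = Counter()
    for line in top_output.splitlines():
        match = LINE_RE.match(line)
        if match:
            # Group 7 is the STATE text before the optional "/<cpu>".
            counts[match.group(7)] += 1
    return counts
```

Passing the saved top output to tally_wait_channels() and printing
counts.most_common() would show how many processes are parked in tstile
relative to the other wait channels.  Header and summary lines do not
match the pattern and are skipped.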

Crash didn't help.  When I pressed enter it dumped ps output to the
screen, probably from the last command I had run while the system was
still up.  Here is as much of that output as screen's scrollback kept.

0      129 3   4       200   fffffe813ac685e0          coretemp1 coretemp1
0      128 3  10       200   fffffe813ac68a00          coretemp0 coretemp0
0      127 3  11       200   fffffe813ac3f1a0              ciss0 ciss0
0      118 3   0       200   fffffe813ab61140               pms0 pmsreset
0      117 3   0       200   fffffe813ab61560            atabus5 atath
0      116 3   0       200   fffffe813ab61980            atabus4 atath
0      115 3   1       200   fffffe813ab44120            atabus3 atath
0      114 3   1       200   fffffe813ab44540            atabus2 atath
0      113 3   0       200   fffffe813ab44960            atabus1 atath
0      112 3   0       200   fffffe813aa7e100            atabus0 atath
0      111 3   0       200   fffffe813aa7e520         usbtask-dr usbtsk
0      110 3   0       200   fffffe813aa7e940         usbtask-hc usbtsk
0      109 3   0       200   fffffe813a8720e0           scsibus0 sccomp
0      108 3   1       200   fffffe813a872500           lnxsyswq lnxsyswq
0      107 3   4       200   fffffe813a872920               ipmi ipmipoll
0      106 3  15       200   fffffe813a7f20c0           xcall/15 xcall
0      105 1  15       200   fffffe813a7f24e0         softser/15
0      104 1  15       200   fffffe813a7f2900         softclk/15
0      103 1  15       200   fffffe813a7db0a0         softbio/15
0      102 1  15       200   fffffe813a7db4c0         softnet/15
0      101 1  15       201   fffffe813a7db8e0            idle/15
0      100 3  14       200   fffffe813a7ce080           xcall/14 xcall
0       99 1  14       200   fffffe813a7ce4a0         softser/14
0       98 1  14       200   fffffe813a7ce8c0         softclk/14
0       97 1  14       200   fffffe813a7b9060         softbio/14
0       96 1  14       200   fffffe813a7b9480         softnet/14
0    >  95 7  14       201   fffffe813a7b98a0            idle/14
0       94 3  13       200   fffffe813a7aa040           xcall/13 xcall
0       93 1  13       200   fffffe813a7aa460         softser/13
0       92 1  13       200   fffffe813a7aa880         softclk/13
0       91 1  13       200   fffffe813a795020         softbio/13
0       90 1  13       200   fffffe813a795440         softnet/13
0    >  89 7  13       201   fffffe813a795860            idle/13
0       88 3  12       200   fffffe813a776000           xcall/12 xcall
0       87 1  12       200   fffffe813a776420         softser/12
0       86 1  12       200   fffffe813a776840         softclk/12
0       85 1  12       200   fffffe813a757360         softbio/12
0       84 1  12       200   fffffe813a757780         softnet/12
0    >  83 7  12       201   fffffe813a757ba0            idle/12
0       82 3  11       200   fffffe813a752340           xcall/11 xcall
0       81 1  11       200   fffffe813a752760         softser/11
0       80 1  11       200   fffffe813a752b80         softclk/11
0       79 1  11       200   fffffe813a75c320         softbio/11
0       78 1  11       200   fffffe813a75c740         softnet/11
0    >  77 7  11       201   fffffe813a75cb60            idle/11
0       76 3  10       200   fffffe813a736300           xcall/10 xcall
0       75 1  10       200   fffffe813a736720         softser/10
0       74 1  10       200   fffffe813a736b40         softclk/10
0       73 1  10       200   fffffe813a70f2e0         softbio/10
0       72 1  10       200   fffffe813a70f700         softnet/10
0    >  71 7  10       201   fffffe813a70fb20            idle/10
0       70 3   9       200   fffffe813a70a2c0            xcall/9 xcall
0       69 1   9       200   fffffe813a70a6e0          softser/9
0       68 1   9       200   fffffe813a70ab00          softclk/9
0       67 1   9       200   fffffe813a70b2a0          softbio/9
0       66 1   9       200   fffffe813a70b6c0          softnet/9
0    >  65 7   9       201   fffffe813a70bae0             idle/9
0       64 3   8       200   fffffe813a6fe280            xcall/8 xcall
0       63 1   8       200   fffffe813a6fe6a0          softser/8
0       62 1   8       200   fffffe813a6feac0          softclk/8
0       61 1   8       200   fffffe813a6e8260          softbio/8
0       60 1   8       200   fffffe813a6e8680          softnet/8
0    >  59 7   8       201   fffffe813a6e8aa0             idle/8
0       58 3   7       200   fffffe813a6b2240            xcall/7 xcall
0       57 1   7       200   fffffe813a6b2660          softser/7
0       56 1   7       200   fffffe813a6b2a80          softclk/7
0       55 1   7       200   fffffe813a6c3220          softbio/7
0       54 1   7       200   fffffe813a6c3640          softnet/7
0    >  53 7   7       201   fffffe813a6c3a60             idle/7
0       52 3   6       200   fffffe813a6b6200            xcall/6 xcall
0       51 1   6       200   fffffe813a6b6620          softser/6
0       50 1   6       200   fffffe813a6b6a40          softclk/6
0       49 1   6       200   fffffe813a6a01e0          softbio/6
0       48 1   6       200   fffffe813a6a0600          softnet/6
0    >  47 7   6       201   fffffe813a6a0a20             idle/6
0       46 3   5       200   fffffe813a67a1c0            xcall/5 xcall
0       45 1   5       200   fffffe813a67a5e0          softser/5
0       44 1   5       200   fffffe813a67aa00          softclk/5
0       43 1   5       200   fffffe813a6831a0          softbio/5
0       42 1   5       200   fffffe813a6835c0          softnet/5
0    >  41 7   5       201   fffffe813a6839e0             idle/5
0       40 3   4       200   fffffe813a66f180            xcall/4 xcall
0       39 1   4       200   fffffe813a66f5a0          softser/4
0       38 1   4       200   fffffe813a66f9c0          softclk/4
0       37 1   4       200   fffffe813a65f160          softbio/4
0       36 1   4       200   fffffe813a65f580          softnet/4
0    >  35 7   4       201   fffffe813a65f9a0             idle/4
0       34 3   3       200   fffffe813a629140            xcall/3 xcall
0       33 1   3       200   fffffe813a629560          softser/3
0       32 1   3       200   fffffe813a629980          softclk/3
0       31 1   3       200   fffffe813a61a120          softbio/3
0       30 1   3       200   fffffe813a61a540          softnet/3
0    >  29 7   3       201   fffffe813a61a960             idle/3
0       28 3   2       200   fffffe813a62d100            xcall/2 xcall
0       27 1   2       200   fffffe813a62d520          softser/2
0       26 1   2       200   fffffe813a62d940          softclk/2
0       25 1   2       200   fffffe813a6130e0          softbio/2
0       24 1   2       200   fffffe813a613500          softnet/2
0    >  23 7   2       201   fffffe813a613920             idle/2
0       22 3   1       200   fffffe813a6050c0            xcall/1 xcall
0       21 1   1       200   fffffe813a6054e0          softser/1
0       20 1   1       200   fffffe813a605900          softclk/1
0       19 1   1       200   fffffe813a5e80a0          softbio/1
0       18 1   1       200   fffffe813a5e84c0          softnet/1
0       17 1   1       201   fffffe813a5e88e0             idle/1
0       16 3   0       200   fffffe8836ef4080             sysmon smtaskq
0       15 3   0       200   fffffe8836ef44a0         pmfsuspend pmfsuspend
0       14 3   6       200   fffffe8836ef48c0           pmfevent pmfevent
0       13 3   0       200   fffffe883af10060         sopendfree sopendfr
0       12 3   0       200   fffffe883af10480           nfssilly nfssilly
0       11 3  11       200   fffffe883af108a0            cachegc cachegc
0       10 3   4       200   fffffe883df18040              vrele vrele
0        9 3  15       200   fffffe883df18460             vdrain vdrain
0        8 3   3       200   fffffe883df18880          modunload mod_unld
0        7 3   0       200   fffffe883df24020            xcall/0 xcall
0        6 1   0       200   fffffe883df24440          softser/0
0        5 1   0       200   fffffe883df24860          softclk/0
0        4 1   0       200   fffffe883df2a000          softbio/0
0        3 1   0       200   fffffe883df2a420          softnet/0
0        2 1   0       201   fffffe883df2a840             idle/0
0        1 3   7       200   ffffffff810345a0            swapper uvm

I tried doing ps/n|more and crash just hung.

I was able to get someone to plug in a monitor and keyboard.  He read
this off the screen.

07:56:55 smaug dovecot: imap (eref): fatal: master: service (imap):
   child 11193 killed with signal 6 (core not dumped) set service
   imap (drop_priv_before_exec=yes)
08:07:09 smaug dovecot: imap (eref): panic: file imap-client.c: line 841
   (client_check_command_hangs): assertion failed:
(!have_wait_unfinished || unfinished_count > 0)
08:07:09 smaug dovecot: imap (eref): fatal: master: service (imap):
   child 4798 killed with signal 6 (core not dumped) set service imap
   (drop_priv_before_exec=yes)

I am going to look at those sources but I suspect that this is a
symptom, not a cause.

I had the on-site person press <CTRL><ALT><ESC> but it did not drop
into the debugger.

-- 
D'Arcy J.M. Cain <darcy%NetBSD.org@localhost>
http://www.NetBSD.org/ IM:darcy%Vex.Net@localhost

