Current-Users archive


Re: recurring tstile hangs on -current



Hi Thomas,

Glad to hear that this is being observed elsewhere too.

The following bug reports may be related to your observations:

kern/54207 [serious/high]:
        -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
Looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)

kern/54210 [serious/high]:
        NetBSD-8 processes presumably not exiting
Not tested with -current, but it may be present there too. (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, ~380GB)

At this time I am not confident that -current can reliably complete a pkgsrc build, though I have occasionally seen bulk builds that did finish.
Most of the time I run into hard lockups with no information about the system state available (no console, no X, no network, no DDB).

Frank


On 06/28/19 10:46, Thomas Klausner wrote:
Hi!

I've set up a new machine for bulk building. I have tried various
things, but in the end it always hangs in tstile.

First try was what I currently use: tmpfs sandboxes with nullfs
mounted /bin, /lib, ... When it hung, the suspicion was that it's
nullfs' fault. (The same setup works fine on my current machine.)

The second try was tmpfs with copied-in /bin, /lib, ... and
NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
hung. So the suspicion was that tmpfs or NFS are broken.

The last try was building in the root file system, i.e. not even a
sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
finished one pkgsrc build (where some packages didn't build because of
missing distfiles, or because they randomly break like rust). When I
restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.

I have the following systat output:

     2 users    Load  8.78  7.19  3.62                  Fri Jun 28 04:27:32

Proc:r  d  s        Csw  Traps SysCal  Intr   Soft  Fault     PAGING   SWAPPING
     24    10       7548 265849 157956  3504   2399 265476     in  out   in  out
                                                         ops
   56.2% Sy   1.2% Us   0.0% Ni   0.0% In  42.5% Id    pages
|    |    |    |    |    |    |    |    |    |    |
============================>                                         670 forks
                                                                           fkppw
Anon       294104    %   zero 62161268      5572 Interrupts               fksvm
Exec        14116    %   wired   16296      1968 TLB shootdown            pwait
File     24587740  18%   inact   43756       100 cpu0 timer               relck
Meta      2606694    %   bufs   495676           msi1 vec 0               rlkok
  (kB)        real   swaponly      free         9 msix2 vec 0              noram
Active   24835908            100033996         9 msix2 vec 1        57262 ndcpy
Namei         Sys-cache     Proc-cache           msix2 vec 2        27906 fltcp
     Calls     hits    %     hits     %      3427 ioapic1 pin 12     87178 zfod
    125076   122834   98       80     0        59 ioapic2 pin 0      35775 cow
                                                  msix7 vec 0         8192 fmin
   Disks:   seeks   xfers   bytes   %busy                            10922 ftarg
      ld0            1969  16130K    34.8                                  itarg
      dk0            1969  16130K    34.8                                  flnan
      wd0                                                                  pdfre
      dk1                                                                  pdscn
      dk2

and this from top:

load averages:  5.13,  6.53,  3.56;               up 1+16:08:05        04:28:13
59 processes: 2 runnable, 55 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
Swap: 163G Total, 163G Free

   PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
10353 pbulk     77    0   185M  172M select/0   0:13  4.74%  4.54% bjam
12120 wiz      109    0    83M   59M tstile/1 165:46  1.46%  1.46% systat
     0 root       0    0     0K   93M CPU/31    35:39  0.00%  0.00% [system]
   219 root      85    0    32M 2676K kqueue/4   7:34  0.00%  0.00% syslogd
13354 wiz       85    0    89M 4948K select/0   0:52  0.00%  0.00% sshd
   380 root      85    0    30M   16M pause/4    0:04  0.00%  0.00% ntpd
10918 wiz       43    0    25M 2872K CPU/3      0:01  0.00%  0.00% top
     1 root      85    0    20M 1756K wait/29    0:01  0.00%  0.00% init
  5594 pbulk      0    0     0K    0K RUN/0      0:00  0.00%  0.00% bjam
22861 pbulk      0    0     0K    0K RUN/0      0:00  0.00%  0.00% bjam
   747 root     117    0    20M 2080K tstile/8   0:00  0.00%  0.00% cron
16473 pbulk    117    0    18M 1564K tstile/2   0:00  0.00%  0.00% cp
  9705 pbulk    117    0    15M 1564K bioloc/5   0:00  0.00%  0.00% cp
  7301 pbulk    117    0    15M 1560K tstile/2   0:00  0.00%  0.00% cp
22971 pbulk    117    0    19M 1520K tstile/1   0:00  0.00%  0.00% cp
10013 pbulk    117    0    15M 1520K tstile/1   0:00  0.00%  0.00% cp
  3411 pbulk    117    0    15M 1520K tstile/3   0:00  0.00%  0.00% cp
  5212 pbulk    117    0    15M 1520K tstile/2   0:00  0.00%  0.00% cp
  7072 pbulk    117    0    18M 1516K tstile/2   0:00  0.00%  0.00% cp
  8880 pbulk    117    0    15M 1516K tstile/2   0:00  0.00%  0.00% cp
  5869 pbulk    117    0    15M 1516K tstile/0   0:00  0.00%  0.00% cp
10159 pbulk    117    0    15M 1516K tstile/1   0:00  0.00%  0.00% cp
11783 pbulk    117    0    15M 1516K tstile/7   0:00  0.00%  0.00% cp
  7205 pbulk    117    0    15M 1512K tstile/1   0:00  0.00%  0.00% cp
18676 pbulk    109    0    15M 1516K tstile/3   0:00  0.00%  0.00% cp
  7802 pbulk    109    0    15M 1516K tstile/2   0:00  0.00%  0.00% cp
   622 pbulk    109    0    15M 1512K tstile/2   0:00  0.00%  0.00% cp
29434 pbulk    109    0  9576K  680K tstile/2   0:00  0.00%  0.00% cp
  2686 root      85    0    86M 6824K select/2   0:00  0.00%  0.00% sshd
10052 root      85    0    89M 6784K select/2   0:00  0.00%  0.00% sshd
   674 root      85    0    70M 5056K wait/18    0:00  0.00%  0.00% login
19345 wiz       85    0    86M 4960K select/3   0:00  0.00%  0.00% sshd
   652 postfix   85    0    57M 4848K kqueue/4   0:00  0.00%  0.00% qmgr
  4466 postfix   85    0    59M 4560K kqueue/0   0:00  0.00%  0.00% pickup
   441 root      85    0    70M 3412K select/2   0:00  0.00%  0.00% sshd
   656 root      85    0    57M 3328K kqueue/0   0:00  0.00%  0.00% master
   278 root      85    0    45M 2232K nfsd/31    0:00  0.00%  0.00% nfsd
   639 root      85    0    16M 2128K pause/0    0:00  0.00%  0.00% ksh
21402 root      85    0    20M 1988K wait/0     0:00  0.00%  0.00% sh
23371 root      85    0    20M 1972K wait/0     0:00  0.00%  0.00% sh
  3940 wiz       85    0    16M 1948K pause/23   0:00  0.00%  0.00% ksh
  8843 wiz       85    0    16M 1948K pause/5    0:00  0.00%  0.00% ksh
   227 root      85    0    20M 1940K select/1   0:00  0.00%  0.00% rpcbind
   698 root      85    0    20M 1836K ttyraw/3   0:00  0.00%  0.00% getty
   542 root      85    0    20M 1832K ttyraw/2   0:00  0.00%  0.00% getty
   535 root      85    0    20M 1832K ttyraw/0   0:00  0.00%  0.00% getty
   531 root      85    0    25M 1644K kqueue/3   0:00  0.00%  0.00% inetd
   329 root      85    0    24M 1524K select/2   0:00  0.00%  0.00% mountd
   436 root      85    0    20M 1516K kqueue/2   0:00  0.00%  0.00% powerd

On the console I see that it's currently trying to build
boost-headers, so it's not even something compile-heavy.
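
To see at a glance how many processes in a top listing like the one above are stuck on the same wait channel, the STATE column can be tallied. The following is only a rough sketch in Python: the function name count_wait_channels and the SAMPLE excerpt are made up for illustration, and it assumes the eleven-column layout shown above, with the CPU number after the "/" in STATE.

```python
from collections import Counter


def count_wait_channels(top_output: str) -> Counter:
    """Tally the STATE column of top(1) process rows, ignoring the CPU suffix."""
    counts = Counter()
    for line in top_output.splitlines():
        fields = line.split()
        # Process rows have 11 columns and start with a numeric PID;
        # this skips the column header and the summary lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        # STATE is the 7th column, e.g. "tstile/2" -> "tstile".
        state = fields[6].split("/")[0]
        counts[state] += 1
    return counts


# A few rows copied from the listing above (not the full output).
SAMPLE = """\
  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
10353 pbulk     77    0   185M  172M select/0   0:13  4.74%  4.54% bjam
12120 wiz      109    0    83M   59M tstile/1 165:46  1.46%  1.46% systat
  747 root     117    0    20M 2080K tstile/8   0:00  0.00%  0.00% cron
16473 pbulk    117    0    18M 1564K tstile/2   0:00  0.00%  0.00% cp
 9705 pbulk    117    0    15M 1564K bioloc/5   0:00  0.00%  0.00% cp
"""

if __name__ == "__main__":
    for state, n in count_wait_channels(SAMPLE).most_common():
        print(f"{state:10s} {n}")
```

Saving the output of a batch-mode top run to a file and passing its contents to the function would give the same tally for the whole process list.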

The machine is still in this state and I have a PS/2 keyboard
attached, so let me know if you want to check something out.

I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
The kernel config is

include "arch/amd64/conf/GENERIC"
options FONT_GO_MONO12x23
no options FONT_BOLD16x32
no options FONT_BOLD8x16

It's a 16-core AMD Threadripper system with 128GB RAM.

What could go wrong here? I'm running out of ideas.
  Thomas


