Hi!
I've set up a new machine for bulk building. I have tried various
setups, but in the end the build always hangs with processes in tstile.
The first try was what I currently use: tmpfs sandboxes with
nullfs-mounted /bin, /lib, ... When it hung, the suspicion was that
nullfs is at fault. (The same setup works fine on my current machine.)
The second try was tmpfs with copied-in /bin, /lib, ... and
NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
hung. So the suspicion was that tmpfs or NFS is broken.
The last try was building in the root file system, i.e. not even a
sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
finished one pkgsrc build (where some packages didn't build because of
missing distfiles, or because they randomly break like rust). When I
restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
I have the following systat output:
    2 users    Load  8.78  7.19  3.62                  Fri Jun 28 04:27:32
Proc:r  d  s       Csw  Traps SysCal  Intr   Soft  Fault     PAGING   SWAPPING
24 10             7548 265849 157956  3504   2399 265476    in  out    in  out
                                                      ops
  56.2% Sy   1.2% Us   0.0% Ni   0.0% In  42.5% Id  pages
|    |    |    |    |    |    |    |    |    |    |
============================>                                     670 forks
                                                                      fkppw
Anon     294104   % zero  62161268   5572 Interrupts                  fksvm
Exec      14116   % wired    16296   1968 TLB shootdown               pwait
File   24587740 18% inact    43756    100 cpu0 timer                  relck
Meta    2606694   % bufs    495676        msi1 vec 0                  rlkok
(kB)       real swaponly      free      9 msix2 vec 0                 noram
Active 24835908          100033996      9 msix2 vec 1           57262 ndcpy
Namei       Sys-cache   Proc-cache        msix2 vec 2           27906 fltcp
    Calls     hits   %    hits   %   3427 ioapic1 pin 12        87178 zfod
   125076   122834  98      80   0     59 ioapic2 pin 0         35775 cow
                                          msix7 vec 0            8192 fmin
Disks: seeks xfers  bytes %busy                                 10922 ftarg
ld0           1969 16130K  34.8                                       itarg
dk0           1969 16130K  34.8                                       flnan
wd0                                                                   pdfre
dk1                                                                   pdscn
dk2
and this from top:
load averages: 5.13, 6.53, 3.56; up 1+16:08:05 04:28:13
59 processes: 2 runnable, 55 sleeping, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.9% idle
Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
Swap: 163G Total, 163G Free
  PID USERNAME PRI NICE   SIZE    RES STATE       TIME   WCPU    CPU COMMAND
10353 pbulk     77    0   185M   172M select/0    0:13  4.74%  4.54% bjam
12120 wiz      109    0    83M    59M tstile/1  165:46  1.46%  1.46% systat
    0 root       0    0     0K    93M CPU/31     35:39  0.00%  0.00% [system]
  219 root      85    0    32M  2676K kqueue/4    7:34  0.00%  0.00% syslogd
13354 wiz       85    0    89M  4948K select/0    0:52  0.00%  0.00% sshd
  380 root      85    0    30M    16M pause/4     0:04  0.00%  0.00% ntpd
10918 wiz       43    0    25M  2872K CPU/3       0:01  0.00%  0.00% top
    1 root      85    0    20M  1756K wait/29     0:01  0.00%  0.00% init
 5594 pbulk      0    0     0K     0K RUN/0       0:00  0.00%  0.00% bjam
22861 pbulk      0    0     0K     0K RUN/0       0:00  0.00%  0.00% bjam
  747 root     117    0    20M  2080K tstile/8    0:00  0.00%  0.00% cron
16473 pbulk    117    0    18M  1564K tstile/2    0:00  0.00%  0.00% cp
 9705 pbulk    117    0    15M  1564K bioloc/5    0:00  0.00%  0.00% cp
 7301 pbulk    117    0    15M  1560K tstile/2    0:00  0.00%  0.00% cp
22971 pbulk    117    0    19M  1520K tstile/1    0:00  0.00%  0.00% cp
10013 pbulk    117    0    15M  1520K tstile/1    0:00  0.00%  0.00% cp
 3411 pbulk    117    0    15M  1520K tstile/3    0:00  0.00%  0.00% cp
 5212 pbulk    117    0    15M  1520K tstile/2    0:00  0.00%  0.00% cp
 7072 pbulk    117    0    18M  1516K tstile/2    0:00  0.00%  0.00% cp
 8880 pbulk    117    0    15M  1516K tstile/2    0:00  0.00%  0.00% cp
 5869 pbulk    117    0    15M  1516K tstile/0    0:00  0.00%  0.00% cp
10159 pbulk    117    0    15M  1516K tstile/1    0:00  0.00%  0.00% cp
11783 pbulk    117    0    15M  1516K tstile/7    0:00  0.00%  0.00% cp
 7205 pbulk    117    0    15M  1512K tstile/1    0:00  0.00%  0.00% cp
18676 pbulk    109    0    15M  1516K tstile/3    0:00  0.00%  0.00% cp
 7802 pbulk    109    0    15M  1516K tstile/2    0:00  0.00%  0.00% cp
  622 pbulk    109    0    15M  1512K tstile/2    0:00  0.00%  0.00% cp
29434 pbulk    109    0  9576K   680K tstile/2    0:00  0.00%  0.00% cp
 2686 root      85    0    86M  6824K select/2    0:00  0.00%  0.00% sshd
10052 root      85    0    89M  6784K select/2    0:00  0.00%  0.00% sshd
  674 root      85    0    70M  5056K wait/18     0:00  0.00%  0.00% login
19345 wiz       85    0    86M  4960K select/3    0:00  0.00%  0.00% sshd
  652 postfix   85    0    57M  4848K kqueue/4    0:00  0.00%  0.00% qmgr
 4466 postfix   85    0    59M  4560K kqueue/0    0:00  0.00%  0.00% pickup
  441 root      85    0    70M  3412K select/2    0:00  0.00%  0.00% sshd
  656 root      85    0    57M  3328K kqueue/0    0:00  0.00%  0.00% master
  278 root      85    0    45M  2232K nfsd/31     0:00  0.00%  0.00% nfsd
  639 root      85    0    16M  2128K pause/0     0:00  0.00%  0.00% ksh
21402 root      85    0    20M  1988K wait/0      0:00  0.00%  0.00% sh
23371 root      85    0    20M  1972K wait/0      0:00  0.00%  0.00% sh
 3940 wiz       85    0    16M  1948K pause/23    0:00  0.00%  0.00% ksh
 8843 wiz       85    0    16M  1948K pause/5     0:00  0.00%  0.00% ksh
  227 root      85    0    20M  1940K select/1    0:00  0.00%  0.00% rpcbind
  698 root      85    0    20M  1836K ttyraw/3    0:00  0.00%  0.00% getty
  542 root      85    0    20M  1832K ttyraw/2    0:00  0.00%  0.00% getty
  535 root      85    0    20M  1832K ttyraw/0    0:00  0.00%  0.00% getty
  531 root      85    0    25M  1644K kqueue/3    0:00  0.00%  0.00% inetd
  329 root      85    0    24M  1524K select/2    0:00  0.00%  0.00% mountd
  436 root      85    0    20M  1516K kqueue/2    0:00  0.00%  0.00% powerd
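To see at a glance how many processes sit in each wait channel, the top
output above can be tallied by its STATE column. The following is only a
rough Python sketch: the function name tally_wait_channels is made up for
illustration, SAMPLE is a handful of rows copied from the listing above,
and it assumes STATE is the 7th whitespace-separated field in the form
"name/cpu".

```python
from collections import Counter


def tally_wait_channels(top_output: str) -> Counter:
    """Count processes per wait channel in top(1) output.

    Assumes the column order of the listing above, with STATE as the
    7th field, written as "name/cpu" (for example "tstile/2").
    """
    counts = Counter()
    in_table = False
    for line in top_output.splitlines():
        fields = line.split()
        if not in_table:
            # Skip the summary lines until the column header shows up.
            in_table = fields[:2] == ["PID", "USERNAME"]
            continue
        if len(fields) < 7:
            continue  # blank or truncated line
        state = fields[6].split("/")[0]  # drop the CPU number
        counts[state] += 1
    return counts


# A few rows copied from the top output above (not the full listing).
SAMPLE = """\
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
10353 pbulk 77 0 185M 172M select/0 0:13 4.74% 4.54% bjam
12120 wiz 109 0 83M 59M tstile/1 165:46 1.46% 1.46% systat
747 root 117 0 20M 2080K tstile/8 0:00 0.00% 0.00% cron
16473 pbulk 117 0 18M 1564K tstile/2 0:00 0.00% 0.00% cp
9705 pbulk 117 0 15M 1564K bioloc/5 0:00 0.00% 0.00% cp
"""

if __name__ == "__main__":
    for state, n in tally_wait_channels(SAMPLE).most_common():
        print(f"{n:3d} {state}")
```

Feeding it the complete listing instead of SAMPLE would give the
per-channel totals for the whole machine.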
On the console I see that it's currently trying to build
boost-headers, so it's not even something compile-heavy.
The machine is still in this state and I have a PS/2 keyboard
attached, so let me know if you want to check something out.
I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
The kernel config is:
	include "arch/amd64/conf/GENERIC"
	options FONT_GO_MONO12x23
	no options FONT_BOLD16x32
	no options FONT_BOLD8x16
It's a 16-core AMD Threadripper system with 128GB RAM.
What could be going wrong here? I'm running out of ideas.
Thomas