Current-Users archive
Re: recurring tstile hangs on -current
Hi Frank!
I checked some process states in ddb.
"master", the 2 "bjam" and at least one "cp" hanging in tstile have:
sleepq_block()
turnstile_block()
rw_vector_enter()
genfs_lock()
VOP_LOCK()
vn_lock()
namei_tryemulroot()
namei()
check_exec()
execve_loadvm()
execve1()
syscall()
These look quite similar to your backtraces.
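For reference, such backtraces can be collected in ddb roughly as
follows (commands quoted from memory; the PID is only an example taken
from the top output further down, and "0t" makes ddb read it as decimal):
db{0}> ps            <- lists processes with their wait channels (tstile, biolock, ...)
db{0}> bt/t 0t16473  <- stack trace of the process with that PID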
The "cp" hanging in biolock has:
sleepq_block
cv_timedwait
bbusy
getblk
bio_doread
ffs_init_vnode
ffs_newvnode
vcache_new
ufs_makeinode
ufs_create
VOP_CREATE
vn_open
do_open
do_sys_openat
sys_open
syscall
I can't agree with the statement that it's a general -current problem
-- my current working machine does not have this issue. It "only" has
32GB and 12 cores though, and no nvme. dmesg attached.
Do you see the issue on machines without nvme? Just to eliminate that.
(I wanted to try replacing the nvme boot disk next.)
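Another idea, untested here: a kernel built with
options LOCKDEBUG
should let ddb name the owner of the lock a tstile'd process is
waiting on, along the lines of
db{0}> show lock <address of the rwlock from the trace>
though LOCKDEBUG makes everything noticeably slower.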
Thomas
On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> Hi Thomas,
>
> glad that this is observed elsewhere.
>
> Maybe the following bugs relate to your observations:
>
> kern/54207 [serious/high]:
> -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
> looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
>
> kern/54210 [serious/high]:
> NetBSD-8 processes presumably not exiting
> not tested with -current, but may be there too. (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, ~380GB)
>
> At this time I am not too confident that -current can reliably complete a pkgsrc build, though I have occasionally seen bulk builds finish.
> Most of the time I run into hard lockups with no information about the system state available (no console, no X, no network, no DDB).
>
> Frank
>
>
> On 06/28/19 10:46, Thomas Klausner wrote:
> > Hi!
> >
> > I've set up a new machine for bulk building. I have tried various
> > things, but in the end it always hangs in tstile.
> >
> > First try was what I currently use: tmpfs sandboxes with nullfs
> > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > nullfs' fault. (The same setup works fine on my current machine.)
> >
> > The second try was tmpfs with copied-in /bin, /lib, ... and
> > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > hung. So the suspicion was that tmpfs or NFS are broken.
> >
> > The last try was building in the root file system, i.e. not even a
> > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > finished one pkgsrc build (where some packages didn't build because of
> > missing distfiles, or because they randomly break like rust). When I
> > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> >
> > I have the following systat output:
> >
> > 2 users Load 8.78 7.19 3.62 Fri Jun 28 04:27:32
> >
> > Proc:r d s Csw Traps SysCal Intr Soft Fault PAGING SWAPPING
> > 24 10 7548 265849 157956 3504 2399 265476 in out in out
> > ops
> > 56.2% Sy 1.2% Us 0.0% Ni 0.0% In 42.5% Id pages
> > | | | | | | | | | | |
> > ============================> 670 forks
> > fkppw
> > Anon 294104 % zero 62161268 5572 Interrupts fksvm
> > Exec 14116 % wired 16296 1968 TLB shootdown pwait
> > File 24587740 18% inact 43756 100 cpu0 timer relck
> > Meta 2606694 % bufs 495676 msi1 vec 0 rlkok
> > (kB) real swaponly free 9 msix2 vec 0 noram
> > Active 24835908 100033996 9 msix2 vec 1 57262 ndcpy
> > Namei Sys-cache Proc-cache msix2 vec 2 27906 fltcp
> > Calls hits % hits % 3427 ioapic1 pin 12 87178 zfod
> > 125076 122834 98 80 0 59 ioapic2 pin 0 35775 cow
> > msix7 vec 0 8192 fmin
> > Disks: seeks xfers bytes %busy 10922 ftarg
> > ld0 1969 16130K 34.8 itarg
> > dk0 1969 16130K 34.8 flnan
> > wd0 pdfre
> > dk1 pdscn
> > dk2
> >
> > and this from top:
> >
> > load averages: 5.13, 6.53, 3.56; up 1+16:08:05 04:28:13
> > 59 processes: 2 runnable, 55 sleeping, 2 on CPU
> > CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.9% idle
> > Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
> > Swap: 163G Total, 163G Free
> >
> > PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
> > 10353 pbulk 77 0 185M 172M select/0 0:13 4.74% 4.54% bjam
> > 12120 wiz 109 0 83M 59M tstile/1 165:46 1.46% 1.46% systat
> > 0 root 0 0 0K 93M CPU/31 35:39 0.00% 0.00% [system]
> > 219 root 85 0 32M 2676K kqueue/4 7:34 0.00% 0.00% syslogd
> > 13354 wiz 85 0 89M 4948K select/0 0:52 0.00% 0.00% sshd
> > 380 root 85 0 30M 16M pause/4 0:04 0.00% 0.00% ntpd
> > 10918 wiz 43 0 25M 2872K CPU/3 0:01 0.00% 0.00% top
> > 1 root 85 0 20M 1756K wait/29 0:01 0.00% 0.00% init
> > 5594 pbulk 0 0 0K 0K RUN/0 0:00 0.00% 0.00% bjam
> > 22861 pbulk 0 0 0K 0K RUN/0 0:00 0.00% 0.00% bjam
> > 747 root 117 0 20M 2080K tstile/8 0:00 0.00% 0.00% cron
> > 16473 pbulk 117 0 18M 1564K tstile/2 0:00 0.00% 0.00% cp
> > 9705 pbulk 117 0 15M 1564K bioloc/5 0:00 0.00% 0.00% cp
> > 7301 pbulk 117 0 15M 1560K tstile/2 0:00 0.00% 0.00% cp
> > 22971 pbulk 117 0 19M 1520K tstile/1 0:00 0.00% 0.00% cp
> > 10013 pbulk 117 0 15M 1520K tstile/1 0:00 0.00% 0.00% cp
> > 3411 pbulk 117 0 15M 1520K tstile/3 0:00 0.00% 0.00% cp
> > 5212 pbulk 117 0 15M 1520K tstile/2 0:00 0.00% 0.00% cp
> > 7072 pbulk 117 0 18M 1516K tstile/2 0:00 0.00% 0.00% cp
> > 8880 pbulk 117 0 15M 1516K tstile/2 0:00 0.00% 0.00% cp
> > 5869 pbulk 117 0 15M 1516K tstile/0 0:00 0.00% 0.00% cp
> > 10159 pbulk 117 0 15M 1516K tstile/1 0:00 0.00% 0.00% cp
> > 11783 pbulk 117 0 15M 1516K tstile/7 0:00 0.00% 0.00% cp
> > 7205 pbulk 117 0 15M 1512K tstile/1 0:00 0.00% 0.00% cp
> > 18676 pbulk 109 0 15M 1516K tstile/3 0:00 0.00% 0.00% cp
> > 7802 pbulk 109 0 15M 1516K tstile/2 0:00 0.00% 0.00% cp
> > 622 pbulk 109 0 15M 1512K tstile/2 0:00 0.00% 0.00% cp
> > 29434 pbulk 109 0 9576K 680K tstile/2 0:00 0.00% 0.00% cp
> > 2686 root 85 0 86M 6824K select/2 0:00 0.00% 0.00% sshd
> > 10052 root 85 0 89M 6784K select/2 0:00 0.00% 0.00% sshd
> > 674 root 85 0 70M 5056K wait/18 0:00 0.00% 0.00% login
> > 19345 wiz 85 0 86M 4960K select/3 0:00 0.00% 0.00% sshd
> > 652 postfix 85 0 57M 4848K kqueue/4 0:00 0.00% 0.00% qmgr
> > 4466 postfix 85 0 59M 4560K kqueue/0 0:00 0.00% 0.00% pickup
> > 441 root 85 0 70M 3412K select/2 0:00 0.00% 0.00% sshd
> > 656 root 85 0 57M 3328K kqueue/0 0:00 0.00% 0.00% master
> > 278 root 85 0 45M 2232K nfsd/31 0:00 0.00% 0.00% nfsd
> > 639 root 85 0 16M 2128K pause/0 0:00 0.00% 0.00% ksh
> > 21402 root 85 0 20M 1988K wait/0 0:00 0.00% 0.00% sh
> > 23371 root 85 0 20M 1972K wait/0 0:00 0.00% 0.00% sh
> > 3940 wiz 85 0 16M 1948K pause/23 0:00 0.00% 0.00% ksh
> > 8843 wiz 85 0 16M 1948K pause/5 0:00 0.00% 0.00% ksh
> > 227 root 85 0 20M 1940K select/1 0:00 0.00% 0.00% rpcbind
> > 698 root 85 0 20M 1836K ttyraw/3 0:00 0.00% 0.00% getty
> > 542 root 85 0 20M 1832K ttyraw/2 0:00 0.00% 0.00% getty
> > 535 root 85 0 20M 1832K ttyraw/0 0:00 0.00% 0.00% getty
> > 531 root 85 0 25M 1644K kqueue/3 0:00 0.00% 0.00% inetd
> > 329 root 85 0 24M 1524K select/2 0:00 0.00% 0.00% mountd
> > 436 root 85 0 20M 1516K kqueue/2 0:00 0.00% 0.00% powerd
> >
> > On the console I see that it's currently trying to build
> > boost-headers, so it's not even something compile-heavy.
> >
> > The machine is still in this state and I have a PS/2 keyboard
> > attached, so let me know if you want to check something out.
> >
> > I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
> > The kernel config is
> >
> > include "arch/amd64/conf/GENERIC"
> > options FONT_GO_MONO12x23
> > no options FONT_BOLD16x32
> > no options FONT_BOLD8x16
> >
> > It's a 16-core AMD Threadripper system with 128GB RAM.
> >
> > What could go wrong here? I'm running out of ideas.
> > Thomas
>