Current-Users archive


Re: recurring tstile hangs on -current



Hi Frank!

I checked some process states in ddb.

"master", the two "bjam" processes, and at least one "cp" hanging in tstile have this backtrace:
sleepq_block()
turnstile_block()
rw_vector_enter()
genfs_lock()
VOP_LOCK()
vn_lock()
namei_tryemulroot()
namei()
check_exec()
execve_loadvm()
execve1()
syscall()

These look quite similar to your backtraces.

The "cp" hanging in biolock has:
sleepq_block
cv_timedwait
bbusy
getblk
bio_doread
ffs_init_vnode
ffs_newvnode
vcache_new
ufs_makeinode
ufs_create
VOP_CREATE
vn_open
do_open
do_sys_openat
sys_open
syscall

I can't agree with the statement that it's a general -current problem
-- my current working machine does not have this issue. It "only" has
32GB and 12 cores though, and no nvme. dmesg attached.

Do you see the issue on machines without nvme? Just to eliminate that.
(I wanted to try replacing the nvme boot disk next.)
 Thomas


On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> Hi Thomas,
> 
> glad that this is observed elsewhere.
> 
> Maybe the following bugs could resonate with your observations:
> 
> kern/54207 [serious/high]:
>         -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
> looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> 
> kern/54210 [serious/high]:
>         NetBSD-8 processes presumably not exiting
> not tested with -current, but may be there too. (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, ~380GB)
> 
> At this time I am not too confident that -current can reliably do a pkgsrc build, though I have occasionally seen bulk builds that did finish.
> Most of the time I run into hard lockups with no information about the system state available (no console, no X, no network, no DDB).
> 
> Frank
> 
> 
> On 06/28/19 10:46, Thomas Klausner wrote:
> > Hi!
> > 
> > I've set up a new machine for bulk building. I have tried various
> > things, but in the end it always hangs in tstile.
> > 
> > First try was what I currently use: tmpfs sandboxes with nullfs
> > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > nullfs' fault. (The same setup works fine on my current machine.)
> > 
> > The second try was tmpfs with copied-in /bin, /lib, ... and
> > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > hung. So the suspicion was that tmpfs or NFS are broken.
> > 
> > The last try was building in the root file system, i.e. not even a
> > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > finished one pkgsrc build (where some packages didn't build because of
> > missing distfiles, or because they randomly break like rust). When I
> > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> > 
> > I have the following systat output:
> > 
> >      2 users    Load  8.78  7.19  3.62                  Fri Jun 28 04:27:32
> > 
> > Proc:r  d  s        Csw  Traps SysCal  Intr   Soft  Fault     PAGING   SWAPPING
> >      24    10       7548 265849 157956  3504   2399 265476     in  out   in  out
> >                                                          ops
> >    56.2% Sy   1.2% Us   0.0% Ni   0.0% In  42.5% Id    pages
> > |    |    |    |    |    |    |    |    |    |    |
> > ============================>                                         670 forks
> >                                                                            fkppw
> > Anon       294104    %   zero 62161268      5572 Interrupts               fksvm
> > Exec        14116    %   wired   16296      1968 TLB shootdown            pwait
> > File     24587740  18%   inact   43756       100 cpu0 timer               relck
> > Meta      2606694    %   bufs   495676           msi1 vec 0               rlkok
> >   (kB)        real   swaponly      free         9 msix2 vec 0              noram
> > Active   24835908            100033996         9 msix2 vec 1        57262 ndcpy
> > Namei         Sys-cache     Proc-cache           msix2 vec 2        27906 fltcp
> >      Calls     hits    %     hits     %      3427 ioapic1 pin 12     87178 zfod
> >     125076   122834   98       80     0        59 ioapic2 pin 0      35775 cow
> >                                                   msix7 vec 0         8192 fmin
> >    Disks:   seeks   xfers   bytes   %busy                            10922 ftarg
> >       ld0            1969  16130K    34.8                                  itarg
> >       dk0            1969  16130K    34.8                                  flnan
> >       wd0                                                                  pdfre
> >       dk1                                                                  pdscn
> >       dk2
> > 
> > and this from top:
> > 
> > load averages:  5.13,  6.53,  3.56;               up 1+16:08:05     04:28:13
> > 59 processes: 2 runnable, 55 sleeping, 2 on CPU
> > CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
> > Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
> > Swap: 163G Total, 163G Free
> > 
> >    PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
> > 10353 pbulk     77    0   185M  172M select/0   0:13  4.74%  4.54% bjam
> > 12120 wiz      109    0    83M   59M tstile/1 165:46  1.46%  1.46% systat
> >      0 root       0    0     0K   93M CPU/31    35:39  0.00%  0.00% [system]
> >    219 root      85    0    32M 2676K kqueue/4   7:34  0.00%  0.00% syslogd
> > 13354 wiz       85    0    89M 4948K select/0   0:52  0.00%  0.00% sshd
> >    380 root      85    0    30M   16M pause/4    0:04  0.00%  0.00% ntpd
> > 10918 wiz       43    0    25M 2872K CPU/3      0:01  0.00%  0.00% top
> >      1 root      85    0    20M 1756K wait/29    0:01  0.00%  0.00% init
> >   5594 pbulk      0    0     0K    0K RUN/0      0:00  0.00%  0.00% bjam
> > 22861 pbulk      0    0     0K    0K RUN/0      0:00  0.00%  0.00% bjam
> >    747 root     117    0    20M 2080K tstile/8   0:00  0.00%  0.00% cron
> > 16473 pbulk    117    0    18M 1564K tstile/2   0:00  0.00%  0.00% cp
> >   9705 pbulk    117    0    15M 1564K bioloc/5   0:00  0.00%  0.00% cp
> >   7301 pbulk    117    0    15M 1560K tstile/2   0:00  0.00%  0.00% cp
> > 22971 pbulk    117    0    19M 1520K tstile/1   0:00  0.00%  0.00% cp
> > 10013 pbulk    117    0    15M 1520K tstile/1   0:00  0.00%  0.00% cp
> >   3411 pbulk    117    0    15M 1520K tstile/3   0:00  0.00%  0.00% cp
> >   5212 pbulk    117    0    15M 1520K tstile/2   0:00  0.00%  0.00% cp
> >   7072 pbulk    117    0    18M 1516K tstile/2   0:00  0.00%  0.00% cp
> >   8880 pbulk    117    0    15M 1516K tstile/2   0:00  0.00%  0.00% cp
> >   5869 pbulk    117    0    15M 1516K tstile/0   0:00  0.00%  0.00% cp
> > 10159 pbulk    117    0    15M 1516K tstile/1   0:00  0.00%  0.00% cp
> > 11783 pbulk    117    0    15M 1516K tstile/7   0:00  0.00%  0.00% cp
> >   7205 pbulk    117    0    15M 1512K tstile/1   0:00  0.00%  0.00% cp
> > 18676 pbulk    109    0    15M 1516K tstile/3   0:00  0.00%  0.00% cp
> >   7802 pbulk    109    0    15M 1516K tstile/2   0:00  0.00%  0.00% cp
> >    622 pbulk    109    0    15M 1512K tstile/2   0:00  0.00%  0.00% cp
> > 29434 pbulk    109    0  9576K  680K tstile/2   0:00  0.00%  0.00% cp
> >   2686 root      85    0    86M 6824K select/2   0:00  0.00%  0.00% sshd
> > 10052 root      85    0    89M 6784K select/2   0:00  0.00%  0.00% sshd
> >    674 root      85    0    70M 5056K wait/18    0:00  0.00%  0.00% login
> > 19345 wiz       85    0    86M 4960K select/3   0:00  0.00%  0.00% sshd
> >    652 postfix   85    0    57M 4848K kqueue/4   0:00  0.00%  0.00% qmgr
> >   4466 postfix   85    0    59M 4560K kqueue/0   0:00  0.00%  0.00% pickup
> >    441 root      85    0    70M 3412K select/2   0:00  0.00%  0.00% sshd
> >    656 root      85    0    57M 3328K kqueue/0   0:00  0.00%  0.00% master
> >    278 root      85    0    45M 2232K nfsd/31    0:00  0.00%  0.00% nfsd
> >    639 root      85    0    16M 2128K pause/0    0:00  0.00%  0.00% ksh
> > 21402 root      85    0    20M 1988K wait/0     0:00  0.00%  0.00% sh
> > 23371 root      85    0    20M 1972K wait/0     0:00  0.00%  0.00% sh
> >   3940 wiz       85    0    16M 1948K pause/23   0:00  0.00%  0.00% ksh
> >   8843 wiz       85    0    16M 1948K pause/5    0:00  0.00%  0.00% ksh
> >    227 root      85    0    20M 1940K select/1   0:00  0.00%  0.00% rpcbind
> >    698 root      85    0    20M 1836K ttyraw/3   0:00  0.00%  0.00% getty
> >    542 root      85    0    20M 1832K ttyraw/2   0:00  0.00%  0.00% getty
> >    535 root      85    0    20M 1832K ttyraw/0   0:00  0.00%  0.00% getty
> >    531 root      85    0    25M 1644K kqueue/3   0:00  0.00%  0.00% inetd
> >    329 root      85    0    24M 1524K select/2   0:00  0.00%  0.00% mountd
> >    436 root      85    0    20M 1516K kqueue/2   0:00  0.00%  0.00% powerd
> > 
> > On the console I see that it's currently trying to build
> > boost-headers, so it's not even something compile-heavy.
> > 
> > The machine is still in this state and I have a PS/2 keyboard
> > attached, so let me know if you want to check something out.
> > 
> > I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
> > The kernel config is
> > 
> > include "arch/amd64/conf/GENERIC"
> > options FONT_GO_MONO12x23
> > no options FONT_BOLD16x32
> > no options FONT_BOLD8x16
> > 
> > It's a 16-core AMD Threadripper system with 128GB RAM.
> > 
> > What could go wrong here? I'm running out of ideas.
> >   Thomas
> 

