Current-Users archive


Re: recurring tstile hangs on -current



When I tried to get a core, I saw:

> reboot 0x104

dumping to dev 168,2 (offset=73677660, size=33524130)
dump ahcisata0 port 5: clearing WDCTL_RST failed for drive 0
wddump: device timed out
i/o error


rebooting...

 Thomas

On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
> 
> I checked some process states in ddb.
> 
> "master", the 2 "bjam" and at least one "cp" hanging in tstile have:
> sleepq_block()
> turnstile_block()
> rw_vector_enter()
> genfs_lock()
> VOP_LOCK()
> vn_lock()
> namei_tryemulroot()
> namei()
> check_exec()
> execve_loadvm()
> execve1()
> syscall()
> 
> These look quite similar to your backtraces.
> 
> The "cp" hanging in biolock has:
> sleepq_block
> cv_timedwait
> bbusy
> getblk
> bio_doread
> ffs_init_vnode
> ffs_newvnode
> vcache_new
> ufs_makeinode
> ufs_create
> VOP_CREATE
> vn_open
> do_open
> do_sys_openat
> sys_open
> syscall
> 
> I can't agree with the statement that it's a general -current problem
> -- my current working machine does not have this issue. It "only" has
> 32GB and 12 cores though, and no nvme. dmesg attached.
> 
> Do you see the issue on machines without nvme? Just to eliminate that.
> (I wanted to try replacing the nvme boot disk next.)
>  Thomas
> 
> 
> On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> > Hi Thomas,
> > 
> > glad that this is observed elsewhere.
> > 
> > Maybe the following bugs resonate with your observations:
> > 
> > kern/54207 [serious/high]:
> >         -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
> > looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> > 
> > kern/54210 [serious/high]:
> >         NetBSD-8 processes presumably not exiting
> > not tested with -current, but may be there too. (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, ~380GB)
> > 
> > At this time I am not too confident that -current can reliably do a pkgsrc build, though I have occasionally seen bulk builds that did finish.
> > Most of the time I run into hard lockups with no information about the system state available (no console, no X, no network, no DDB).
> > 
> > Frank
> > 
> > 
> > On 06/28/19 10:46, Thomas Klausner wrote:
> > > Hi!
> > > 
> > > I've set up a new machine for bulk building. I have tried various
> > > things, but in the end it always hangs in tstile.
> > > 
> > > First try was what I currently use: tmpfs sandboxes with nullfs
> > > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > > nullfs' fault. (The same setup works fine on my current machine.)
> > > 
> > > The second try was tmpfs with copied-in /bin, /lib, ... and
> > > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > > hung. So the suspicion was that tmpfs or NFS are broken.
> > > 
> > > The last try was building in the root file system, i.e. not even a
> > > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > > finished one pkgsrc build (where some packages didn't build because of
> > > missing distfiles, or because they randomly break like rust). When I
> > > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> > > 
> > > I have the following systat output:
> > > 
> > >      2 users    Load  8.78  7.19  3.62                  Fri Jun 28 04:27:32
> > > 
> > > Proc:r  d  s        Csw  Traps SysCal  Intr   Soft  Fault     PAGING   SWAPPING
> > >      24    10       7548 265849 157956  3504   2399 265476     in  out   in  out
> > >                                                          ops
> > >    56.2% Sy   1.2% Us   0.0% Ni   0.0% In  42.5% Id    pages
> > > |    |    |    |    |    |    |    |    |    |    |
> > > ============================>                                         670 forks
> > >                                                                            fkppw
> > > Anon       294104    %   zero 62161268      5572 Interrupts               fksvm
> > > Exec        14116    %   wired   16296      1968 TLB shootdown            pwait
> > > File     24587740  18%   inact   43756       100 cpu0 timer               relck
> > > Meta      2606694    %   bufs   495676           msi1 vec 0               rlkok
> > >   (kB)        real   swaponly      free         9 msix2 vec 0              noram
> > > Active   24835908            100033996         9 msix2 vec 1        57262 ndcpy
> > > Namei         Sys-cache     Proc-cache           msix2 vec 2        27906 fltcp
> > >      Calls     hits    %     hits     %      3427 ioapic1 pin 12     87178 zfod
> > >     125076   122834   98       80     0        59 ioapic2 pin 0      35775 cow
> > >                                                   msix7 vec 0         8192 fmin
> > >    Disks:   seeks   xfers   bytes   %busy                            10922 ftarg
> > >       ld0            1969  16130K    34.8                                  itarg
> > >       dk0            1969  16130K    34.8                                  flnan
> > >       wd0                                                                  pdfre
> > >       dk1                                                                  pdscn
> > >       dk2
> > > 
> > > and this from top:
> > > 
> > > load averages:  5.13,  6.53,  3.56;    up 1+16:08:05    04:28:13
> > > 59 processes: 2 runnable, 55 sleeping, 2 on CPU
> > > CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
> > > Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
> > > Swap: 163G Total, 163G Free
> > > 
> > >    PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
> > > 10353 pbulk     77    0   185M  172M select/0   0:13  4.74%  4.54% bjam
> > > 12120 wiz      109    0    83M   59M tstile/1 165:46  1.46%  1.46% systat
> > >      0 root       0    0     0K   93M CPU/31    35:39  0.00%  0.00% [system]
> > >    219 root      85    0    32M 2676K kqueue/4   7:34  0.00%  0.00% syslogd
> > > 13354 wiz       85    0    89M 4948K select/0   0:52  0.00%  0.00% sshd
> > >    380 root      85    0    30M   16M pause/4    0:04  0.00%  0.00% ntpd
> > > 10918 wiz       43    0    25M 2872K CPU/3      0:01  0.00%  0.00% top
> > >      1 root      85    0    20M 1756K wait/29    0:01  0.00%  0.00% init
> > >   5594 pbulk      0    0     0K    0K RUN/0      0:00  0.00%  0.00% bjam
> > > 22861 pbulk      0    0     0K    0K RUN/0      0:00  0.00%  0.00% bjam
> > >    747 root     117    0    20M 2080K tstile/8   0:00  0.00%  0.00% cron
> > > 16473 pbulk    117    0    18M 1564K tstile/2   0:00  0.00%  0.00% cp
> > >   9705 pbulk    117    0    15M 1564K bioloc/5   0:00  0.00%  0.00% cp
> > >   7301 pbulk    117    0    15M 1560K tstile/2   0:00  0.00%  0.00% cp
> > > 22971 pbulk    117    0    19M 1520K tstile/1   0:00  0.00%  0.00% cp
> > > 10013 pbulk    117    0    15M 1520K tstile/1   0:00  0.00%  0.00% cp
> > >   3411 pbulk    117    0    15M 1520K tstile/3   0:00  0.00%  0.00% cp
> > >   5212 pbulk    117    0    15M 1520K tstile/2   0:00  0.00%  0.00% cp
> > >   7072 pbulk    117    0    18M 1516K tstile/2   0:00  0.00%  0.00% cp
> > >   8880 pbulk    117    0    15M 1516K tstile/2   0:00  0.00%  0.00% cp
> > >   5869 pbulk    117    0    15M 1516K tstile/0   0:00  0.00%  0.00% cp
> > > 10159 pbulk    117    0    15M 1516K tstile/1   0:00  0.00%  0.00% cp
> > > 11783 pbulk    117    0    15M 1516K tstile/7   0:00  0.00%  0.00% cp
> > >   7205 pbulk    117    0    15M 1512K tstile/1   0:00  0.00%  0.00% cp
> > > 18676 pbulk    109    0    15M 1516K tstile/3   0:00  0.00%  0.00% cp
> > >   7802 pbulk    109    0    15M 1516K tstile/2   0:00  0.00%  0.00% cp
> > >    622 pbulk    109    0    15M 1512K tstile/2   0:00  0.00%  0.00% cp
> > > 29434 pbulk    109    0  9576K  680K tstile/2   0:00  0.00%  0.00% cp
> > >   2686 root      85    0    86M 6824K select/2   0:00  0.00%  0.00% sshd
> > > 10052 root      85    0    89M 6784K select/2   0:00  0.00%  0.00% sshd
> > >    674 root      85    0    70M 5056K wait/18    0:00  0.00%  0.00% login
> > > 19345 wiz       85    0    86M 4960K select/3   0:00  0.00%  0.00% sshd
> > >    652 postfix   85    0    57M 4848K kqueue/4   0:00  0.00%  0.00% qmgr
> > >   4466 postfix   85    0    59M 4560K kqueue/0   0:00  0.00%  0.00% pickup
> > >    441 root      85    0    70M 3412K select/2   0:00  0.00%  0.00% sshd
> > >    656 root      85    0    57M 3328K kqueue/0   0:00  0.00%  0.00% master
> > >    278 root      85    0    45M 2232K nfsd/31    0:00  0.00%  0.00% nfsd
> > >    639 root      85    0    16M 2128K pause/0    0:00  0.00%  0.00% ksh
> > > 21402 root      85    0    20M 1988K wait/0     0:00  0.00%  0.00% sh
> > > 23371 root      85    0    20M 1972K wait/0     0:00  0.00%  0.00% sh
> > >   3940 wiz       85    0    16M 1948K pause/23   0:00  0.00%  0.00% ksh
> > >   8843 wiz       85    0    16M 1948K pause/5    0:00  0.00%  0.00% ksh
> > >    227 root      85    0    20M 1940K select/1   0:00  0.00%  0.00% rpcbind
> > >    698 root      85    0    20M 1836K ttyraw/3   0:00  0.00%  0.00% getty
> > >    542 root      85    0    20M 1832K ttyraw/2   0:00  0.00%  0.00% getty
> > >    535 root      85    0    20M 1832K ttyraw/0   0:00  0.00%  0.00% getty
> > >    531 root      85    0    25M 1644K kqueue/3   0:00  0.00%  0.00% inetd
> > >    329 root      85    0    24M 1524K select/2   0:00  0.00%  0.00% mountd
> > >    436 root      85    0    20M 1516K kqueue/2   0:00  0.00%  0.00% powerd
> > > 
> > > On the console I see that it's currently trying to build
> > > boost-headers, so it's not even something compile-heavy.
> > > 
> > > The machine is still in this state and I have a PS/2 keyboard
> > > attached, so let me know if you want to check something out.
> > > 
> > > I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
> > > The kernel config is
> > > 
> > > include "arch/amd64/conf/GENERIC"
> > > options FONT_GO_MONO12x23
> > > no options FONT_BOLD16x32
> > > no options FONT_BOLD8x16
> > > 
> > > It's a 16-core AMD Threadripper system with 128GB RAM.
> > > 
> > > What could go wrong here? I'm running out of ideas.
> > >   Thomas
> > 
> 
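
For anyone triaging a similar hang from saved top output: the listing quoted above is easier to compare between runs once the STATE column is tallied by wait channel. The following is only a rough sketch, not part of the original report. It assumes the 11-column top layout shown in this thread (PID first, STATE in the seventh column as channel/cpu) and strips any leading "> " quote markers. The function name tally_wait_channels and the SAMPLE text are made up for illustration; the sample rows are copied from the listing above. Note that top truncates the channel name, so ddb's "biolock" shows up as "bioloc".

```python
from collections import Counter


def tally_wait_channels(top_output: str) -> Counter:
    """Count processes per wait channel in top-style output (hypothetical helper)."""
    counts = Counter()
    for line in top_output.splitlines():
        # Drop mail quote markers such as "> > > " before splitting.
        fields = line.lstrip("> ").split()
        # Process rows have 11 columns and begin with a numeric PID;
        # this skips the column header and the summary lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        # STATE looks like "tstile/2": wait channel, then CPU number.
        channel = fields[6].split("/", 1)[0]
        counts[channel] += 1
    return counts


# A few rows copied from the top listing in this thread.
SAMPLE = """\
> > >    PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
> > > 12120 wiz      109    0    83M   59M tstile/1 165:46  1.46%  1.46% systat
> > >    747 root     117    0    20M 2080K tstile/8   0:00  0.00%  0.00% cron
> > >   9705 pbulk    117    0    15M 1564K bioloc/5   0:00  0.00%  0.00% cp
> > > 10918 wiz       43    0    25M 2872K CPU/3      0:01  0.00%  0.00% top
"""

if __name__ == "__main__":
    for channel, n in tally_wait_channels(SAMPLE).most_common():
        print(f"{channel:10s} {n}")
```

A tally like this only shows where processes are sleeping, not which lock holder they are waiting for; that still needs the ddb backtraces.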

