Current-Users archive
Re: recurring tstile hangs on -current
When I tried to get a core, I saw:
> reboot 0x104
dumping to dev 168,2 (offset=73677660, size=33524130)
dump ahcisata0 port 5: clearing WDCTL_RST failed for drive 0
wddump: device timed out
i/o error
rebooting...
Thomas
On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
>
> I checked some process states in ddb.
>
> "master", the 2 "bjam" and at least one "cp" hanging in tstile have:
> sleepq_block()
> turnstile_block()
> rw_vector_enter()
> genfs_lock()
> VOP_LOCK()
> vn_lock()
> namei_tryemulroot()
> namei()
> check_exec()
> execve_loadvm()
> execve1()
> syscall()
>
> These look quite similar to your backtraces.
>
> The "cp" hanging in biolock has:
> sleepq_block
> cv_timedwait
> bbusy
> getblk
> bio_doread
> ffs_init_vnode
> ffs_newvnode
> vcache_new
> ufs_makeinode
> ufs_create
> VOP_CREATE
> vn_open
> do_open
> do_sys_openat
> sys_open
> syscall
>
> I can't agree with the statement that it's a general -current problem
> -- my current working machine does not have this issue. It "only" has
> 32GB and 12 cores though, and no nvme. dmesg attached.
>
> Do you see the issue on machines without nvme? Just to eliminate that.
> (I wanted to try replacing the nvme boot disk next.)
> Thomas
>
>
> On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> > Hi Thomas,
> >
> > glad that this is observed elsewhere.
> >
> > Maybe the following bugs resonate with your observations:
> >
> > kern/54207 [serious/high]:
> > -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
> > looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> >
> > kern/54210 [serious/high]:
> > NetBSD-8 processes presumably not exiting
> > not tested with -current, but may be there too. (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, ~380GB)
> >
> > At this time I am not too confident that -current can reliably do a pkgsrc build, though I have occasionally seen bulk builds that did finish.
> > Most of the time I run into hard lockups with no information about the system state available (no console, no X, no network, no DDB).
> >
> > Frank
> >
> >
> > On 06/28/19 10:46, Thomas Klausner wrote:
> > > Hi!
> > >
> > > I've set up a new machine for bulk building. I have tried various
> > > things, but in the end it always hangs in tstile.
> > >
> > > First try was what I currently use: tmpfs sandboxes with nullfs
> > > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > > nullfs' fault. (The same setup works fine on my current machine.)
> > >
> > > The second try was tmpfs with copied-in /bin, /lib, ... and
> > > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > > hung. So the suspicion was that tmpfs or NFS are broken.
> > >
> > > The last try was building in the root file system, i.e. not even a
> > > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > > finished one pkgsrc build (where some packages didn't build because of
> > > missing distfiles, or because they randomly break like rust). When I
> > > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> > >
> > > I have the following systat output:
> > >
> > > 2 users Load 8.78 7.19 3.62 Fri Jun 28 04:27:32
> > >
> > > Proc:r d s Csw Traps SysCal Intr Soft Fault PAGING SWAPPING
> > > 24 10 7548 265849 157956 3504 2399 265476 in out in out
> > > ops
> > > 56.2% Sy 1.2% Us 0.0% Ni 0.0% In 42.5% Id pages
> > > | | | | | | | | | | |
> > > ============================> 670 forks
> > > fkppw
> > > Anon 294104 % zero 62161268 5572 Interrupts fksvm
> > > Exec 14116 % wired 16296 1968 TLB shootdown pwait
> > > File 24587740 18% inact 43756 100 cpu0 timer relck
> > > Meta 2606694 % bufs 495676 msi1 vec 0 rlkok
> > > (kB) real swaponly free 9 msix2 vec 0 noram
> > > Active 24835908 100033996 9 msix2 vec 1 57262 ndcpy
> > > Namei Sys-cache Proc-cache msix2 vec 2 27906 fltcp
> > > Calls hits % hits % 3427 ioapic1 pin 12 87178 zfod
> > > 125076 122834 98 80 0 59 ioapic2 pin 0 35775 cow
> > > msix7 vec 0 8192 fmin
> > > Disks: seeks xfers bytes %busy 10922 ftarg
> > > ld0 1969 16130K 34.8 itarg
> > > dk0 1969 16130K 34.8 flnan
> > > wd0 pdfre
> > > dk1 pdscn
> > > dk2
> > >
> > > and this from top:
> > >
> > > load averages: 5.13, 6.53, 3.56; up 1+16:08:05 04:28:13
> > > 59 processes: 2 runnable, 55 sleeping, 2 on CPU
> > > CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.9% idle
> > > Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
> > > Swap: 163G Total, 163G Free
> > >
> > > PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
> > > 10353 pbulk 77 0 185M 172M select/0 0:13 4.74% 4.54% bjam
> > > 12120 wiz 109 0 83M 59M tstile/1 165:46 1.46% 1.46% systat
> > > 0 root 0 0 0K 93M CPU/31 35:39 0.00% 0.00% [system]
> > > 219 root 85 0 32M 2676K kqueue/4 7:34 0.00% 0.00% syslogd
> > > 13354 wiz 85 0 89M 4948K select/0 0:52 0.00% 0.00% sshd
> > > 380 root 85 0 30M 16M pause/4 0:04 0.00% 0.00% ntpd
> > > 10918 wiz 43 0 25M 2872K CPU/3 0:01 0.00% 0.00% top
> > > 1 root 85 0 20M 1756K wait/29 0:01 0.00% 0.00% init
> > > 5594 pbulk 0 0 0K 0K RUN/0 0:00 0.00% 0.00% bjam
> > > 22861 pbulk 0 0 0K 0K RUN/0 0:00 0.00% 0.00% bjam
> > > 747 root 117 0 20M 2080K tstile/8 0:00 0.00% 0.00% cron
> > > 16473 pbulk 117 0 18M 1564K tstile/2 0:00 0.00% 0.00% cp
> > > 9705 pbulk 117 0 15M 1564K bioloc/5 0:00 0.00% 0.00% cp
> > > 7301 pbulk 117 0 15M 1560K tstile/2 0:00 0.00% 0.00% cp
> > > 22971 pbulk 117 0 19M 1520K tstile/1 0:00 0.00% 0.00% cp
> > > 10013 pbulk 117 0 15M 1520K tstile/1 0:00 0.00% 0.00% cp
> > > 3411 pbulk 117 0 15M 1520K tstile/3 0:00 0.00% 0.00% cp
> > > 5212 pbulk 117 0 15M 1520K tstile/2 0:00 0.00% 0.00% cp
> > > 7072 pbulk 117 0 18M 1516K tstile/2 0:00 0.00% 0.00% cp
> > > 8880 pbulk 117 0 15M 1516K tstile/2 0:00 0.00% 0.00% cp
> > > 5869 pbulk 117 0 15M 1516K tstile/0 0:00 0.00% 0.00% cp
> > > 10159 pbulk 117 0 15M 1516K tstile/1 0:00 0.00% 0.00% cp
> > > 11783 pbulk 117 0 15M 1516K tstile/7 0:00 0.00% 0.00% cp
> > > 7205 pbulk 117 0 15M 1512K tstile/1 0:00 0.00% 0.00% cp
> > > 18676 pbulk 109 0 15M 1516K tstile/3 0:00 0.00% 0.00% cp
> > > 7802 pbulk 109 0 15M 1516K tstile/2 0:00 0.00% 0.00% cp
> > > 622 pbulk 109 0 15M 1512K tstile/2 0:00 0.00% 0.00% cp
> > > 29434 pbulk 109 0 9576K 680K tstile/2 0:00 0.00% 0.00% cp
> > > 2686 root 85 0 86M 6824K select/2 0:00 0.00% 0.00% sshd
> > > 10052 root 85 0 89M 6784K select/2 0:00 0.00% 0.00% sshd
> > > 674 root 85 0 70M 5056K wait/18 0:00 0.00% 0.00% login
> > > 19345 wiz 85 0 86M 4960K select/3 0:00 0.00% 0.00% sshd
> > > 652 postfix 85 0 57M 4848K kqueue/4 0:00 0.00% 0.00% qmgr
> > > 4466 postfix 85 0 59M 4560K kqueue/0 0:00 0.00% 0.00% pickup
> > > 441 root 85 0 70M 3412K select/2 0:00 0.00% 0.00% sshd
> > > 656 root 85 0 57M 3328K kqueue/0 0:00 0.00% 0.00% master
> > > 278 root 85 0 45M 2232K nfsd/31 0:00 0.00% 0.00% nfsd
> > > 639 root 85 0 16M 2128K pause/0 0:00 0.00% 0.00% ksh
> > > 21402 root 85 0 20M 1988K wait/0 0:00 0.00% 0.00% sh
> > > 23371 root 85 0 20M 1972K wait/0 0:00 0.00% 0.00% sh
> > > 3940 wiz 85 0 16M 1948K pause/23 0:00 0.00% 0.00% ksh
> > > 8843 wiz 85 0 16M 1948K pause/5 0:00 0.00% 0.00% ksh
> > > 227 root 85 0 20M 1940K select/1 0:00 0.00% 0.00% rpcbind
> > > 698 root 85 0 20M 1836K ttyraw/3 0:00 0.00% 0.00% getty
> > > 542 root 85 0 20M 1832K ttyraw/2 0:00 0.00% 0.00% getty
> > > 535 root 85 0 20M 1832K ttyraw/0 0:00 0.00% 0.00% getty
> > > 531 root 85 0 25M 1644K kqueue/3 0:00 0.00% 0.00% inetd
> > > 329 root 85 0 24M 1524K select/2 0:00 0.00% 0.00% mountd
> > > 436 root 85 0 20M 1516K kqueue/2 0:00 0.00% 0.00% powerd
> > >
> > > On the console I see that it's currently trying to build
> > > boost-headers, so it's not even something compile-heavy.
> > >
> > > The machine is still in this state and I have a PS/2 keyboard
> > > attached, so let me know if you want to check something out.
> > >
> > > I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
> > > The kernel config is
> > >
> > > include "arch/amd64/conf/GENERIC"
> > > options FONT_GO_MONO12x23
> > > no options FONT_BOLD16x32
> > > no options FONT_BOLD8x16
> > >
> > > It's a 16-core AMD Threadripper system with 128GB RAM.
> > >
> > > What could go wrong here? I'm running out of ideas.
> > > Thomas
> >
>
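
To see at a glance how many processes in the quoted top listing are stuck on each wait channel, the STATE column can be tallied. This is only a minimal sketch, assuming Python 3 and the column order shown in the listing (PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND). The function name `tally_wait_channels` is hypothetical, and the sample is a handful of lines copied from the listing, not the whole output.

```python
from collections import Counter


def tally_wait_channels(top_lines):
    """Count processes per wait channel from top(1) process lines.

    Assumes STATE is the 7th column and looks like "tstile/2"
    (wait channel or run state, then the CPU number).
    """
    counts = Counter()
    for line in top_lines:
        # Strip mail quoting ("> > > ") before splitting into columns.
        fields = line.lstrip("> ").split()
        # Skip the header, blank lines, and anything that is not a process row.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        state = fields[6]
        counts[state.split("/")[0]] += 1
    return counts


# A few rows copied from the top output quoted in this thread.
sample = [
    "> > > PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND",
    "> > > 12120 wiz 109 0 83M 59M tstile/1 165:46 1.46% 1.46% systat",
    "> > > 747 root 117 0 20M 2080K tstile/8 0:00 0.00% 0.00% cron",
    "> > > 16473 pbulk 117 0 18M 1564K tstile/2 0:00 0.00% 0.00% cp",
    "> > > 9705 pbulk 117 0 15M 1564K bioloc/5 0:00 0.00% 0.00% cp",
    "> > > 10353 pbulk 77 0 185M 172M select/0 0:13 4.74% 4.54% bjam",
]

for channel, n in tally_wait_channels(sample).most_common():
    print(f"{channel:8s} {n}")
```

Feeding it the full listing (for example from a saved `top -b` capture) should make the tstile pile-up and the single biolock holder easy to spot.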