NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/57198: after heavy i/o, system stalls with many processes stuck in tstile state
>Number: 57198
>Category: kern
>Synopsis: after heavy i/o, system stalls with many processes stuck in tstile state
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jan 25 03:10:01 +0000 2023
>Originator: Tyler Mitchell
>Release: NetBSD 10.99.2 (kernel) / 9.3 (userland)
>Organization:
>Environment:
NetBSD aldea 10.99.2 NetBSD 10.99.2 (GENERIC) #0: Mon Jan 23 13:40:04 UTC 2023 mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/sparc64/compile/GENERIC sparc64
>Description:
I am running NetBSD/sparc64 as a VM using QEMU 7.2.0. Originally I installed NetBSD 9.3, but replaced the kernel with a daily HEAD build from 23-jan-2023.
The host system is macOS Ventura 13.1 (22C65) on a MacBook Pro M1 (arm64). QEMU was installed via homebrew package.
The system is mostly stable, however, after periods of heavy I/O, the system will become unresponsive (eg launching a new command will just hang forever). I have been able to reliably trigger this condition by compiling a pkgsrc package while running /etc/daily.
I left top(1) running and noticed some processes were in the "tstile" state. Here is a ddb ps and backtrace of one of those processes:
db{0}> bt
sparc_interrupt(40250320, 404150c0, 40415530, 40250320, 0, 0) at netbsd:sparc_interrupt+0x294
db{0}> ps/w
PID LID COMMAND EMUL PRI WAIT-MSG WAIT-CHANNEL
1904 >1904 curl netbsd 43 0
1641 1641 utmp_update netbsd 43 tstile 104141678
424 424 sh netbsd 43 ttyout 105e7bc38
423 423 find netbsd 33 biowait 106fad7d0
1824 1824 sh netbsd 40 wait 104e58758
479 479 sh netbsd 43 wait 104e591d8
1116 1116 sh netbsd 42 wait 103f33558
463 463 tmux netbsd 43 kqueue 10427ea60
457 457 ninja netbsd 42 biowait 106fad6b0
305 305 make netbsd 38 wait 10430e418
493 493 sh netbsd 39 wait 103f331d8
195 195 make netbsd 39 wait 10430fc98
172 172 top netbsd 43 select 1033da840
1331 1331 sh netbsd 43 wait 104e58ad8
1341 1341 tmux netbsd 43 wait 103f32758
1259 1259 login netbsd 39 wait 103f323d8
1504 1504 chronyd netbsd 43 select 1033da840
1173 1173 cron netbsd 43 nanoslp 1046d2ec0
1394 1394 qmgr netbsd 41 kqueue 10427f2a0
1148 1148 pickup netbsd 43 kqueue 1047241a0
1322 1322 master netbsd 43 kqueue 1047250a0
868 868 sshd netbsd 43 select 1033da840
408 408 syslogd netbsd 43 kqueue 104259e20
1 1 init netbsd 38 wait 103f32058
0 475 system netbsd 96 nfsiod 104340808
0 511 system netbsd 96 nfsiod 104340748
0 474 system netbsd 96 nfsiod 104340688
0 470 system netbsd 96 nfsiod 1043405c8
0 89 system netbsd 123 physiod 103f31108
0 103 system netbsd 125 pooldrain 1cbc938
0 102 system netbsd 124 biowait 1042e05d0
0 101 system netbsd 126 pgdaemon 1cba580
0 98 system netbsd 43 swwreboot 103f30c08
0 59 system netbsd 96 npfgcw 103f48688
0 58 system netbsd 222 rt_free 1033e1ac8
0 57 system netbsd 96 unpgc 1ccafc0
0 56 system netbsd 222 icmp6_wqinput 1033e1988
0 55 system netbsd 222 nd6_timer 1033e1848
0 54 system netbsd 222 carp6_wqinput 1033e1708
0 53 system netbsd 222 carp_wqinput 1033e15c8
0 31 system netbsd 222 icmp_wqinput 1033e1488
0 30 system netbsd 222 rt_timer 1033e1348
0 29 system netbsd 125 vmem_rehash 1033e1208
0 20 system netbsd 43 entropy 1c71f68
0 19 system netbsd 96 atath 103efc9c0
0 18 system netbsd 96 atath 103efc3d0
0 17 system netbsd 96 pmsreset 103ee3b94
0 16 system netbsd 96 smtaskq 1c9ce48
0 15 system netbsd 43 pmfsuspend 1033e0808
0 14 system netbsd 43 pmfevent 1033e06c8
0 13 system netbsd 96 sopendfr 1ccaf48
0 12 system netbsd 222 ifwdog 1033e0588
0 11 system netbsd 222 iflnkst 1033e0448
0 10 system netbsd 43 nfssilly 1033e0308
0 9 system netbsd 125 vdrain 1ccc0e0
0 8 system netbsd 125 mod_unld 1cbf158
0 7 system netbsd 127 xcall 2014010
0 6 system netbsd 223 0
0 5 system netbsd 220 0
0 4 system netbsd 221 0
0 3 system netbsd 222 0
0 2 system netbsd 0 0
0 0 system netbsd 125 uvm 1c5ff40
db{0}> x/Lx 104141678
104141678: 103f41ac5
db{0}> bt/a 103f41ac0
trace: pid 0 lid 102 at 0x22451ee51
sleepq_block(0, 0, 1c5ef88, 0, 1cbec00, 103f41ac0) at netbsd:sleepq_block+0xa4
cv_wait(1042e05d0, 1033ca640, 1b13c00, 1b10400, 0, 103f41ac0) at netbsd:cv_wait+0x3c
biowait(1042e04e0, 1042e0600, 1b13c00, 1b13c00, fffffffffffffffe, 1042e05d0) at netbsd:biowait+0x34
wapbl_buffered_flush(104141600, 1, 0, 1042e05a8, 0, 1042e04e0) at netbsd:wapbl_buffered_flush+0x168
wapbl_write_commit(104141600, 37ea00, 1cc000, 1c61c00, 0, 1042ea8c0) at netbsd:wapbl_write_commit+0xc0
wapbl_flush(104141600, 0, 104141678, 104141670, 1, 37ea00) at netbsd:wapbl_flush+0x710
ffs_sync(103f2d000, 2, 1033e8040, 0, 0, 0) at netbsd:ffs_sync+0x174
VFS_SYNC(103f2d000, 3, 1033e8040, 1c7c000, 1, 103f2d000) at netbsd:VFS_SYNC+0x24
sched_sync(1ccc018, e0048000, 1c61eec, 1033e41b0, 1ccc020, 0) at netbsd:sched_sync+0x120
lwp_trampoline(ffd0d92c, 1140d0, 1173b8, 1, ffe91c98, 0) at netbsd:lwp_trampoline+0x8
db{0}>
Here is the dmesg of the affected machine:
db{0}> dmesg
[ 1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
[ 1.0000000] 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
[ 1.0000000] 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
[ 1.0000000] The NetBSD Foundation, Inc. All rights reserved.
[ 1.0000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
[ 1.0000000] The Regents of the University of California. All rights reserved.
[ 1.0000000] NetBSD 10.99.2 (GENERIC) #0: Mon Jan 23 13:40:04 UTC 2023
[ 1.0000000] mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/sparc64/compile/GENERIC
[ 1.0000000] total memory = 3840 MB
[ 1.0000000] avail memory = 3750 MB
[ 1.0000000] timecounter: Timecounters tick every 10.000 msec
[ 1.0000000] Kernelized RAIDframe activated
[ 1.0000000] mainbus0 (root): OpenBiosTeam,OpenBIOS: hostid 80800400
[ 1.0000000] cpu0 at mainbus0: SUNW,UltraSPARC-IIi @ 100 MHz, CPU id 0
[ 1.0000000] cpu0: manuf 17, impl 12, mask 91
[ 1.0000000] cpu0: 256K instruction (64 b/l), 16K data (32 b/l), 256K external (64 b/l)
[ 1.0000000] psycho0 at mainbus0
[ 1.0000000] psycho0: SUNW,sabre: impl 0, version 0: ign 7c0 bus range 0 to 2; PCI bus 0
[ 1.0000000] pci0 at psycho0
[ 1.0000000] pci0: i/o space, memory space enabled
[ 1.0000000] ppb0 at pci0 dev 1 function 1: Sun Microsystems Simba PCI Bridge (rev. 0x11)
[ 1.0000000] pci1 at ppb0 bus 1
[ 1.0000000] pci1: i/o space, memory space enabled
[ 1.0000000] ebus0 at pci1 dev 1 function 0: Sun Microsystems PCIO Ebus2, revision 0x01
[ 1.0000000] clock0 at ebus0 addr 2000-3fff: mk48t59
[ 1.0000000] power at ebus0 addr 7240-7243 ipl 1 not configured
[ 1.0000000] fdthree at ebus0 addr 0-ffffffff not configured
[ 1.0000000] com0 at ebus0 addr 3f8-3ff ipl 2b: ns16550a, 16-byte FIFO
[ 1.0000000] com0: console
[ 1.0000000] pckbc0 at ebus0 addr 60-67 ipl 29autoconfiguration error: : no intr 1: selftest ok
[ 1.0000000] pckbd0 at pckbc0 (kbd slot)
[ 1.0000000] wskbd1 at pckbd0 mux 1
[ 1.0000000] pms0 at pckbc0 (aux slot)
[ 1.0000000] wsmouse0 at pms0 mux 0
[ 1.0000000] hme0 at pci1 dev 1 function 1: Sun Happy Meal Ethernet, rev. 0
[ 1.0000000] hme0: interrupting at ivec 3021
[ 1.0000000] hme0: Ethernet address 52:54:00:80:04:00
[ 1.0000000] nsphy0 at hme0 phy 1: DP83840 10/100 media interface, rev. 1
[ 1.0000000] nsphy0: 100baseTX-FDX, auto
[ 1.0000000] genfb0 at pci1 dev 2 function 0: vendor 1234 product 1111 (rev. 0x02)
[ 1.0000000] cmdide0 at pci1 dev 3 function 0: CMD Technology PCI0646 (rev. 0x07)
[ 1.0000000] cmdide0: bus-master DMA support present
[ 1.0000000] cmdide0: primary channel configured to native-PCI mode, channel non-independent
[ 1.0000000] cmdide0: using ivec 1820 for native-PCI interrupt
[ 1.0000000] atabus0 at cmdide0 channel 0
[ 1.0000000] cmdide0: secondary channel configured to native-PCI mode, channel non-independent
[ 1.0000000] atabus1 at cmdide0 channel 1
[ 1.0000000] ppb1 at pci0 dev 1 function 0: Sun Microsystems Simba PCI Bridge (rev. 0x11)
[ 1.0000000] pci2 at ppb1 bus 2
[ 1.0000000] pci2: i/o space, memory space enabled
[ 1.0000000] pcons at mainbus0 not configured
[ 1.0000000] WARNING: system needs entropy for security; see entropy(7)
[ 1.0000000] timecounter: Timecounter "tick-counter" frequency 100000000 Hz quality 100
[ 1.0000030] No counter-timer -- using %tick at 100MHz as system clock.
[ 1.0000030] timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
[ 4.0733830] wd0 at atabus0 drive 0
[ 4.0835200] wd0: <QEMU HARDDISK>
[ 4.0835200] wd0: drive supports 16-sector PIO transfers, LBA48 addressing
[ 4.0835200] wd0: 4096 MB, 8322 cyl, 16 head, 63 sec, 512 bytes/sect x 8388608 sectors
[ 4.1029250] wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
[ 4.1143940] wd0(cmdide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
[ 4.1143940] wd1 at atabus1 drive 0
[ 4.1143940] wd1: <QEMU HARDDISK>
[ 4.1143940] wd1: drive supports 16-sector PIO transfers, LBA48 addressing
[ 4.1235730] wd1: 8192 MB, 16644 cyl, 16 head, 63 sec, 512 bytes/sect x 16777216 sectors
[ 4.1235730] wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
[ 4.1235730] wd1(cmdide0:1:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
[ 4.1325440] swwdog0: software watchdog initialized
[ 4.1414680] WARNING: 1 error while detecting hardware; check system log.
[ 4.1414680] root on wd0a dumps on wd0b
[ 4.1518890] /: replaying log to memory
[ 4.1518890] root file system type: ffs
[ 4.1518890] kern.module.path=/stand/sparc64/10.99.2/modules
[ 5.2935370] /: replaying log to disk
[ 16.2327440] nfs server 192.168.93.81:/a/pkgsrc: not responding
[ 22.1324590] nfs server 192.168.93.81:/a/pkgsrc: is alive again
db{0}>
This is the command I use to launch the guest:
qemu-system-sparc64 -m 3840 \
-drive file=aldea.img,format=raw \
-drive file=aldea2.qcow2,bus=1 \
-net nic,netdev=n0,macaddr=52:54:00:80:04:00 -netdev vde,id=n0,sock=vde0 \
-nographic
>How-To-Repeat:
- Install NetBSD/sparc64 in QEMU.
- Generate lots of disk I/O.
- Notice that the system becomes unresponsive to new I/O requests, and does not recover from this condition.
>Fix:
Home |
Main Index |
Thread Index |
Old Index