NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/57621: Updated 10.0_BETA macppc MP kernel prone to hangs



>Number:         57621
>Category:       kern
>Synopsis:       Updated 10.0_BETA macppc MP kernel prone to hangs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Sep 19 21:40:01 +0000 2023
>Originator:     he%NetBSD.org@localhost
>Release:        NetBSD 10.0_BETA
>Organization:
 I Try...
>Environment:
	
	
System: (it's still wedged, so can't do "uname -a", but it's a
	relatively recent macppc 10.0_BETA kernel and user-land)
Architecture: powerpc
Machine: macppc
>Description:
	It appears that a reasonably up-to-date 10.0_BETA kernel is
	prone to hangs on this particular system.  A -current kernel
	built from sources dating June 8 2022 appears to be more
	stable than a 10.0_BETA kernel from e.g. April 10 2023 or a
	more recently updated 10.0_BETA kernel.

	The annoying thing is that I can't get DDB to work on this
	system.

	The symptom is that simply doing a "build.sh -j 2 ... release"
	job will with nearly probability 1 trigger a wedge where a few
	processes get stuck in "tstile" state, and this disease then
	spreads to other activity on the machine which grinds to a
	wedge.

	However, not *everything* comes to a halt, e.g. a remote xterm
	with 'top' running may continue to run.  Same for 'systat vm',
	and also same with 'crash'.

	A fairly typical 'top' output (when it's wedged, cut+pasted
	just now, while the host wedged a couple of days ago):

load averages:  0.01,  0.02,  0.00;               up 2+11:22:17        23:27:20
83 processes: 81 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 911M Act, 560M Inact, 14M Wired, 23M Exec, 1368M File, 249M Free
Swap: 2000M Total, 40K Used, 2000M Free / Pools: 235M Used

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
    0 root      43    0     0K   83M dbcool/1   11:17  1.90%  1.90% [system]
 3361 he        85    0  9996K 1652K ttyraw/0   11:49  0.00%  0.00% systat
 1242 he        85    0    17M 6308K select/1    5:14  0.00%  0.00% xterm
 1851 he        43    0    10M 2180K CPU/1       2:28  0.00%  0.00% top
 1255 he        85    0    24M 6696K poll/0      1:56  0.00%  0.00% sshd
 1200 he        85    0    17M 3828K select/1    1:36  0.00%  0.00% xterm
 1229 he        85    0    33M 2836K poll/0      1:26  0.00%  0.00% sshd
 5367 root      43    0  8024K 2508K CPU/0       0:55  0.00%  0.00% top
 1217 he        85    0    17M 7120K select/1    0:20  0.00%  0.00% xterm
 1253 he        85    0    23M 1632K poll/1      0:17  0.00%  0.00% sshd
  800 root      85    0    15M   14M pause/0     0:10  0.00%  0.00% ntpd
  207 root      83    4  9564K  648K poll/0      0:08  0.00%  0.00% nbmake
13699 he        85    0    22M 6132K poll/0      0:06  0.00%  0.00% sshd
24480 root      85    0  6284K  320K kqueue/0    0:06  0.00%  0.00% tail
18249 he       117    0    18M 4912K tstile/0    0:05  0.00%  0.00% xterm
 8682 he        85    0    23M 2252K poll/1      0:05  0.00%  0.00% sshd
20207 root      83    4  9404K 2724K poll/0      0:04  0.00%  0.00% nbmake
 2217 root      83    4  9404K 2656K poll/0      0:04  0.00%  0.00% nbmake
27442 he        85    0    17M 3260K select/1    0:03  0.00%  0.00% xterm
24118 he       117    0    23M 1748K tstile/1    0:02  0.00%  0.00% sshd
13350 root      85    0    25M 3196K ttyraw/1    0:01  0.00%  0.00% crash
 1085 root     117    0    19M 1192K tstile/0    0:00  0.00%  0.00% master
 1075 root     117    0  7824K 1068K tstile/0    0:00  0.00%  0.00% cron
 1254 postfix  117    0    20M  828K tstile/1    0:00  0.00%  0.00% qmgr
 2221 root     108    4  4112K  420K tstile/1    0:00  0.00%  0.00% nbsed
 1105 root     108    4  2016K  216K tstile/0    0:00  0.00%  0.00% sh
27957 root     108    4  1608K  176K tstile/0    0:00  0.00%  0.00% nbgrep
 9720 root      85    0    22M 9396K poll/1      0:00  0.00%  0.00% sshd
12768 root      85    0    22M 9296K poll/0      0:00  0.00%  0.00% sshd
 1218 root      85    0    22M 9188K poll/0      0:00  0.00%  0.00% sshd
 1236 root      85    0    22M 8916K poll/1      0:00  0.00%  0.00% sshd
20726 he        85    0    17M 6580K select/1    0:00  0.00%  0.00% xterm
12983 he        85    0    23M 6496K poll/0      0:00  0.00%  0.00% sshd
...

	A couple of things can be read out here:

         * Several processes are stuck in 'tstile' wait channel (as
           described above).
         * There does not *appear* to be a drastic physical memory
           shortfall, and swap is barely used.

	The last point made me think that "this can't be VM related",
	but it looks like I was wrong.

	I also managed to collect "ps" and, I think, "bt/a" traces of
	all the processes stuck in 'tstile' at the time of this run.
	The output from those can be seen here:

crash> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
1105  1105 3   0         0           3347a100                 sh tstile
27957 27957 3   0         0           231a5c40             nbgrep tstile
2221  2221 3   1         0           1d2b5380              nbsed tstile
15261 15261 3   0       180           534e3080                 sh wait
23256 23256 3   0       180           61dbc3c0             nbmake poll
10572 10572 3   0       180           57015d00                 sh wait
17709 17709 3   0       180           34e79a00             nbmake poll
22661 22661 3   0       180           107803c0             nbmake wait
10847 10847 3   1         0           695a10c0             pickup tstile
15598 15598 3   1       180           5c5666c0                 sh wait
28627 28627 3   0       180           23276d00             nbmake poll
25303 25303 3   1       180           5c5660c0                 sh wait
29847 29847 3   0       180           61dbc6c0             nbmake poll
5367  5367 3   0       180           34e79d00                top select
320    320 3   1       180           1d2b5c80               tcsh pause
2526  2526 3   0       180           15287d00               tcsh pause
13699 13699 3   1       180           15287700               sshd poll
9720  9720 3   1       180           1389ec80               sshd poll
21947 21947 3   1       180           1d7420c0              crash ttyraw
28553 28553 3   0       180           10778040               tcsh pause
12880 12880 3   1       180           21080c40               tcsh pause
1679  1679 3   0       180           5c5663c0               sshd poll
1191  1191 3   1       180           21125080               sshd poll
19615 19615 3   1       180           3343c040                 sh wait
2217  2217 3   0       180           3347a400             nbmake poll
1584  1584 3   1       180           3347aa00                 sh wait
20207 20207 3   0       180           10780cc0             nbmake poll
11398 11398 3   0       180           61dbccc0                 sh wait
207    207 3   0       180           534e3380             nbmake poll
13301 13301 3   0       180           3347ad00                 sh wait
24480 24480 3   0       180           205d9940               tail kqueue
23505 23505 3   0       180           22168cc0               tcsh pause
21730 21730 3   1       180           1d7426c0               tcsh ttyraw
20726 20726 3   1       180           5c5669c0              xterm select
7409  7409 3   0       180           34e79100               tcsh pause
12983 12983 3   0       180           23276100               sshd poll
24443 24443 3   1       180           534e3680               sshd poll
13350>13350 7   0       100           57015400              crash
17585 17585 3   0       180           3343c940               tcsh ttyraw
14989 14989 3   0       180           21080340               tcsh pause
27442 27442 3   0       1c0           61dbc0c0              xterm select
27802 27802 3   0       180           1d2b5080               tcsh pause
24118 24118 3   0       180           215b50c0               sshd poll
12768 12768 3   0       180           1d742cc0               sshd poll
8163  8163 3   0       180           231a5340               tcsh pause
10465 10465 3   0       180           21125c80               tcsh pause
18249 18249 3   1       1c0           21080040              xterm poll
10566 10566 3   0       180           21125380               tcsh pause
8682  8682 3   1       180           215b5cc0               sshd poll
10625 10625 3   1       180           34e79400               sshd poll
3361  3361 3   0       180           1d2b5980             systat ttyraw
1851  1851 3   0       180           3343cc40                top select
1103  1103 3   1       180           10778c40               tcsh pause
1378  1378 3   0       180           221689c0               tcsh pause
1200  1200 3   1       180           10338700              xterm select
1277  1277 3   0       180           10778940               tcsh pause
1319  1319 3   0       180           10338d00               tcsh pause
1229  1229 3   1       180           10778640               sshd poll
1241  1241 3   0       180           102ae9c0               tcsh pause
1243  1243 3   1       180           1389e980               sshd poll
1242  1242 3   0       180           107800c0              xterm select
1232  1232 3   0       180           10338a00               tcsh pause
1255  1255 3   0       180           21c79c80               sshd poll
1236  1236 3   1       180           221683c0               sshd poll
1217  1217 3   0       180           1389e380              xterm select
1223  1223 3   0       180           221686c0               tcsh pause
1253  1253 3   0       180           1389e080               sshd poll
1218  1218 3   0       180           221680c0               sshd poll
1084  1084 3   0       180           21c79980              getty ttyraw
1112  1112 3   0       180           21c79680              getty nanoslp
1066  1066 3   0       180           21c79380              getty nanoslp
1275  1275 3   0       180           21c79080              getty nanoslp
1086  1086 3   1       180           1022f100              getty ttyraw
1075  1075 3   0        40           205d9c40               cron tstile
1070  1070 3   1       180           205d9640              inetd kqueue
1254  1254 3   0       180           205d9340               qmgr kqueue
1085  1085 3   0   1000040           10778340             master tstile
825    825 3   0       180           15287100               sshd poll
708    708 3   0       180           107809c0             powerd kqueue
800    800 3   0       180           107806c0               ntpd pause
604    604 3   0       180           1389e680            syslogd kqueue
1        1 3   0       180           102ae0c0               init wait
0      198 3   1       200           102aecc0            physiod physiod
0      166 3   1       200           1022f700          pooldrain pooldrain
0      165 3   0       200           10338400            ioflush syncer
0      164 3   0       200           10338100           pgdaemon pgdaemon
0      126 3   1       200           7f87cc80          swwreboot swwreboot
0      124 3   1       200           102ae6c0          atapibus0 sccomp
0      122 3   0       200           102196c0               usb0 usbevt
0      121 3   0       200           102ae3c0               usb1 usbevt
0      120 3   0       200           1022f400             npfgc0 npfgcw
0      119 3   1       200           1028cc80            rt_free rt_free
0      118 3   1       200           1028c980              unpgc unpgc
0      117 3   0       200           1028c680    key_timehandler key_timehandler

0      116 3   1       200           1028c380    icmp6_wqinput/1 icmp6_wqinput
0      115 3   0       200           1028c080    icmp6_wqinput/0 icmp6_wqinput
0      114 3   0       200           1027ec40          nd6_timer nd6_timer
0      113 3   1       200           1027e940    carp6_wqinput/1 carp6_wqinput
0      112 3   0       200           1027e640    carp6_wqinput/0 carp6_wqinput
0      111 3   1       200           1027e340     carp_wqinput/1 carp_wqinput
0      110 3   0       200           1027e040     carp_wqinput/0 carp_wqinput
0      109 3   1       200           1022fd00     icmp_wqinput/1 icmp_wqinput
0      108 3   0       200           102199c0     icmp_wqinput/0 icmp_wqinput
0      107 3   0       200           1022fa00           rt_timer rt_timer
0      106 3   0       200           10219cc0        vmem_rehash vmem_rehash
0       97 3   1       200           102193c0            lmtemp0 lmtemp0
0       96 3   0       200           102190c0            dbcool0 dbcool0
0       30 3   0       200           7f87c980          entbutler entropy
0       29 3   1       380           7f87c680           fw0probe ieee1394
0       28 3   0       240           7f87c380            atabus2 atath
0       27 3   1       200           7f87c080         usbtask-dr usbtsk
0       26 3   0       200           7f901c40         usbtask-hc usbtsk
0       25 3   1       240           7f901940            atabus1 atath
0       24 3   1       200           7f901640            atabus0 atath
0       23 3   0       200           7f901340                pmu wait
0       22 3   1       200           7f901040            xcall/1 xcall
0       21 1   1       200           7f90ed00          softser/1
0       20 1   1     40200           7f90ea00          softclk/1
0       19 1   1       200           7f90e700          softbio/1
0       18 1   1       200           7f90e400          softnet/1
0       17 1   1       201           7f90e100             idle/1
0       16 3   0       200           7f91acc0             sysmon smtaskq
0       15 3   0       200           7f91a9c0         pmfsuspend pmfsuspend
0       14 3   0       200           7f91a6c0           pmfevent pmfevent
0       13 3   0       200           7f91a3c0         sopendfree sopendfr
0       12 3   0       200           7f91a0c0             ifwdog ifwdog
0       11 3   0       200           7fb26c80            iflnkst iflnkst
0       10 3   0       200           7fb26980           nfssilly nfssilly
0        9 3   1       240           7fb26680             vdrain vdrain
0        8 3   0       200           7fb26380          modunload mod_unld
0        7 3   0       200           7fb26080            xcall/0 xcall
0        6 1   0       200           7fb30c40          softser/0
0        5 1   0     40200           7fb30940          softclk/0
0        4 1   0       200           7fb30640          softbio/0
0        3 1   0     40200           7fb30340          softnet/0
0    >   2 1   0       201           7fb30040             idle/0
0        0 3   1       200             c32dc0            swapper uvm
crash> 
crash> bt/a 1027ec40
trace: pid 0 lid 114 at 0x10267de0
0x10267e40: at cpu_switchto+0x28
0x10267e50: at mi_switch+0x140
0x10267e90: at sleepq_block+0xe0
0x10267eb0: at cv_wait+0xc0
0x10267ee0: at workqueue_worker+0x12c
0x10267f20: at cpu_lwp_bootstrap+0xc
0x10267fe8: kernel trap 0 by __end+0xf5fabf8: srr1=0x111f8340
            r1=0x2 cr=0x5 xer=0x40000 ctr=0xfffffff0
crash> bt/a 1d2b5380
trace: pid 2221 lid 2221 at 0x4da13bc0
0x4da13c20: at cpu_switchto+0x28
0x4da13c30: at mi_switch+0x140
0x4da13c70: at sleepq_block+0xe0
0x4da13c90: at turnstile_block+0x284
0x4da13cf0: at rw_enter+0x124
0x4da13d30: at uvm_fault_internal+0xa3c
0x4da13e70: at trap+0x674
0x4da13f20: user 0xfded1660: srr1=0xd032
            r1=0xffffe1e0 cr=0x84002008 xer=0x20000000 ctr=0xfded1650
crash> bt/a 231a5c40
trace: pid 27957 lid 27957 at 0x21583bc0
0x21583c20: at cpu_switchto+0x28
0x21583c30: at mi_switch+0x140
0x21583c70: at sleepq_block+0xe0
0x21583c90: at turnstile_block+0x284
0x21583cf0: at rw_enter+0x124
0x21583d30: at uvm_fault_internal+0xa3c
0x21583e70: at trap+0x674
0x21583f20: user 0xfded3d80: srr1=0xd032
            r1=0xffffda10 cr=0x84008004 xer=0x20000000 ctr=0
crash> bt/a 695a10c0
trace: pid 10847 lid 10847 at 0x5523ec10
0x5523ec70: at cpu_switchto+0x28
0x5523ec80: at mi_switch+0x140
0x5523ecc0: at sleepq_block+0xe0
0x5523ece0: at turnstile_block+0x284
0x5523ed40: at rw_enter+0x124
0x5523ed80: at uvm_unmap_remove+0x158
0x5523edc0: at uvmspace_free+0xe0
0x5523ede0: at exit1+0x1a8
0x5523eea0: at sys_exit+0x44
0x5523eec0: at syscall+0x294
0x5523ef20: user SC trap #1 by 0xfd201548: srr1=0xd032
            r1=0xffffe590 cr=0x24442442 xer=0x20000000 ctr=0xfd201540
crash> 
crash> bt/a 10778340
trace: pid 1085 lid 1085 at 0x107abbc0
0x107abc20: at cpu_switchto+0x28
0x107abc30: at mi_switch+0x140
0x107abc70: at sleepq_block+0xe0
0x107abc90: at turnstile_block+0x284
0x107abcf0: at rw_enter+0x124
0x107abd30: at uvm_fault_internal+0xa3c
0x107abe70: at trap+0x674
0x107abf20: user 0x18153e0: srr1=0xd032
            r1=0xffffd9b0 cr=0x28022444 xer=0 ctr=0xfd2016b8
crash> 
crash> bt/a 205d9c40
trace: pid 1075 lid 1075 at 0x17513ca0
0x17513d00: at cpu_switchto+0x28
0x17513d10: at mi_switch+0x140
0x17513d50: at sleepq_block+0xe0
0x17513d70: at turnstile_block+0x284
0x17513dd0: at rw_enter+0x124
0x17513e10: at uvmspace_fork+0x35c
0x17513e40: at uvm_proc_fork+0x34
0x17513e50: at fork1+0x334
0x17513ec0: at syscall+0x294
0x17513f20: user SC trap #2 by 0xfdede44c: srr1=0xd032
            r1=0xffffe8c0 cr=0x22000422 xer=0x20000000 ctr=0xfded7ba4
crash> 
crash> bt/a 3347a100
trace: pid 1105 lid 1105 at 0x1eb23bc0
0x1eb23c20: at cpu_switchto+0x28
0x1eb23c30: at mi_switch+0x140
0x1eb23c70: at sleepq_block+0xe0
0x1eb23c90: at turnstile_block+0x284
0x1eb23cf0: at rw_enter+0x124
0x1eb23d30: at uvm_fault_internal+0xa3c
0x1eb23e70: at trap+0x674
0x1eb23f20: user 0xfded3d80: srr1=0xd032
            r1=0xffffda30 cr=0x84008004 xer=0x20000000 ctr=0
crash> 

	This, somewhat unexpectedly, points to "something" in the UVM
	subsystem which might be amiss(?), but figuring out what is
	above my pay grade.

	Guidance for next steps in narrowing down the actual root
	cause gratefully accepted.

>How-To-Repeat:
	Update a dual-CPU macppc G4 system to recent 10.0_BETA code,
	and do a "build.sh -j 2 ... release" run.  The probability of
	it getting stuck sometime along the build is nearly 1.

>Fix:
	Sorry, don't know.

>Unformatted:
 	
 	


Home | Main Index | Thread Index | Old Index