Port-macppc archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: TSTILE lockups on disk dump/backups
Another instance of the tstile hang with dump/gzip/nfs
ps -axl and vmstat-s below: (machine name is mercy)
> +mercy$ ps -axl
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
> 0 0 0 0 125 0 0 36844 uvm DKl ? 8:23.50 [system]
> 0 1 0 0 85 0 6644 1688 wait Is ? 0:00.32 init
> 0 38 1 0 85 0 6532 1808 nanoslp I ? 0:00.12 /usr/libexec/getty std.9600 ttyE2
> 0 107 367 1317 85 0 19328 7380 select Is ? 0:00.37 sshd: donlee [priv]
> 0 172 1 0 85 0 12712 2428 kqueue Ss ? 0:14.56 /usr/sbin/syslogd
> 0 202 1 0 85 0 30816 13452 kqueue Is ? 0:12.95 /usr/pkg/sbin/named
> 0 255 1 0 85 0 12648 11688 pause Ss ? 0:08.38 /usr/sbin/ntpd -p /var/run/ntpd.pid
> 0 343 1 0 85 0 6624 1568 select Is ? 0:00.00 /usr/sbin/lpd -s
> 0 367 1 0 85 0 18528 3048 select Is ? 0:09.63 sshd: /usr/sbin/sshd [listener] 0 of 10-100 sta
> 0 553 1 270 85 0 6092 1684 ttyraw I ? 0:00.00 /usr/libexec/getty std.38400 ttyZ0
> 0 565 1 0 85 0 6532 1808 nanoslp I ? 0:00.05 /usr/libexec/getty std.9600 ttyE1
> 0 566 1 0 85 0 6528 1656 kqueue Is ? 0:00.04 /usr/sbin/inetd -l
> 1043 597 659 0 85 0 19812 5232 select I ? 0:00.30 sshd: screview
> 1001 598 107 0 85 0 19872 5268 select I ? 0:00.70 sshd: donlee@pts/0 (sshd)
> 0 604 1 0 85 0 7276 1752 nanoslp Ss ? 0:00.32 /usr/sbin/cron
> 0 613 1 0 85 0 6532 1808 nanoslp I ? 0:00.05 /usr/libexec/getty std.9600 ttyE3
> 0 635 1 0 85 0 17136 2764 kqueue Ss ? 0:01.65 /usr/libexec/postfix/master -w
> 12 654 635 0 85 0 17240 4980 kqueue I ? 0:00.25 qmgr -l -t unix -u
> 0 656 1 0 85 0 28428 10672 kqueue Is ? 0:00.21 /usr/pkg/sbin/lwresd
> 0 659 367 457 85 0 19212 7252 select Is ? 0:00.25 sshd: screview [priv]
> 999 3231 3916 0 85 0 36192 4228 socket I ? 0:00.07 /usr/pkg/sbin/httpd -k start
> 12 3877 635 0 85 0 16984 4848 kqueue I ? 0:00.03 pickup -l -t unix -u
> 0 3916 1 0 85 0 97904 20532 select Ss ? 0:03.20 /usr/pkg/sbin/httpd -k start
> 0 4184 24698 1966 75 20 40640 33848 pause IN ? 0:14.70 dump -t -0 -u -f - -L usr0-241230 -a /usr
> 0 5695 367 0 85 0 19264 7440 select Is ? 0:00.26 sshd: donlee [priv]
> 999 6831 3916 0 85 0 98944 10548 semwait I ? 0:00.09 /usr/pkg/sbin/httpd -k start
> 0 7068 13979 0 85 1 5768 1280 pipe_rd IN ? 0:00.01 tee -a /var/log/dump.log
> 999 7083 3916 0 85 0 98384 9340 kqueue S ? 0:00.05 /usr/pkg/sbin/httpd -k start
> 0 7120 13979 30971 77 1 6780 1396 pause IN ? 0:00.01 /bin/ksh /root/tapedump/dodumps
> 1001 7261 5695 0 85 0 19776 5276 select I ? 0:00.46 sshd: donlee@pts/2 (sshd)
> 0 7427 323 8 85 0 6128 1548 nanoslp S ? 0:00.00 sleep 150
> 0 7719 367 0 85 0 19264 7440 select Is ? 0:00.25 sshd: donlee [priv]
> 999 8401 3916 0 85 0 98416 10272 semwait I ? 0:00.09 /usr/pkg/sbin/httpd -k start
> 0 9902 13979 0 85 1 8252 2596 pipe_rd IN ? 0:00.01 mail -s mercy weekly remote backup for 241230 r
> 0 13979 29553 81 85 1 6772 1812 pause IN ? 0:00.01 /bin/ksh /root/tapedump/dodumps
> 0 17715 7120 26444 69 20 40640 35328 wait IN ? 0:01.52 dump -t -0 -u -f - -L usr0-241230 -a /usr
> 999 20660 3916 0 85 0 98416 10280 semwait I ? 0:00.09 /usr/pkg/sbin/httpd -k start
> 1001 21854 7719 0 85 0 19776 5276 select S ? 0:00.57 sshd: donlee@pts/1 (sshd)
> 0 22564 24698 2757 75 20 40640 33848 pipe_wr IN ? 0:14.75 dump -t -0 -u -f - -L usr0-241230 -a /usr
> 0 24698 17715 1430 75 20 40764 31228 netio IN ? 0:05.81 dump -t -0 -u -f - -L usr0-241230 -a /usr
> 999 25784 3916 0 85 0 99184 11072 semwait I ? 0:00.26 /usr/pkg/sbin/httpd -k start
> 0 25989 7120 31728 109 1 7144 2108 tstile DN ? 6:36.25 gzip -c
> 0 28732 24698 2313 75 20 40640 33848 pause IN ? 0:14.90 dump -t -0 -u -f - -L usr0-241230 -a /usr
> 999 29067 3916 0 85 0 98384 9328 semwait I ? 0:00.05 /usr/pkg/sbin/httpd -k start
> 0 29553 19597 0 85 1 6680 1964 wait INs ? 0:00.01 sh
> 999 29617 3916 0 85 0 98784 10136 semwait I ? 0:00.02 /usr/pkg/sbin/httpd -k start
> 999 29830 3916 0 85 0 99104 11688 semwait I ? 0:00.41 /usr/pkg/sbin/httpd -k start
> 999 29967 3916 0 85 0 99104 11696 semwait I ? 0:00.33 /usr/pkg/sbin/httpd -k start
> 1001 439 598 855 85 0 7092 1924 pause Is pts/0 0:00.02 -ksh
> 0 786 439 0 85 0 7048 1924 ttyraw I+ pts/0 0:00.22 ksh
> 0 6952 26465 0 85 0 7064 1928 pause S pts/1 0:00.21 ksh
> 0 25197 6952 0 43 0 6896 1680 - O+ pts/1 0:00.01 ps -axl
> 1001 26465 21854 0 85 0 7108 1928 pause Is pts/1 0:00.02 -ksh
> 0 2841 6887 0 85 0 8760 2988 ttyraw I+ pts/2 0:00.20 vi /var/log/dump.log
> 0 6887 21829 0 85 0 7048 1924 pause I pts/2 0:00.20 ksh
> 32767 19597 1 0 85 0 6256 1324 wait I pts/2 0:00.00 /usr/libexec/atrun
> 1001 21829 7261 0 85 0 7092 1924 pause Is pts/2 0:00.02 -ksh
> 0 323 1 8 85 0 6788 1776 pause S ttyE0- 0:00.98 /bin/ksh /usr/local/apache/admin/watch_ap
> 0 541 1 270 85 0 6284 1800 ttyraw Is+ ttyE0 0:00.01 /usr/libexec/getty std.9600 ttyE0
> +mercy$ ps -axl | grep tsti
> 0 25989 7120 31728 109 1 7144 2108 tstile DN ? 6:36.25 gzip -c
> +mercy$ vmstat -s
> 4096 bytes per page
> 1 page color
> 250844 pages managed
> 885 pages free
> 153061 pages active
> 67759 pages inactive
> 0 pages paging
> 2936 pages wired
> 0 zero pages
> 1 reserve pagedaemon pages
> 5 reserve kernel pages
> 4915 boot kernel pages
> 25412 kernel pool pages
> 23900 anonymous pages
> 175666 cached file pages
> 24190 cached executable pages
> 256 minimum free pages
> 341 target free pages
> 83614 maximum wired pages
> 1 swap devices
> 1048697 swap pages
> 0 swap pages in use
> 0 swap allocations
> 20473827 total faults taken
> 20880765 traps
> 13907464 device interrupts
> 16647155 CPU context switches
> 6573417 software interrupts
> 0 system calls
> 0 pagein requests
> 0 pageout requests
> 0 pages swapped in
> 0 pages swapped out
> 26581 forks total
> 779 forks blocked parent
> 779 forks shared address space with parent
> 0 pagealloc zero wanted and avail
> 2320493 pagealloc zero wanted and not avail
> 0 aborts of idle page zeroing
> 6961321 pagealloc desired color avail
> 0 pagealloc desired color not avail
> 6961321 pagealloc local cpu avail
> 0 pagealloc local cpu not avail
> 0 faults with no memory
> 0 faults with no anons
> 0 faults had to wait on pages
> 0 faults found released page
> 5108 faults relock (5108 ok)
> 9027836 anon page faults
> 0 anon retry faults
> 1934161 amap copy faults
> 2080679 neighbour anon page faults
> 27697269 neighbour object page faults
> 8951102 locked pager get faults
> 5105 unlocked pager get faults
> 7566670 anon faults
> 1412090 anon copy on write faults
> 7688062 object faults
> 1263040 promote copy faults
> 2320090 promote zero fill faults
> 1056 times daemon wokeup
> 1056 revolutions of the clock hand
> 1173219 pages freed by daemon
> 1432806 pages scanned by daemon
> 0 anonymous pages scanned by daemon
> 1173219 object pages scanned by daemon
> 7994 pages reactivated
> 18 pages found busy by daemon
> 0 total pending pageouts
> 1734493 pages deactivated
> 20226193 total name lookups
> 18536033 good hits
> 1298945 negative hits
> 522 bad hits
> 0 false hits
> 348649 miss
> 42044 too long
> 251453 pass2 hits
> 256376 2passes
> cache hits (91% pos + 6% neg) system 1% per-process
> deletions 0%, falsehits 0%, toolong 0%
>
> +mercy$
>
-dgl-
> On Dec 21, 2024, at 9:21 AM, Martin Husemann <martin%duskware.de@localhost> wrote:
>
> On Sat, Dec 21, 2024 at 02:20:23AM -0600, Don Lee wrote:
>> Any thought about how I should deal with this? My kernel debugging
>> skills are weak. Is there anything I can do to produce useful debug
>> information next time this happens?
>
> A tstile just means the process/kernel thread is blocked for a longer
> time waiting for "something". In this context that would typically be
> either a disk buffer ("buf") read from disk (the source you are backing
> up) or a network buffer ("mbuf") handed to the network device output
> queue.
>
> This would point to a bug in the disk driver or the network driver.
>
> The other option would be your system runing out of kernel address space,
> so it can not allocate either type of buffer. This would show up in
> the output of vmstat -s (or you are running against limits for either
> type of buffer, but I would expect that to fail differently).
>
> Can you start new processes in the state when it hangs? Like the vmstat -s
> mentioned above, or "crash" to look at kernel state? I am not sure "crash"
> would work on macppc and 9.x good enough.
>
> Martin
+mercy$ ps -axl
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
0 0 0 0 125 0 0 67804 uvm DKl ? 96:15.96 [system]
0 1 0 0 85 0 6644 284 wait Is ? 0:04.78 init
0 38 1 0 85 0 6532 1808 nanoslp I ? 0:00.97 /usr/libexec/getty std.9600 ttyE1
0 172 1 0 85 0 11928 2424 kqueue Ss ? 3:46.75 /usr/sbin/syslogd
0 202 1 0 85 0 31920 15264 kqueue Is ? 3:10.08 /usr/pkg/sbin/named
0 237 1 142 85 0 6624 1568 select Is ? 0:00.00 /usr/sbin/lpd -s
0 373 1 0 85 0 11832 10872 pause Ss ? 1:53.24 /usr/sbin/ntpd -p /var/run/ntpd.pid
0 378 1 1826 85 0 18528 3048 select Is ? 2:31.59 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
0 462 1 0 85 0 6532 732 nanoslp I ? 0:01.18 /usr/libexec/getty std.9600 ttyE2
0 560 1 0 85 0 28428 10672 kqueue Is ? 0:00.95 /usr/pkg/sbin/lwresd
0 575 1 0 85 0 6528 1656 kqueue Is ? 0:00.19 /usr/sbin/inetd -l
0 606 1 0 85 0 17136 2740 kqueue Ss ? 0:22.73 /usr/libexec/postfix/master -w
12 635 606 0 85 0 16744 4912 kqueue I ? 0:02.88 qmgr -l -t unix -u
0 648 1 0 85 0 7276 1752 nanoslp Ss ? 0:04.53 /usr/sbin/cron
0 786 1 228 85 0 6092 40 ttyraw I ? 0:00.00 /usr/libexec/getty std.38400 ttyZ0
0 805 1 0 85 0 6532 1808 nanoslp I ? 0:00.67 /usr/libexec/getty std.9600 ttyE3
999 981 24032 0 85 0 98912 11400 semwait S ? 0:00.60 /usr/pkg/sbin/httpd -k start
1043 1754 15863 0 85 0 19780 5272 select S ? 0:03.16 sshd: screview
0 4206 27475 679 85 0 6128 1548 nanoslp I ? 0:00.00 sleep 150
12 4681 606 0 85 0 17008 4916 kqueue I ? 0:00.03 proxymap -t unix -u
999 5475 24032 0 85 0 99136 11808 semwait S ? 0:00.57 /usr/pkg/sbin/httpd -k start
1001 6977 21060 0 85 0 19872 5284 select S ? 0:00.48 sshd: donlee@pts/2 (sshd)
999 7535 24032 0 85 0 98752 10756 kqueue S ? 0:00.08 /usr/pkg/sbin/httpd -k start
999 8921 24032 0 85 0 99104 11420 semwait S ? 0:00.54 /usr/pkg/sbin/httpd -k start
1001 8972 27821 0 85 0 19776 5284 select I ? 0:00.50 sshd: donlee@pts/0 (sshd)
12 11260 606 0 85 0 16984 4848 kqueue S ? 0:00.03 pickup -l -t unix -u
999 14429 24032 967 85 0 36192 4280 socket I ? 0:00.07 /usr/pkg/sbin/httpd -k start
0 15863 378 22 85 0 19212 7340 select Is ? 0:00.25 sshd: screview [priv]
0 15888 378 960 85 0 19232 7424 select Is ? 0:00.36 sshd: donlee [priv]
12 16463 606 0 85 0 17816 5412 kqueue I ? 0:00.03 smtpd -n smtp -t inet -u -o stress= -s 2
1001 16576 18110 0 85 0 19872 5280 select I ? 0:00.33 sshd: donlee@pts/3 (sshd)
0 18110 378 4580 84 0 19328 7440 select Is ? 0:00.37 sshd: donlee [priv]
12 18451 606 0 85 0 16976 4832 kqueue S ? 0:00.03 anvil -l -t unix -u
999 18588 24032 0 85 0 98608 10588 semwait S ? 0:00.08 /usr/pkg/sbin/httpd -k start
0 21060 378 4284 84 0 19328 7440 select Is ? 0:00.38 sshd: donlee [priv]
1001 22743 15888 0 85 0 19776 5280 select I ? 0:00.79 sshd: donlee@pts/1 (sshd)
999 23402 24032 0 85 0 98752 10820 semwait S ? 0:00.27 /usr/pkg/sbin/httpd -k start
0 24032 1 0 85 0 97904 22004 select Ss ? 0:03.20 /usr/pkg/sbin/httpd -k start
0 27475 1 679 85 0 6788 1840 pause I ? 0:01.78 /bin/ksh /usr/local/apache/admin/watch_ap
0 27821 378 263 85 0 19232 7428 select Is ? 0:00.37 sshd: donlee [priv]
999 29660 24032 0 85 0 98752 10716 semwait S ? 0:00.31 /usr/pkg/sbin/httpd -k start
1001 5918 8972 0 85 0 7092 1924 ttyraw Is+ pts/0 0:00.03 -ksh
0 80 1 0 76 19 40640 30044 pause IN+ pts/1 0:00.44 dump -t -0 -u -f - -L usr0-241229 -a /usr
0 3132 10445 0 85 0 5768 1280 pipe_rd I+ pts/1 0:00.01 tee -a /var/log/dump.log
1001 5899 22743 0 85 0 7128 1936 pause Is pts/1 0:00.02 -ksh
0 7387 10445 0 85 0 6780 1376 pause I+ pts/1 0:00.01 /bin/ksh ./dodumps
0 10445 16554 0 85 0 6772 1836 pause I+ pts/1 0:00.01 /bin/ksh ./dodumps
0 16554 5899 0 85 0 7048 1928 pause I pts/1 0:00.33 ksh
0 21288 7387 0 117 0 7144 2108 tstile D+ pts/1 0:10.81 gzip -c
0 29716 10445 0 85 0 8508 2788 pipe_rd I+ pts/1 0:00.01 mail -s mercy weekly remote backup for 241229 root
1001 21766 6977 0 85 0 7092 1924 pause Is pts/2 0:00.03 -ksh
0 26577 29993 0 43 0 6896 1680 - O+ pts/2 0:00.01 ps -axl
0 29993 21766 0 85 0 7048 1924 pause S pts/2 0:00.20 ksh
0 8984 12914 0 85 0 7048 1924 ttyraw I+ pts/3 0:00.19 ksh
1001 12914 16576 113 85 0 7092 1924 pause Is pts/3 0:00.02 -ksh
0 694 1 0 85 0 6284 1800 ttyraw Is+ ttyE0 0:00.01 /usr/libexec/getty std.9600 ttyE0
+mercy$ ^C -axl
+mercy$ vmstat -s
4096 bytes per page
1 page color
250844 pages managed
2222 pages free
140041 pages active
69363 pages inactive
0 pages paging
2730 pages wired
0 zero pages
1 reserve pagedaemon pages
5 reserve kernel pages
4915 boot kernel pages
35578 kernel pool pages
23598 anonymous pages
164759 cached file pages
23777 cached executable pages
256 minimum free pages
341 target free pages
83614 maximum wired pages
1 swap devices
1048697 swap pages
0 swap pages in use
0 swap allocations
225500305 total faults taken
228149211 traps
156287160 device interrupts
50546946 CPU context switches
63050311 software interrupts
0 system calls
0 pagein requests
0 pageout requests
0 pages swapped in
0 pages swapped out
392258 forks total
9727 forks blocked parent
9727 forks shared address space with parent
0 pagealloc zero wanted and avail
32805864 pagealloc zero wanted and not avail
0 aborts of idle page zeroing
80964508 pagealloc desired color avail
0 pagealloc desired color not avail
80964508 pagealloc local cpu avail
0 pagealloc local cpu not avail
0 faults with no memory
0 faults with no anons
0 faults had to wait on pages
0 faults found released page
15090 faults relock (15090 ok)
52280209 anon page faults
0 anon retry faults
29181270 amap copy faults
29921894 neighbour anon page faults
421666821 neighbour object page faults
137180816 locked pager get faults
14978 unlocked pager get faults
30782693 anon faults
21465518 anon copy on write faults
118058653 object faults
19122163 promote copy faults
32802454 promote zero fill faults
5143 times daemon wokeup
5143 revolutions of the clock hand
5724141 pages freed by daemon
6745369 pages scanned by daemon
0 anonymous pages scanned by daemon
5724141 object pages scanned by daemon
33115 pages reactivated
233 pages found busy by daemon
0 total pending pageouts
6899800 pages deactivated
314091251 total name lookups
287646312 good hits
20255856 negative hits
4097 bad hits
0 false hits
5508051 miss
676935 too long
3818493 pass2 hits
3889436 2passes
cache hits (91% pos + 6% neg) system 1% per-process
deletions 0%, falsehits 0%, toolong 0%
+mercy$
Home |
Main Index |
Thread Index |
Old Index