Port-macppc archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: TSTILE lockups on disk dump/backups



Another instance of the tstile hang with dump/gzip/nfs

ps -axl and vmstat-s below: (machine name is mercy)

> +mercy$ ps -axl
>   UID   PID  PPID   CPU PRI NI   VSZ   RSS WCHAN   STAT TTY       TIME COMMAND
>     0     0     0     0 125  0     0 36844 uvm     DKl  ?      8:23.50 [system]
>     0     1     0     0  85  0  6644  1688 wait    Is   ?      0:00.32 init 
>     0    38     1     0  85  0  6532  1808 nanoslp I    ?      0:00.12 /usr/libexec/getty std.9600 ttyE2 
>     0   107   367  1317  85  0 19328  7380 select  Is   ?      0:00.37 sshd: donlee [priv] 
>     0   172     1     0  85  0 12712  2428 kqueue  Ss   ?      0:14.56 /usr/sbin/syslogd 
>     0   202     1     0  85  0 30816 13452 kqueue  Is   ?      0:12.95 /usr/pkg/sbin/named 
>     0   255     1     0  85  0 12648 11688 pause   Ss   ?      0:08.38 /usr/sbin/ntpd -p /var/run/ntpd.pid 
>     0   343     1     0  85  0  6624  1568 select  Is   ?      0:00.00 /usr/sbin/lpd -s 
>     0   367     1     0  85  0 18528  3048 select  Is   ?      0:09.63 sshd: /usr/sbin/sshd [listener] 0 of 10-100 sta
>     0   553     1   270  85  0  6092  1684 ttyraw  I    ?      0:00.00 /usr/libexec/getty std.38400 ttyZ0 
>     0   565     1     0  85  0  6532  1808 nanoslp I    ?      0:00.05 /usr/libexec/getty std.9600 ttyE1 
>     0   566     1     0  85  0  6528  1656 kqueue  Is   ?      0:00.04 /usr/sbin/inetd -l 
>  1043   597   659     0  85  0 19812  5232 select  I    ?      0:00.30 sshd: screview 
>  1001   598   107     0  85  0 19872  5268 select  I    ?      0:00.70 sshd: donlee@pts/0 (sshd)
>     0   604     1     0  85  0  7276  1752 nanoslp Ss   ?      0:00.32 /usr/sbin/cron 
>     0   613     1     0  85  0  6532  1808 nanoslp I    ?      0:00.05 /usr/libexec/getty std.9600 ttyE3 
>     0   635     1     0  85  0 17136  2764 kqueue  Ss   ?      0:01.65 /usr/libexec/postfix/master -w 
>    12   654   635     0  85  0 17240  4980 kqueue  I    ?      0:00.25 qmgr -l -t unix -u 
>     0   656     1     0  85  0 28428 10672 kqueue  Is   ?      0:00.21 /usr/pkg/sbin/lwresd 
>     0   659   367   457  85  0 19212  7252 select  Is   ?      0:00.25 sshd: screview [priv] 
>   999  3231  3916     0  85  0 36192  4228 socket  I    ?      0:00.07 /usr/pkg/sbin/httpd -k start 
>    12  3877   635     0  85  0 16984  4848 kqueue  I    ?      0:00.03 pickup -l -t unix -u 
>     0  3916     1     0  85  0 97904 20532 select  Ss   ?      0:03.20 /usr/pkg/sbin/httpd -k start 
>     0  4184 24698  1966  75 20 40640 33848 pause   IN   ?      0:14.70 dump -t -0 -u -f - -L usr0-241230 -a /usr 
>     0  5695   367     0  85  0 19264  7440 select  Is   ?      0:00.26 sshd: donlee [priv] 
>   999  6831  3916     0  85  0 98944 10548 semwait I    ?      0:00.09 /usr/pkg/sbin/httpd -k start 
>     0  7068 13979     0  85  1  5768  1280 pipe_rd IN   ?      0:00.01 tee -a /var/log/dump.log 
>   999  7083  3916     0  85  0 98384  9340 kqueue  S    ?      0:00.05 /usr/pkg/sbin/httpd -k start 
>     0  7120 13979 30971  77  1  6780  1396 pause   IN   ?      0:00.01 /bin/ksh /root/tapedump/dodumps 
>  1001  7261  5695     0  85  0 19776  5276 select  I    ?      0:00.46 sshd: donlee@pts/2 (sshd)
>     0  7427   323     8  85  0  6128  1548 nanoslp S    ?      0:00.00 sleep 150 
>     0  7719   367     0  85  0 19264  7440 select  Is   ?      0:00.25 sshd: donlee [priv] 
>   999  8401  3916     0  85  0 98416 10272 semwait I    ?      0:00.09 /usr/pkg/sbin/httpd -k start 
>     0  9902 13979     0  85  1  8252  2596 pipe_rd IN   ?      0:00.01 mail -s mercy weekly remote backup for 241230 r
>     0 13979 29553    81  85  1  6772  1812 pause   IN   ?      0:00.01 /bin/ksh /root/tapedump/dodumps 
>     0 17715  7120 26444  69 20 40640 35328 wait    IN   ?      0:01.52 dump -t -0 -u -f - -L usr0-241230 -a /usr 
>   999 20660  3916     0  85  0 98416 10280 semwait I    ?      0:00.09 /usr/pkg/sbin/httpd -k start 
>  1001 21854  7719     0  85  0 19776  5276 select  S    ?      0:00.57 sshd: donlee@pts/1 (sshd)
>     0 22564 24698  2757  75 20 40640 33848 pipe_wr IN   ?      0:14.75 dump -t -0 -u -f - -L usr0-241230 -a /usr 
>     0 24698 17715  1430  75 20 40764 31228 netio   IN   ?      0:05.81 dump -t -0 -u -f - -L usr0-241230 -a /usr 
>   999 25784  3916     0  85  0 99184 11072 semwait I    ?      0:00.26 /usr/pkg/sbin/httpd -k start 
>     0 25989  7120 31728 109  1  7144  2108 tstile  DN   ?      6:36.25 gzip -c 
>     0 28732 24698  2313  75 20 40640 33848 pause   IN   ?      0:14.90 dump -t -0 -u -f - -L usr0-241230 -a /usr 
>   999 29067  3916     0  85  0 98384  9328 semwait I    ?      0:00.05 /usr/pkg/sbin/httpd -k start 
>     0 29553 19597     0  85  1  6680  1964 wait    INs  ?      0:00.01 sh 
>   999 29617  3916     0  85  0 98784 10136 semwait I    ?      0:00.02 /usr/pkg/sbin/httpd -k start 
>   999 29830  3916     0  85  0 99104 11688 semwait I    ?      0:00.41 /usr/pkg/sbin/httpd -k start 
>   999 29967  3916     0  85  0 99104 11696 semwait I    ?      0:00.33 /usr/pkg/sbin/httpd -k start 
>  1001   439   598   855  85  0  7092  1924 pause   Is   pts/0  0:00.02 -ksh 
>     0   786   439     0  85  0  7048  1924 ttyraw  I+   pts/0  0:00.22 ksh 
>     0  6952 26465     0  85  0  7064  1928 pause   S    pts/1  0:00.21 ksh 
>     0 25197  6952     0  43  0  6896  1680 -       O+   pts/1  0:00.01 ps -axl 
>  1001 26465 21854     0  85  0  7108  1928 pause   Is   pts/1  0:00.02 -ksh 
>     0  2841  6887     0  85  0  8760  2988 ttyraw  I+   pts/2  0:00.20 vi /var/log/dump.log 
>     0  6887 21829     0  85  0  7048  1924 pause   I    pts/2  0:00.20 ksh 
> 32767 19597     1     0  85  0  6256  1324 wait    I    pts/2  0:00.00 /usr/libexec/atrun 
>  1001 21829  7261     0  85  0  7092  1924 pause   Is   pts/2  0:00.02 -ksh 
>     0   323     1     8  85  0  6788  1776 pause   S    ttyE0- 0:00.98 /bin/ksh /usr/local/apache/admin/watch_ap 
>     0   541     1   270  85  0  6284  1800 ttyraw  Is+  ttyE0  0:00.01 /usr/libexec/getty std.9600 ttyE0 
> +mercy$ ps -axl | grep tsti
>     0 25989  7120 31728 109  1  7144  2108 tstile  DN   ?      6:36.25 gzip -c 
> +mercy$ vmstat -s
>      4096 bytes per page
>         1 page color
>    250844 pages managed
>       885 pages free
>    153061 pages active
>     67759 pages inactive
>         0 pages paging
>      2936 pages wired
>         0 zero pages
>         1 reserve pagedaemon pages
>         5 reserve kernel pages
>      4915 boot kernel pages
>     25412 kernel pool pages
>     23900 anonymous pages
>    175666 cached file pages
>     24190 cached executable pages
>       256 minimum free pages
>       341 target free pages
>     83614 maximum wired pages
>         1 swap devices
>   1048697 swap pages
>         0 swap pages in use
>         0 swap allocations
>  20473827 total faults taken
>  20880765 traps
>  13907464 device interrupts
>  16647155 CPU context switches
>   6573417 software interrupts
>         0 system calls
>         0 pagein requests
>         0 pageout requests
>         0 pages swapped in
>         0 pages swapped out
>     26581 forks total
>       779 forks blocked parent
>       779 forks shared address space with parent
>         0 pagealloc zero wanted and avail
>   2320493 pagealloc zero wanted and not avail
>         0 aborts of idle page zeroing
>   6961321 pagealloc desired color avail
>         0 pagealloc desired color not avail
>   6961321 pagealloc local cpu avail
>         0 pagealloc local cpu not avail
>         0 faults with no memory
>         0 faults with no anons
>         0 faults had to wait on pages
>         0 faults found released page
>      5108 faults relock (5108 ok)
>   9027836 anon page faults
>         0 anon retry faults
>   1934161 amap copy faults
>   2080679 neighbour anon page faults
>  27697269 neighbour object page faults
>   8951102 locked pager get faults
>      5105 unlocked pager get faults
>   7566670 anon faults
>   1412090 anon copy on write faults
>   7688062 object faults
>   1263040 promote copy faults
>   2320090 promote zero fill faults
>      1056 times daemon wokeup
>      1056 revolutions of the clock hand
>   1173219 pages freed by daemon
>   1432806 pages scanned by daemon
>         0 anonymous pages scanned by daemon
>   1173219 object pages scanned by daemon
>      7994 pages reactivated
>        18 pages found busy by daemon
>         0 total pending pageouts
>   1734493 pages deactivated
>  20226193 total name lookups
>  18536033 good hits
>   1298945 negative hits
>       522 bad hits
>         0 false hits
>    348649 miss
>     42044 too long
>    251453 pass2 hits
>    256376 2passes
>           cache hits (91% pos + 6% neg) system 1% per-process
>           deletions 0%, falsehits 0%, toolong 0%
> 
> +mercy$ 
> 
-dgl-

> On Dec 21, 2024, at 9:21 AM, Martin Husemann <martin%duskware.de@localhost> wrote:
> 
> On Sat, Dec 21, 2024 at 02:20:23AM -0600, Don Lee wrote:
>> Any thought about how I should deal with this? My kernel debugging
>> skills are weak. Is there anything I can do to produce useful debug
>> information next time this happens?
> 
> A tstile just means the process/kernel thread is blocked for a longer
> time waiting for "something". In this context that would typically be
> either a disk buffer ("buf") read from disk (the source you are backing
> up) or a network buffer ("mbuf") handed to the network device output
> queue.
> 
> This would point to a bug in the disk driver or the network driver.
> 
> The other option would be your system runing out of kernel address space,
> so it can not allocate either type of buffer. This would show up in 
> the output of vmstat -s (or you are running against limits for either
> type of buffer, but I would expect that to fail differently).
> 
> Can you start new processes in the state when it hangs? Like the vmstat -s 
> mentioned above, or "crash" to look at kernel state? I am not sure "crash"
> would work on macppc and 9.x good enough.
> 
> Martin

+mercy$ ps -axl
UID   PID  PPID  CPU PRI NI   VSZ   RSS WCHAN   STAT TTY       TIME COMMAND
  0     0     0    0 125  0     0 67804 uvm     DKl  ?     96:15.96 [system]
  0     1     0    0  85  0  6644   284 wait    Is   ?      0:04.78 init 
  0    38     1    0  85  0  6532  1808 nanoslp I    ?      0:00.97 /usr/libexec/getty std.9600 ttyE1 
  0   172     1    0  85  0 11928  2424 kqueue  Ss   ?      3:46.75 /usr/sbin/syslogd 
  0   202     1    0  85  0 31920 15264 kqueue  Is   ?      3:10.08 /usr/pkg/sbin/named 
  0   237     1  142  85  0  6624  1568 select  Is   ?      0:00.00 /usr/sbin/lpd -s 
  0   373     1    0  85  0 11832 10872 pause   Ss   ?      1:53.24 /usr/sbin/ntpd -p /var/run/ntpd.pid 
  0   378     1 1826  85  0 18528  3048 select  Is   ?      2:31.59 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
  0   462     1    0  85  0  6532   732 nanoslp I    ?      0:01.18 /usr/libexec/getty std.9600 ttyE2 
  0   560     1    0  85  0 28428 10672 kqueue  Is   ?      0:00.95 /usr/pkg/sbin/lwresd 
  0   575     1    0  85  0  6528  1656 kqueue  Is   ?      0:00.19 /usr/sbin/inetd -l 
  0   606     1    0  85  0 17136  2740 kqueue  Ss   ?      0:22.73 /usr/libexec/postfix/master -w 
 12   635   606    0  85  0 16744  4912 kqueue  I    ?      0:02.88 qmgr -l -t unix -u 
  0   648     1    0  85  0  7276  1752 nanoslp Ss   ?      0:04.53 /usr/sbin/cron 
  0   786     1  228  85  0  6092    40 ttyraw  I    ?      0:00.00 /usr/libexec/getty std.38400 ttyZ0 
  0   805     1    0  85  0  6532  1808 nanoslp I    ?      0:00.67 /usr/libexec/getty std.9600 ttyE3 
999   981 24032    0  85  0 98912 11400 semwait S    ?      0:00.60 /usr/pkg/sbin/httpd -k start 
1043  1754 15863    0  85  0 19780  5272 select  S    ?      0:03.16 sshd: screview 
  0  4206 27475  679  85  0  6128  1548 nanoslp I    ?      0:00.00 sleep 150 
 12  4681   606    0  85  0 17008  4916 kqueue  I    ?      0:00.03 proxymap -t unix -u 
999  5475 24032    0  85  0 99136 11808 semwait S    ?      0:00.57 /usr/pkg/sbin/httpd -k start 
1001  6977 21060    0  85  0 19872  5284 select  S    ?      0:00.48 sshd: donlee@pts/2 (sshd)
999  7535 24032    0  85  0 98752 10756 kqueue  S    ?      0:00.08 /usr/pkg/sbin/httpd -k start 
999  8921 24032    0  85  0 99104 11420 semwait S    ?      0:00.54 /usr/pkg/sbin/httpd -k start 
1001  8972 27821    0  85  0 19776  5284 select  I    ?      0:00.50 sshd: donlee@pts/0 (sshd)
 12 11260   606    0  85  0 16984  4848 kqueue  S    ?      0:00.03 pickup -l -t unix -u 
999 14429 24032  967  85  0 36192  4280 socket  I    ?      0:00.07 /usr/pkg/sbin/httpd -k start 
  0 15863   378   22  85  0 19212  7340 select  Is   ?      0:00.25 sshd: screview [priv] 
  0 15888   378  960  85  0 19232  7424 select  Is   ?      0:00.36 sshd: donlee [priv] 
 12 16463   606    0  85  0 17816  5412 kqueue  I    ?      0:00.03 smtpd -n smtp -t inet -u -o stress= -s 2 
1001 16576 18110    0  85  0 19872  5280 select  I    ?      0:00.33 sshd: donlee@pts/3 (sshd)
  0 18110   378 4580  84  0 19328  7440 select  Is   ?      0:00.37 sshd: donlee [priv] 
 12 18451   606    0  85  0 16976  4832 kqueue  S    ?      0:00.03 anvil -l -t unix -u 
999 18588 24032    0  85  0 98608 10588 semwait S    ?      0:00.08 /usr/pkg/sbin/httpd -k start 
  0 21060   378 4284  84  0 19328  7440 select  Is   ?      0:00.38 sshd: donlee [priv] 
1001 22743 15888    0  85  0 19776  5280 select  I    ?      0:00.79 sshd: donlee@pts/1 (sshd)
999 23402 24032    0  85  0 98752 10820 semwait S    ?      0:00.27 /usr/pkg/sbin/httpd -k start 
  0 24032     1    0  85  0 97904 22004 select  Ss   ?      0:03.20 /usr/pkg/sbin/httpd -k start 
  0 27475     1  679  85  0  6788  1840 pause   I    ?      0:01.78 /bin/ksh /usr/local/apache/admin/watch_ap 
  0 27821   378  263  85  0 19232  7428 select  Is   ?      0:00.37 sshd: donlee [priv] 
999 29660 24032    0  85  0 98752 10716 semwait S    ?      0:00.31 /usr/pkg/sbin/httpd -k start 
1001  5918  8972    0  85  0  7092  1924 ttyraw  Is+  pts/0  0:00.03 -ksh 
  0    80     1    0  76 19 40640 30044 pause   IN+  pts/1  0:00.44 dump -t -0 -u -f - -L usr0-241229 -a /usr 
  0  3132 10445    0  85  0  5768  1280 pipe_rd I+   pts/1  0:00.01 tee -a /var/log/dump.log 
1001  5899 22743    0  85  0  7128  1936 pause   Is   pts/1  0:00.02 -ksh 
  0  7387 10445    0  85  0  6780  1376 pause   I+   pts/1  0:00.01 /bin/ksh ./dodumps 
  0 10445 16554    0  85  0  6772  1836 pause   I+   pts/1  0:00.01 /bin/ksh ./dodumps 
  0 16554  5899    0  85  0  7048  1928 pause   I    pts/1  0:00.33 ksh 
  0 21288  7387    0 117  0  7144  2108 tstile  D+   pts/1  0:10.81 gzip -c 
  0 29716 10445    0  85  0  8508  2788 pipe_rd I+   pts/1  0:00.01 mail -s mercy weekly remote backup for 241229 root 
1001 21766  6977    0  85  0  7092  1924 pause   Is   pts/2  0:00.03 -ksh 
  0 26577 29993    0  43  0  6896  1680 -       O+   pts/2  0:00.01 ps -axl 
  0 29993 21766    0  85  0  7048  1924 pause   S    pts/2  0:00.20 ksh 
  0  8984 12914    0  85  0  7048  1924 ttyraw  I+   pts/3  0:00.19 ksh 
1001 12914 16576  113  85  0  7092  1924 pause   Is   pts/3  0:00.02 -ksh 
  0   694     1    0  85  0  6284  1800 ttyraw  Is+  ttyE0  0:00.01 /usr/libexec/getty std.9600 ttyE0 
+mercy$ ^C -axl
+mercy$ vmstat -s
    4096 bytes per page
       1 page color
  250844 pages managed
    2222 pages free
  140041 pages active
   69363 pages inactive
       0 pages paging
    2730 pages wired
       0 zero pages
       1 reserve pagedaemon pages
       5 reserve kernel pages
    4915 boot kernel pages
   35578 kernel pool pages
   23598 anonymous pages
  164759 cached file pages
   23777 cached executable pages
     256 minimum free pages
     341 target free pages
   83614 maximum wired pages
       1 swap devices
 1048697 swap pages
       0 swap pages in use
       0 swap allocations
225500305 total faults taken
228149211 traps
156287160 device interrupts
50546946 CPU context switches
63050311 software interrupts
       0 system calls
       0 pagein requests
       0 pageout requests
       0 pages swapped in
       0 pages swapped out
  392258 forks total
    9727 forks blocked parent
    9727 forks shared address space with parent
       0 pagealloc zero wanted and avail
32805864 pagealloc zero wanted and not avail
       0 aborts of idle page zeroing
80964508 pagealloc desired color avail
       0 pagealloc desired color not avail
80964508 pagealloc local cpu avail
       0 pagealloc local cpu not avail
       0 faults with no memory
       0 faults with no anons
       0 faults had to wait on pages
       0 faults found released page
   15090 faults relock (15090 ok)
52280209 anon page faults
       0 anon retry faults
29181270 amap copy faults
29921894 neighbour anon page faults
421666821 neighbour object page faults
137180816 locked pager get faults
   14978 unlocked pager get faults
30782693 anon faults
21465518 anon copy on write faults
118058653 object faults
19122163 promote copy faults
32802454 promote zero fill faults
    5143 times daemon wokeup
    5143 revolutions of the clock hand
 5724141 pages freed by daemon
 6745369 pages scanned by daemon
       0 anonymous pages scanned by daemon
 5724141 object pages scanned by daemon
   33115 pages reactivated
     233 pages found busy by daemon
       0 total pending pageouts
 6899800 pages deactivated
314091251 total name lookups
287646312 good hits
20255856 negative hits
    4097 bad hits
       0 false hits
 5508051 miss
  676935 too long
 3818493 pass2 hits
 3889436 2passes
         cache hits (91% pos + 6% neg) system 1% per-process
         deletions 0%, falsehits 0%, toolong 0%

+mercy$ 



Home | Main Index | Thread Index | Old Index