Port-macppc archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: TSTILE lockups on disk dump/backups



Hi,

Can you please try kernel from 10.1 release (or -current)?

https://ftp.netbsd.org/pub/NetBSD/NetBSD-10.1/macppc/binary/sets/

It should be working with 9.x userland.

For netbsd-9, pmap for powerpc/oea is known to be broken.
But, unfortunately, it exceeds our capacity to pull up
fixes made to -current/netbsd-10 at the moment...

Thanks,
rin

On 2024/12/31 14:34, Don Lee wrote:
Another instance of the tstile hang with dump/gzip/nfs

ps -axl and vmstat-s below: (machine name is mercy)

+mercy$ ps -axl
   UID   PID  PPID   CPU PRI NI   VSZ   RSS WCHAN   STAT TTY       TIME COMMAND
     0     0     0     0 125  0     0 36844 uvm     DKl  ?      8:23.50 [system]
     0     1     0     0  85  0  6644  1688 wait    Is   ?      0:00.32 init
     0    38     1     0  85  0  6532  1808 nanoslp I    ?      0:00.12 /usr/libexec/getty std.9600 ttyE2
     0   107   367  1317  85  0 19328  7380 select  Is   ?      0:00.37 sshd: donlee [priv]
     0   172     1     0  85  0 12712  2428 kqueue  Ss   ?      0:14.56 /usr/sbin/syslogd
     0   202     1     0  85  0 30816 13452 kqueue  Is   ?      0:12.95 /usr/pkg/sbin/named
     0   255     1     0  85  0 12648 11688 pause   Ss   ?      0:08.38 /usr/sbin/ntpd -p /var/run/ntpd.pid
     0   343     1     0  85  0  6624  1568 select  Is   ?      0:00.00 /usr/sbin/lpd -s
     0   367     1     0  85  0 18528  3048 select  Is   ?      0:09.63 sshd: /usr/sbin/sshd [listener] 0 of 10-100 sta
     0   553     1   270  85  0  6092  1684 ttyraw  I    ?      0:00.00 /usr/libexec/getty std.38400 ttyZ0
     0   565     1     0  85  0  6532  1808 nanoslp I    ?      0:00.05 /usr/libexec/getty std.9600 ttyE1
     0   566     1     0  85  0  6528  1656 kqueue  Is   ?      0:00.04 /usr/sbin/inetd -l
  1043   597   659     0  85  0 19812  5232 select  I    ?      0:00.30 sshd: screview
  1001   598   107     0  85  0 19872  5268 select  I    ?      0:00.70 sshd: donlee@pts/0 (sshd)
     0   604     1     0  85  0  7276  1752 nanoslp Ss   ?      0:00.32 /usr/sbin/cron
     0   613     1     0  85  0  6532  1808 nanoslp I    ?      0:00.05 /usr/libexec/getty std.9600 ttyE3
     0   635     1     0  85  0 17136  2764 kqueue  Ss   ?      0:01.65 /usr/libexec/postfix/master -w
    12   654   635     0  85  0 17240  4980 kqueue  I    ?      0:00.25 qmgr -l -t unix -u
     0   656     1     0  85  0 28428 10672 kqueue  Is   ?      0:00.21 /usr/pkg/sbin/lwresd
     0   659   367   457  85  0 19212  7252 select  Is   ?      0:00.25 sshd: screview [priv]
   999  3231  3916     0  85  0 36192  4228 socket  I    ?      0:00.07 /usr/pkg/sbin/httpd -k start
    12  3877   635     0  85  0 16984  4848 kqueue  I    ?      0:00.03 pickup -l -t unix -u
     0  3916     1     0  85  0 97904 20532 select  Ss   ?      0:03.20 /usr/pkg/sbin/httpd -k start
     0  4184 24698  1966  75 20 40640 33848 pause   IN   ?      0:14.70 dump -t -0 -u -f - -L usr0-241230 -a /usr
     0  5695   367     0  85  0 19264  7440 select  Is   ?      0:00.26 sshd: donlee [priv]
   999  6831  3916     0  85  0 98944 10548 semwait I    ?      0:00.09 /usr/pkg/sbin/httpd -k start
     0  7068 13979     0  85  1  5768  1280 pipe_rd IN   ?      0:00.01 tee -a /var/log/dump.log
   999  7083  3916     0  85  0 98384  9340 kqueue  S    ?      0:00.05 /usr/pkg/sbin/httpd -k start
     0  7120 13979 30971  77  1  6780  1396 pause   IN   ?      0:00.01 /bin/ksh /root/tapedump/dodumps
  1001  7261  5695     0  85  0 19776  5276 select  I    ?      0:00.46 sshd: donlee@pts/2 (sshd)
     0  7427   323     8  85  0  6128  1548 nanoslp S    ?      0:00.00 sleep 150
     0  7719   367     0  85  0 19264  7440 select  Is   ?      0:00.25 sshd: donlee [priv]
   999  8401  3916     0  85  0 98416 10272 semwait I    ?      0:00.09 /usr/pkg/sbin/httpd -k start
     0  9902 13979     0  85  1  8252  2596 pipe_rd IN   ?      0:00.01 mail -s mercy weekly remote backup for 241230 r
     0 13979 29553    81  85  1  6772  1812 pause   IN   ?      0:00.01 /bin/ksh /root/tapedump/dodumps
     0 17715  7120 26444  69 20 40640 35328 wait    IN   ?      0:01.52 dump -t -0 -u -f - -L usr0-241230 -a /usr
   999 20660  3916     0  85  0 98416 10280 semwait I    ?      0:00.09 /usr/pkg/sbin/httpd -k start
  1001 21854  7719     0  85  0 19776  5276 select  S    ?      0:00.57 sshd: donlee@pts/1 (sshd)
     0 22564 24698  2757  75 20 40640 33848 pipe_wr IN   ?      0:14.75 dump -t -0 -u -f - -L usr0-241230 -a /usr
     0 24698 17715  1430  75 20 40764 31228 netio   IN   ?      0:05.81 dump -t -0 -u -f - -L usr0-241230 -a /usr
   999 25784  3916     0  85  0 99184 11072 semwait I    ?      0:00.26 /usr/pkg/sbin/httpd -k start
     0 25989  7120 31728 109  1  7144  2108 tstile  DN   ?      6:36.25 gzip -c
     0 28732 24698  2313  75 20 40640 33848 pause   IN   ?      0:14.90 dump -t -0 -u -f - -L usr0-241230 -a /usr
   999 29067  3916     0  85  0 98384  9328 semwait I    ?      0:00.05 /usr/pkg/sbin/httpd -k start
     0 29553 19597     0  85  1  6680  1964 wait    INs  ?      0:00.01 sh
   999 29617  3916     0  85  0 98784 10136 semwait I    ?      0:00.02 /usr/pkg/sbin/httpd -k start
   999 29830  3916     0  85  0 99104 11688 semwait I    ?      0:00.41 /usr/pkg/sbin/httpd -k start
   999 29967  3916     0  85  0 99104 11696 semwait I    ?      0:00.33 /usr/pkg/sbin/httpd -k start
  1001   439   598   855  85  0  7092  1924 pause   Is   pts/0  0:00.02 -ksh
     0   786   439     0  85  0  7048  1924 ttyraw  I+   pts/0  0:00.22 ksh
     0  6952 26465     0  85  0  7064  1928 pause   S    pts/1  0:00.21 ksh
     0 25197  6952     0  43  0  6896  1680 -       O+   pts/1  0:00.01 ps -axl
  1001 26465 21854     0  85  0  7108  1928 pause   Is   pts/1  0:00.02 -ksh
     0  2841  6887     0  85  0  8760  2988 ttyraw  I+   pts/2  0:00.20 vi /var/log/dump.log
     0  6887 21829     0  85  0  7048  1924 pause   I    pts/2  0:00.20 ksh
32767 19597     1     0  85  0  6256  1324 wait    I    pts/2  0:00.00 /usr/libexec/atrun
  1001 21829  7261     0  85  0  7092  1924 pause   Is   pts/2  0:00.02 -ksh
     0   323     1     8  85  0  6788  1776 pause   S    ttyE0- 0:00.98 /bin/ksh /usr/local/apache/admin/watch_ap
     0   541     1   270  85  0  6284  1800 ttyraw  Is+  ttyE0  0:00.01 /usr/libexec/getty std.9600 ttyE0
+mercy$ ps -axl | grep tsti
     0 25989  7120 31728 109  1  7144  2108 tstile  DN   ?      6:36.25 gzip -c
+mercy$ vmstat -s
      4096 bytes per page
         1 page color
    250844 pages managed
       885 pages free
    153061 pages active
     67759 pages inactive
         0 pages paging
      2936 pages wired
         0 zero pages
         1 reserve pagedaemon pages
         5 reserve kernel pages
      4915 boot kernel pages
     25412 kernel pool pages
     23900 anonymous pages
    175666 cached file pages
     24190 cached executable pages
       256 minimum free pages
       341 target free pages
     83614 maximum wired pages
         1 swap devices
   1048697 swap pages
         0 swap pages in use
         0 swap allocations
  20473827 total faults taken
  20880765 traps
  13907464 device interrupts
  16647155 CPU context switches
   6573417 software interrupts
         0 system calls
         0 pagein requests
         0 pageout requests
         0 pages swapped in
         0 pages swapped out
     26581 forks total
       779 forks blocked parent
       779 forks shared address space with parent
         0 pagealloc zero wanted and avail
   2320493 pagealloc zero wanted and not avail
         0 aborts of idle page zeroing
   6961321 pagealloc desired color avail
         0 pagealloc desired color not avail
   6961321 pagealloc local cpu avail
         0 pagealloc local cpu not avail
         0 faults with no memory
         0 faults with no anons
         0 faults had to wait on pages
         0 faults found released page
      5108 faults relock (5108 ok)
   9027836 anon page faults
         0 anon retry faults
   1934161 amap copy faults
   2080679 neighbour anon page faults
  27697269 neighbour object page faults
   8951102 locked pager get faults
      5105 unlocked pager get faults
   7566670 anon faults
   1412090 anon copy on write faults
   7688062 object faults
   1263040 promote copy faults
   2320090 promote zero fill faults
      1056 times daemon wokeup
      1056 revolutions of the clock hand
   1173219 pages freed by daemon
   1432806 pages scanned by daemon
         0 anonymous pages scanned by daemon
   1173219 object pages scanned by daemon
      7994 pages reactivated
        18 pages found busy by daemon
         0 total pending pageouts
   1734493 pages deactivated
  20226193 total name lookups
  18536033 good hits
   1298945 negative hits
       522 bad hits
         0 false hits
    348649 miss
     42044 too long
    251453 pass2 hits
    256376 2passes
           cache hits (91% pos + 6% neg) system 1% per-process
           deletions 0%, falsehits 0%, toolong 0%

+mercy$

-dgl-

On Dec 21, 2024, at 9:21 AM, Martin Husemann <martin%duskware.de@localhost> wrote:

On Sat, Dec 21, 2024 at 02:20:23AM -0600, Don Lee wrote:
Any thought about how I should deal with this? My kernel debugging
skills are weak. Is there anything I can do to produce useful debug
information next time this happens?

A tstile just means the process/kernel thread is blocked for a longer
time waiting for "something". In this context that would typically be
either a disk buffer ("buf") read from disk (the source you are backing
up) or a network buffer ("mbuf") handed to the network device output
queue.

This would point to a bug in the disk driver or the network driver.

The other option would be your system runing out of kernel address space,
so it can not allocate either type of buffer. This would show up in
the output of vmstat -s (or you are running against limits for either
type of buffer, but I would expect that to fail differently).

Can you start new processes in the state when it hangs? Like the vmstat -s
mentioned above, or "crash" to look at kernel state? I am not sure "crash"
would work on macppc and 9.x good enough.

Martin

+mercy$ ps -axl
UID   PID  PPID  CPU PRI NI   VSZ   RSS WCHAN   STAT TTY       TIME COMMAND
   0     0     0    0 125  0     0 67804 uvm     DKl  ?     96:15.96 [system]
   0     1     0    0  85  0  6644   284 wait    Is   ?      0:04.78 init
   0    38     1    0  85  0  6532  1808 nanoslp I    ?      0:00.97 /usr/libexec/getty std.9600 ttyE1
   0   172     1    0  85  0 11928  2424 kqueue  Ss   ?      3:46.75 /usr/sbin/syslogd
   0   202     1    0  85  0 31920 15264 kqueue  Is   ?      3:10.08 /usr/pkg/sbin/named
   0   237     1  142  85  0  6624  1568 select  Is   ?      0:00.00 /usr/sbin/lpd -s
   0   373     1    0  85  0 11832 10872 pause   Ss   ?      1:53.24 /usr/sbin/ntpd -p /var/run/ntpd.pid
   0   378     1 1826  85  0 18528  3048 select  Is   ?      2:31.59 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
   0   462     1    0  85  0  6532   732 nanoslp I    ?      0:01.18 /usr/libexec/getty std.9600 ttyE2
   0   560     1    0  85  0 28428 10672 kqueue  Is   ?      0:00.95 /usr/pkg/sbin/lwresd
   0   575     1    0  85  0  6528  1656 kqueue  Is   ?      0:00.19 /usr/sbin/inetd -l
   0   606     1    0  85  0 17136  2740 kqueue  Ss   ?      0:22.73 /usr/libexec/postfix/master -w
  12   635   606    0  85  0 16744  4912 kqueue  I    ?      0:02.88 qmgr -l -t unix -u
   0   648     1    0  85  0  7276  1752 nanoslp Ss   ?      0:04.53 /usr/sbin/cron
   0   786     1  228  85  0  6092    40 ttyraw  I    ?      0:00.00 /usr/libexec/getty std.38400 ttyZ0
   0   805     1    0  85  0  6532  1808 nanoslp I    ?      0:00.67 /usr/libexec/getty std.9600 ttyE3
999   981 24032    0  85  0 98912 11400 semwait S    ?      0:00.60 /usr/pkg/sbin/httpd -k start
1043  1754 15863    0  85  0 19780  5272 select  S    ?      0:03.16 sshd: screview
   0  4206 27475  679  85  0  6128  1548 nanoslp I    ?      0:00.00 sleep 150
  12  4681   606    0  85  0 17008  4916 kqueue  I    ?      0:00.03 proxymap -t unix -u
999  5475 24032    0  85  0 99136 11808 semwait S    ?      0:00.57 /usr/pkg/sbin/httpd -k start
1001  6977 21060    0  85  0 19872  5284 select  S    ?      0:00.48 sshd: donlee@pts/2 (sshd)
999  7535 24032    0  85  0 98752 10756 kqueue  S    ?      0:00.08 /usr/pkg/sbin/httpd -k start
999  8921 24032    0  85  0 99104 11420 semwait S    ?      0:00.54 /usr/pkg/sbin/httpd -k start
1001  8972 27821    0  85  0 19776  5284 select  I    ?      0:00.50 sshd: donlee@pts/0 (sshd)
  12 11260   606    0  85  0 16984  4848 kqueue  S    ?      0:00.03 pickup -l -t unix -u
999 14429 24032  967  85  0 36192  4280 socket  I    ?      0:00.07 /usr/pkg/sbin/httpd -k start
   0 15863   378   22  85  0 19212  7340 select  Is   ?      0:00.25 sshd: screview [priv]
   0 15888   378  960  85  0 19232  7424 select  Is   ?      0:00.36 sshd: donlee [priv]
  12 16463   606    0  85  0 17816  5412 kqueue  I    ?      0:00.03 smtpd -n smtp -t inet -u -o stress= -s 2
1001 16576 18110    0  85  0 19872  5280 select  I    ?      0:00.33 sshd: donlee@pts/3 (sshd)
   0 18110   378 4580  84  0 19328  7440 select  Is   ?      0:00.37 sshd: donlee [priv]
  12 18451   606    0  85  0 16976  4832 kqueue  S    ?      0:00.03 anvil -l -t unix -u
999 18588 24032    0  85  0 98608 10588 semwait S    ?      0:00.08 /usr/pkg/sbin/httpd -k start
   0 21060   378 4284  84  0 19328  7440 select  Is   ?      0:00.38 sshd: donlee [priv]
1001 22743 15888    0  85  0 19776  5280 select  I    ?      0:00.79 sshd: donlee@pts/1 (sshd)
999 23402 24032    0  85  0 98752 10820 semwait S    ?      0:00.27 /usr/pkg/sbin/httpd -k start
   0 24032     1    0  85  0 97904 22004 select  Ss   ?      0:03.20 /usr/pkg/sbin/httpd -k start
   0 27475     1  679  85  0  6788  1840 pause   I    ?      0:01.78 /bin/ksh /usr/local/apache/admin/watch_ap
   0 27821   378  263  85  0 19232  7428 select  Is   ?      0:00.37 sshd: donlee [priv]
999 29660 24032    0  85  0 98752 10716 semwait S    ?      0:00.31 /usr/pkg/sbin/httpd -k start
1001  5918  8972    0  85  0  7092  1924 ttyraw  Is+  pts/0  0:00.03 -ksh
   0    80     1    0  76 19 40640 30044 pause   IN+  pts/1  0:00.44 dump -t -0 -u -f - -L usr0-241229 -a /usr
   0  3132 10445    0  85  0  5768  1280 pipe_rd I+   pts/1  0:00.01 tee -a /var/log/dump.log
1001  5899 22743    0  85  0  7128  1936 pause   Is   pts/1  0:00.02 -ksh
   0  7387 10445    0  85  0  6780  1376 pause   I+   pts/1  0:00.01 /bin/ksh ./dodumps
   0 10445 16554    0  85  0  6772  1836 pause   I+   pts/1  0:00.01 /bin/ksh ./dodumps
   0 16554  5899    0  85  0  7048  1928 pause   I    pts/1  0:00.33 ksh
   0 21288  7387    0 117  0  7144  2108 tstile  D+   pts/1  0:10.81 gzip -c
   0 29716 10445    0  85  0  8508  2788 pipe_rd I+   pts/1  0:00.01 mail -s mercy weekly remote backup for 241229 root
1001 21766  6977    0  85  0  7092  1924 pause   Is   pts/2  0:00.03 -ksh
   0 26577 29993    0  43  0  6896  1680 -       O+   pts/2  0:00.01 ps -axl
   0 29993 21766    0  85  0  7048  1924 pause   S    pts/2  0:00.20 ksh
   0  8984 12914    0  85  0  7048  1924 ttyraw  I+   pts/3  0:00.19 ksh
1001 12914 16576  113  85  0  7092  1924 pause   Is   pts/3  0:00.02 -ksh
   0   694     1    0  85  0  6284  1800 ttyraw  Is+  ttyE0  0:00.01 /usr/libexec/getty std.9600 ttyE0
+mercy$ ^C -axl
+mercy$ vmstat -s
     4096 bytes per page
        1 page color
   250844 pages managed
     2222 pages free
   140041 pages active
    69363 pages inactive
        0 pages paging
     2730 pages wired
        0 zero pages
        1 reserve pagedaemon pages
        5 reserve kernel pages
     4915 boot kernel pages
    35578 kernel pool pages
    23598 anonymous pages
   164759 cached file pages
    23777 cached executable pages
      256 minimum free pages
      341 target free pages
    83614 maximum wired pages
        1 swap devices
  1048697 swap pages
        0 swap pages in use
        0 swap allocations
225500305 total faults taken
228149211 traps
156287160 device interrupts
50546946 CPU context switches
63050311 software interrupts
        0 system calls
        0 pagein requests
        0 pageout requests
        0 pages swapped in
        0 pages swapped out
   392258 forks total
     9727 forks blocked parent
     9727 forks shared address space with parent
        0 pagealloc zero wanted and avail
32805864 pagealloc zero wanted and not avail
        0 aborts of idle page zeroing
80964508 pagealloc desired color avail
        0 pagealloc desired color not avail
80964508 pagealloc local cpu avail
        0 pagealloc local cpu not avail
        0 faults with no memory
        0 faults with no anons
        0 faults had to wait on pages
        0 faults found released page
    15090 faults relock (15090 ok)
52280209 anon page faults
        0 anon retry faults
29181270 amap copy faults
29921894 neighbour anon page faults
421666821 neighbour object page faults
137180816 locked pager get faults
    14978 unlocked pager get faults
30782693 anon faults
21465518 anon copy on write faults
118058653 object faults
19122163 promote copy faults
32802454 promote zero fill faults
     5143 times daemon wokeup
     5143 revolutions of the clock hand
  5724141 pages freed by daemon
  6745369 pages scanned by daemon
        0 anonymous pages scanned by daemon
  5724141 object pages scanned by daemon
    33115 pages reactivated
      233 pages found busy by daemon
        0 total pending pageouts
  6899800 pages deactivated
314091251 total name lookups
287646312 good hits
20255856 negative hits
     4097 bad hits
        0 false hits
  5508051 miss
   676935 too long
  3818493 pass2 hits
  3889436 2passes
          cache hits (91% pos + 6% neg) system 1% per-process
          deletions 0%, falsehits 0%, toolong 0%

+mercy$




Home | Main Index | Thread Index | Old Index