Port-macppc archive


Re: "Stalled" system and runaway forking of 'cron' processes on macppc-6.99.28



This sounds *very* familiar. I believe it is an OS bug, and that it
lives in the architecture-independent code.

I can reproduce these stalls, and if you break into the debugger, you will
find many processes stopped on TSTILE (the turnstile wait channel, i.e.
blocked waiting for a kernel lock). This appears to me to be a locking
problem. An interrupt handler grabbing a lock, perchance?

I have been unable to nail it down, partly because it is hard to reproduce
on demand. I have a test case or two, but they are intermittent and
sometimes take hours to trigger.

I see two separate scenarios: the "hard" hang and the "soft" one.

The "hard" hang always involves a local ATA disk. It looks like what you
describe. Stuff is still running, pings get responses, and TCP connects
work, but no new procs can start. If you break into the kernel debugger,
you see everybody (a lot of procs) waiting on TSTILE.

The "soft" hang is the same, with one difference: the disk activity
involved goes over NFS, and is therefore interruptible. I have a
workaround in place where a script watches for "hung" processes and
whacks them upside the head with a signal if they don't respond.

It's not pretty, but it keeps my web servers up.
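
For anyone who wants to try the same kind of workaround, here is a minimal
sketch of such a watchdog. It is not my production script, just an
illustration under stated assumptions: `probe` and `list_pids` are
hypothetical callables you would supply yourself (for example, a probe that
makes a request with a timeout), and SIGTERM is only an assumed default
signal.

```python
import os
import signal
import time


def watchdog_pass(pids, probe, sig=signal.SIGTERM, kill=os.kill):
    """Run one pass: signal each PID whose probe fails.

    `probe` is a hypothetical caller-supplied callable that takes a PID and
    returns True if that process still responds. Returns the PIDs signalled.
    """
    signalled = []
    for pid in pids:
        if probe(pid):
            continue  # process answered, leave it alone
        try:
            kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            # The process exited between the probe and the signal.
            pass
    return signalled


def watch(list_pids, probe, interval=60.0, sig=signal.SIGTERM):
    """Repeat watchdog_pass forever, sleeping `interval` seconds between passes.

    `list_pids` is a hypothetical callable returning the PIDs to check.
    """
    while True:
        watchdog_pass(list_pids(), probe, sig=sig)
        time.sleep(interval)
```

Note that this only helps with the "soft" case: a signal can break a process
out of an interruptible NFS wait, but it will not free one stuck in the
"hard" hang. The watchdog also has to be running already when the stall
hits, since no new processes can start.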

The test cases are all Apache 2 exercise scripts. In production, this
appears to happen frequently when my web server hits heavy load (dozens
of requests a second, sustained).

I have never caught the NFS hang in the debugger, so I don't KNOW that
it's the same, but my nose says they are close cousins.

I would be happy to help debug this if someone would provide some expertise.
My kernel debugger skills on PPC are minimal.

-dgl-

P.S. There is also a bug that I have confirmed in the PPC kernel that was
caused by a mod applied between
>> -D 2011-05-01 looks clean
>> -D 2011-05-03 looks bad.
That bug causes the statistics reported by analog (apache log file
analyzer) to be corrupted. I have suspicions that this bug is also the
source of user-level instability on the PPC port.

There is one mod from Matt Thomas to -current on May 1, 2011, that
re-works a bunch of CPU handling on PPC, and it looks to me like it
has a serious bug. For more detail, look in this list around March 2013
for the thread "Confidence: Chopping between 5.2 and 6.0.1".


>For the last several weeks, I've been observing the following behavior
>on a diskless macppc system (PowerBook G4 667 "TiBook") with native
>Xorg:
>
>some rxvt-unicode windows stop updating/repainting text (need to observe
>with in-tree xterm).
>
>rapid updating text in rxvt-unicode window appears to crash that process.
>(again, need to observe with in-tree xterm).
>
>system appears to stall:  idle shell windows will redraw prompt, but not
>fork additional processes.
>
>Any interactive applications already running continue to function
>(ssh sessions, web browsers, etc.)
>
>Screen savers (xlock, xscreensaver) will activate, but authenticating
>to unlock the screen hangs if successful (if unsuccessful, screen saver
>resumes as normal).
>
>To try to see what might be going on, I kept a terminal window running
>'top' and what I observed is an ever-increasing number of processes
>running 'cron'.  When I first noticed, there were about 23 of them.
>They appear to add 2 processes about every 10 minutes.  As a result, my
>'top' display now looks like:
>
>load averages:  0.05,  0.04,  0.00;               up 0+18:37:21        12:27:57
>229 processes: 92 runnable, 135 sleeping, 1 stopped, 1 on CPU
>CPU states:  2.6% user,  0.0% nice,  0.6% system,  0.0% interrupt, 96.8% idle
>Memory: 205M Act, 96K Inact, 17M Wired, 26M Exec, 63M File, 725M Free
>Swap: 1024M Total, 1024M Free
>
>  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
> 1107 jdbaker   85    0    44M   36M select     8:25  0.00%  0.00% links
>  887 root      85    0    56M   42M select     6:31  0.00%  0.00% Xorg
>    0 root      96    0     0K 7720K sopendfr   1:24  0.00%  0.00% [system]
>  378 jdbaker   43    0  4740K 2564K CPU        0:44  0.00%  0.00% top
>  842 jdbaker   85    0    12M 6064K select     0:42  0.00%  0.00% xclock
> 1484 jdbaker   85    0    23M   16M select     0:29  0.00%  0.00% xv
>  134 jdbaker   85    0  6980K 2976K select     0:14  0.00%  0.00% FvwmPager
>  947 jdbaker   85    0  8532K 4172K select     0:07  0.00%  0.00% fvwm
>  280 ntpd      85    0  8228K 7944K pause      0:05  0.00%  0.00% ntpd
>  690 jdbaker   85    0    27M 8844K kqueue     0:04  0.00%  0.00% urxvt
> 1214 jdbaker   85    0    27M 9080K kqueue     0:03  0.00%  0.00% urxvt
>  389 jdbaker   85    0  8924K 3412K select     0:03  0.00%  0.00% xload
>   97 jdbaker   85    0  6908K 2936K select     0:03  0.00%  0.00% FvwmIconMan
>  357 root      85    0  7932K 9404K select     0:02  0.00%  0.00% amd
>  699 root      85    0    11M 2488K kqueue     0:02  0.00%  0.00% master
> 1590 jdbaker   85    0    27M 9060K kqueue     0:01  0.00%  0.00% urxvt
> 1814 jdbaker   85    0    10M 4840K select     0:01  0.00%  0.00% ssh
>  323 jdbaker   85    0  8924K 3412K select     0:01  0.00%  0.00% xbiff
>21592 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>21401 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>23121 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22351 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 1609 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22291 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22019 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22139 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 1017 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 1271 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>21749 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 1268 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>23152 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22126 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>21996 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>  481 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>23519 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 3037 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>24028 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>23126 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 1108 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>11720 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>12998 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22468 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>  707 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>  831 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>23997 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>  827 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>20793 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>22071 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
> 1717 root       0    0     0K    0K RUN        0:00  0.00%  0.00% cron
>
>I don't know whether this is just a symptom or a cause of the "stall".
>Most of the time, the system indicates 100% idle (the 96.8% above likely
>is due to the clipboard operation to copy/paste the display).
>
>I need to get a better picture of just how soon the behavior starts.
>Right now, I don't think it's related to high-load activity such as
>the nightly maintenance operations as I think it started before then.
>
>An interesting side effect seems to be that in this state, the mouse
>pointer remains functional (before this started happening, it would lose
>one or both directions, behaving as though it were constantly being moved
>to the top, left or top-left of the display after about an hour).
>
>-- 
>|/"\ John D. Baker, KN5UKS               NetBSD     Darwin/MacOS X
>|\ / jdbaker[snail]mylinuxisp[flyspeck]com    OpenBSD            FreeBSD
>| X  No HTML/proprietary data in email.   BSD just sits there and works!
>|/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645


