port-amd64: 32 bit linux java

Subject: 32 bit linux java
To: None <port-amd64@netbsd.org>
From: Arto Huusko <arto.huusko@pp2.inet.fi>
List: port-amd64
Date: 11/06/2007 12:32:05

Hello,

I've been trying to analyse a problem I'm seeing with 32-bit Linux
Java on NetBSD/amd64 (4.99.34, with the recent linux32 changes), but
I'm not getting very far.

The problem I'm now trying to resolve is that an ant compilation task
hangs (there are other problems too, such as occasional memory
faults). When the hang occurs, I see two java processes eating up all
CPU time. They don't respond to SIGTERM, but SIGKILL does work. Both
processes eat mostly system time (about 90% sys, 10% user). The
machine is dual core, and each process sits on its own CPU. The
states are RUN/0 and CPU/1.

I ktraced the ant job, and found that one of the processes (both
are cloned, so they are java threads) is in a tight loop calling
sched_yield, while the other process is apparently not doing
anything at all.

The last line for the "idle" process in the ktrace dump is a return
from a syscall, not a call to one. I did a few runs, and the syscall
it returns from is different every time. It looks like the process
just stops doing work at a random location, and that the system time
it is eating is not due to a system call it made.
Maybe the process is being switched to, and some work done in the
kernel before giving control to userland is looping. Or the kernel
won't switch to the process, but since there are no other ready
processes, it keeps on trying the same one.
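
The check described above (is the final ktrace record of each process
a syscall call or a return?) could be automated roughly as follows.
This is only a sketch in Python, assuming the kdump output has been
saved as text and that each record line starts with the pid and
contains a record type such as CALL or RET somewhere after it. The
function name and the sample lines are made up for illustration; they
are not taken from the real trace.

```python
# Record types assumed to appear in kdump output; extend as needed.
RECORD_TYPES = {"CALL", "RET", "NAMI", "GIO", "PSIG", "CSW", "EMUL"}


def last_record_per_process(lines):
    """Return {pid: (record_type, detail)} for the final record of each pid.

    Hypothetical helper: assumes the first field of a record line is the
    numeric pid and that the record type appears in a later column.
    """
    last = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # blank line or continuation of a previous record
        pid = int(fields[0])
        # Scan for the record-type column instead of using a fixed index,
        # so an extra column (for example a thread id) does not matter.
        for i, tok in enumerate(fields[1:], start=1):
            if tok in RECORD_TYPES:
                last[pid] = (tok, " ".join(fields[i + 1:]))
                break
    return last


# Invented sample in the assumed format, not real trace data.
SAMPLE = """\
  501 java     CALL  sched_yield
  501 java     RET   sched_yield 0
  502 java     CALL  read(0x5,0xbfbfe000,0x400)
  502 java     RET   read 1024/0x400
  501 java     CALL  sched_yield
"""

result = last_record_per_process(SAMPLE.splitlines())
for pid, (kind, detail) in sorted(result.items()):
    print(pid, kind, detail)
```

A process whose last record is RET and which then produces no further
records, while still consuming CPU, would match the "idle" process
described above.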

I also checked the ktrace dump to see whether some other process had
sent a signal to the "idle" process that was not caught, but there
was none.

Another interesting point is that the ant job sometimes finishes
correctly, which suggests a race condition.


If there is more information I can dig up, just tell me what to
do. I am able to use ddb to analyse this further, but couldn't
figure out by myself anything useful to do with it.