port-arm32: High %age in sys on SA

Subject: High %age in sys on SA
To: None <port-arm32@NetBSD.ORG>
From: Peter Burwood <riscbsd@arcangel.dircon.co.uk>
List: port-arm32
Date: 12/21/1996 21:48:19

Hi,

I'm trying to get GNAT working properly on the SA and have just come
across another problem. Initially, I couldn't use the 3.03 version that
I released because I would get 'Data abort: Translation fault' on the SA
but not with the ARM 6. GNAT was built under RiscBSD 1.1, so I thought
that might be the problem. Upgrading from SA-4740 to SA-4871 (with the
bug-fixed FPE) did not help as I thought it would (sorry Mark). There is no
self-modifying code in my version of the compiler (the released version
would very rarely generate a trampoline on the stack which wouldn't work
on the SA).

So, I rebuilt GNAT under RiscBSD 1.2 on the ARM 6 and that worked
fine. I then tested GNAT on the SA, and some of the source that the SA
had failed on before now compiled. (Slight possibility that this was
related to using 2.7.2.1 instead of 2.7.2, GNAT 3.04 instead of 3.03,
and not using -fomit-frame-pointer - okay, a lot of changes, but when a
stage3 compiler takes 12 hours+ to build and the old one was perfectly
stable under ARM 6, I thought I'd start the move to the current version).

However, I have now noticed that some compilation times are massive on
the SA, but normal on the ARM 6. A simple source file might take minutes
to compile instead of 1-2 seconds real time. Looking at the system with
'systat vmstat' showed that the process was using 95%+ cpu, with the
process being in sys 90%+ of the time. For example, one file took more
than 10 minutes on the SA when the ARM 6 would compile it in under 30
seconds.
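
To put a number on the user/sys split for one compile, rather than reading
it off 'systat vmstat', something like the following could be used. It is
only a sketch: cpu_split and sys_fraction are made-up names, and it assumes
a Python interpreter with the standard resource module is available on the
machine.

```python
import resource
import subprocess


def cpu_split(argv):
    """Run argv to completion; return (user_seconds, sys_seconds) it used."""
    # RUSAGE_CHILDREN accumulates over all waited-for children, so take
    # the difference around this one run.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=False)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def sys_fraction(user, system):
    """Fraction of the consumed CPU time that was spent in the kernel."""
    total = user + system
    return system / total if total > 0 else 0.0
```

Calling cpu_split with the compiler command line for a 'slow' file and for
a 'fast' one, then comparing sys_fraction for each, would show whether the
extra time really is all in sys.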

I tried using gdb on the process (by specifying the program name and the
pid of the process), but that didn't help me track down the problem. All
it did was print lots and lots of pmap_nightmare messages whenever I
tried looking at anything.

Dec 21 18:00:46 arcangel /netbsd: pmap_nightmare: w=0 p=2 va=f2d27000 c=0
Dec 21 18:01:09 arcangel last message repeated 92 times
Dec 21 18:03:18 arcangel last message repeated 147 times

Though, that might just be a problem with gdb being beta.
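
The 'last message repeated' lines hide how many pmap_nightmare messages
there really were. A small sketch for totalling them from the log follows.
count_messages is a hypothetical helper, and it assumes each 'repeated'
line refers to the message logged immediately before it, which is how
syslogd collapses duplicates.

```python
import re

# Matches syslogd's "last message repeated N times" summary lines.
REPEAT_RE = re.compile(r"last message repeated (\d+) times?")


def count_messages(lines, needle="pmap_nightmare"):
    """Count occurrences of needle, expanding syslogd repeat summaries."""
    total = 0
    last_matched = False  # did the most recent real message contain needle?
    for line in lines:
        m = REPEAT_RE.search(line)
        if m:
            # A summary line counts only if it repeats a matching message.
            if last_matched:
                total += int(m.group(1))
            continue
        last_matched = needle in line
        if last_matched:
            total += 1
    return total


SAMPLE = [
    "Dec 21 18:00:46 arcangel /netbsd: pmap_nightmare: w=0 p=2 va=f2d27000 c=0",
    "Dec 21 18:01:09 arcangel last message repeated 92 times",
    "Dec 21 18:03:18 arcangel last message repeated 147 times",
]
```

For the three lines quoted above, count_messages(SAMPLE) gives 240
(1 + 92 + 147), all within about two and a half minutes.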

I've subsequently tried running the program under gdb, entirely within
the single process. After waiting for a while (to allow
the process to get into its 'stuck' state), I interrupted the process to
get back to the gdb prompt. Continuing from there, subsequent
interrupts always stopped the program at the same address (0x10276c, in
the middle of a 4k page). Disassembling at that location produced no
pmap_nightmare messages, but stepi produced 3 or 4. After a few of
these, I tried step; a couple of those, producing about 200
pmap_nightmare messages, allowed the program to complete.

My ideas are that the system is thrashing on a page load for some reason
(plenty of free memory though) or perhaps something to do with the MMU's
translation buffers (since there is absolutely no activity on the
disc holding the program). Perhaps there is some bug in the page abort
code for the SA? (Is it late abort like the ARM 7?) Remember that
initially I had a compiler that worked okay on the ARM 6 but would fail
with repeatable Data Aborts on the SA.

A more likely explanation, though one that is equally weird, is that the
system is spending its time thrashing in cache flushing. But why doesn't
this affect all compiles and a lot of other programs?

Note, any of these explanations must take into account that some source
thrown through the compiler doesn't exhibit this problem, but that when
a source file does, it always exhibits the problem. This is why I think
the problem might be page related - triggered by some sections of the
compiler.

Well, a lot of rambling, but maybe it will provoke some thoughts.


Probably unrelated, but I noticed that the SA takes longer to boot than
the ARM 6. Note, for example, the long time the SA sits after it has
printed the Copyright line during the early part of the boot sequence
compared with the delay here with an ARM 6. Perhaps this is just due to
a lot of cache flushing occurring at startup.

regards,
Pete