port-arm32: Re: 'program cc1 got fatal signal 11'

Subject: Re: 'program cc1 got fatal signal 11'
To: William Gallafent <William.Gallafent@comlab.ox.ac.uk>
From: Neil A. Carson <neil@causality.com>
List: port-arm32
Date: 04/27/1998 16:58:46

Greetings,

I'll reply to all of these messages in one go as there are rather a
lot... I'm sure Chris at DEC can clarify exactly when things
appeared/went, but...

Rev J and K StrongARMs had the aforementioned STM^ bug, which is
reported at boot time and has been worked around in the kernel and FPE
code. From memory, it happens when a cache line fill is completing (or
a write buffer drain is happening, I can't remember which) and can
cause the update register to get trashed when a hat is used. Acorn
worked round that the same way as us; it wasn't a serious problem. The
bug that I believe people are hitting with the compiler set (I'll
outline later why I think this is the bug, and I could be wrong) will
affect RiscBSD (and I guess Linux as well) but _not_ RiscOS, which I
guess is why Acorn never told anyone about it, although it is in the
StrongARM errata now.

The problem comes when there's an LDM instruction on the end of a page
and the next page is not mapped. Can't remember if something also has to
be happening with the write buffer, or if there has to be writeback on a
register, or not. But that's basically it, anyway. When these
conditions are met, an often unrecoverable fault is taken during the
instruction, which shouldn't happen.
Certainly with the Revision S StrongARMs this has been ironed out, but I
know it's there in the Rev J and Rev K silicon (look at the label on the
chip---on my Shark the chip says SA110-S in the middle). Sometimes this
will result in a binary banging out, and sometimes it will result in the
machine just slowing down a lot whilst it faults and faults and faults.
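
To make the "LDM on the end of a page" condition concrete, here is a
minimal sketch of a scanner that lists LDM instructions sitting in the
last word of a page. It is only an illustration, not anybody's actual
tool. The assumptions are mine: 4 KB pages, little-endian code,
"end of a page" meaning the final 32-bit word, and a hypothetical
`base_addr` giving the address the code is loaded at. The exact trigger
for the erratum (write buffer state, writeback) is not checked, since
the details above are from memory.

```python
import struct

PAGE_SIZE = 4096  # assumption: 4 KB pages


def is_ldm(word):
    """True if `word` encodes an ARM LDM (block data transfer, load)."""
    cond = word >> 28
    # Bits 27..25 == 0b100 select block data transfer; bit 20 (L) set means load.
    # Condition 0b1111 (NV) is skipped because it never executes on ARMv4.
    return cond != 0xF and (word & 0x0E100000) == 0x08100000


def find_page_end_ldms(code, base_addr=0, page_size=PAGE_SIZE):
    """Return addresses of LDMs that occupy the last word of a page.

    `code` is raw little-endian ARM code; `base_addr` (hypothetical) is the
    address of its first byte once loaded.
    """
    hits = []
    for offset in range(0, len(code) - 3, 4):
        addr = base_addr + offset
        if addr % page_size != page_size - 4:
            continue  # not the final word of a page
        (word,) = struct.unpack_from("<I", code, offset)
        if is_ldm(word):
            hits.append(addr)
    return hits
```

A hit only means the instruction is a candidate: whether it actually
faults also depends on the next page being unmapped at the time, which
a static scan cannot know.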

Workarounds: As you can see, there aren't really any. One way would be
to assemble the instruction twice and NOP one of them out. Another way
would be to patch the binary afterwards (Pete Burwood wrote something to
do this) but very occasionally this would not be possible to do. The
best thing now, I guess, is to shuffle the LDMs around in the binary
and get lucky, which is what's happened everywhere so far :-)
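
As a rough illustration of the "assemble the instruction twice and NOP
one of them out" idea, the sketch below takes code in which each LDM
has been emitted twice in a row and replaces one copy of each pair with
a NOP, choosing the copy so that the surviving LDM is never in the last
word of a page. This is just one possible reading of that workaround,
not an existing tool. The 4 KB page size, little-endian layout, the
`base_addr` parameter and the use of MOV r0, r0 as the NOP are all
assumptions.

```python
import struct

PAGE_SIZE = 4096      # assumption: 4 KB pages
ARM_NOP = 0xE1A00000  # MOV r0, r0, used here as the no-op


def is_ldm(word):
    """True if `word` encodes an ARM LDM (block data transfer, load)."""
    return (word >> 28) != 0xF and (word & 0x0E100000) == 0x08100000


def nop_out_duplicates(code, base_addr=0, page_size=PAGE_SIZE):
    """Keep one copy of every back-to-back duplicated LDM.

    If the first copy is the last word of a page it is replaced by a NOP,
    otherwise the second copy is, so the LDM left behind never ends a page.
    """
    out = bytearray(code)
    offset = 0
    while offset + 8 <= len(out):
        first, second = struct.unpack_from("<II", out, offset)
        if is_ldm(first) and first == second:
            at_page_end = (base_addr + offset) % page_size == page_size - 4
            victim = offset if at_page_end else offset + 4
            struct.pack_into("<I", out, victim, ARM_NOP)
            offset += 8  # skip past the pair just handled
        else:
            offset += 4
    return bytes(out)
```

Note that this treats every word as an instruction, so literal data in
the text segment that happens to look like a doubled LDM would be
mangled; a real tool would need to know where the code is.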

Note that some of the compiler sets in the past were legitimately
broken. However, 1.3.1 works just fine. The reason we suspect the 1.3.1
compiler to be 'unlucky' is that Simon Levett (who's working on a fine
software port for RiscBSD, but I'll leave him to spill the beans!) was
trying to rebuild his software under 1.3.1 and complained of the
compiler bombing out. Mark tried this on his systems and he'd been
building hundreds of thousands of lines of code fine! We then suggested
Simon pop the 610 back in, whereupon the 'broken compiler' started
working fine!

So, there could be another force at work with something that's just
broken with StrongARM on the version of the OS that people other than us
are using, but I doubt it very much.

It was unclear for a while whether the presence of this bug was an NDA
thingy or not. But these chips are now so old, and DEC have disclosed
the information to others without NDAs on several occasions, that I'd
now consider that people who have a need to know (i.e. those with
crashing compilers!) might as well know. Probably best not to spread
the information too much, please, though, as it'll just frighten RiscOS
users needlessly.

I actually pointed this out in a newsgroup a long time ago, but I guess
it was lost in the wash. In fact, I have a hypothesis that when we move
over to UVM this problem may become a fair bit less severe, but we'll
have to wait and see.

	Regards,

	Neil