Subject: Re: I want to rid ugly float load/stores used only for data movement
To: None <tech-toolchain@netbsd.org, port-macppc@netbsd.org,>
From: M L Riechers <mlr@rse.com>
List: tech-toolchain
Date: 02/23/2003 18:13:48
Thank you, Matt, for responding to my post.

First, I'd like to make sure that my comments are taken in the right
vein.  Other than my obvious bias and loathing wrt the use of float
load/stores for data movement, I'm not pushing any agenda here. I
truly am curious about the issues Jason referred to.  I'd just like to
know.

On Sat, 22 Feb 2003 23:33:20 -0800 Matt Thomas <matt@3am-software.com>
responded (taking last things first):

> BTW, do you have a small C snipet that shows this happening?

Errr, no, I don't think so, but it's more than likely that I'm
misunderstanding. I'd be glad to oblige. Could you be a bit more
specific about what's happening that you'd like to see?

All of my surmises were based on what I know of the powerpc kernel
code.  My statistics concerning float load/stores are based on a
counter I put into the kernel floating emulation trap and the emulator
itself.  My conclusions concerning the fact of gcc using float
load/stores are based on my using objdump -d to disassemble some
number of programs and visually inspect the program for what it's
doing.
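
For what it's worth, a hypothetical candidate for the kind of snippet
Matt asks about might look like the sketch below.  It is not taken
from any of the programs I disassembled: the struct and function names
are made up for illustration, and whether gcc really emits an lfd/stfd
pair for it depends on the compiler version and flags, so it would
have to be confirmed with objdump -d.

```c
#include <assert.h>

/* Hypothetical record; the names are illustrative only. */
struct sample {
    int    id;
    double value;   /* 8-byte field that is moved but never computed on */
};

/*
 * Pure data movement: no floating point arithmetic is performed.
 * On 32-bit powerpc a compiler may choose an lfd/stfd pair for the
 * 8-byte field; only the disassembly shows what it really did.
 */
void copy_value(struct sample *dst, const struct sample *src)
{
    dst->id = src->id;
    dst->value = src->value;
}

/* Sanity check that the copy is exact, whatever instructions are used. */
void self_check(void)
{
    struct sample a = { 7, 1.5 };
    struct sample b = { 0, 0.0 };

    copy_value(&b, &a);
    assert(b.id == a.id);
    assert(b.value == a.value);
}
```

Compiling this to an object file and running objdump -d on the result
would show whether copy_value touches the FP registers at all.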

> Floating point register are never saved or restored on a context switch.
> Each processor knows which was the last lwp that used FP (which "owns"
> the FP unit).  If another lwp request uses of FP unit, when the FP lazy
> switch code will save the current FP state into owner's state area,
> revoke the owner, set the owner to the new lwp, and then enable FP in
> the saved-MSR of the lwp.  Now FP instructions won't trap in the new
> owner but will trap in the old lwp.

Yes, precisely, and thank you for the amplification. 
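
To make sure I have the sequence right, here is a toy user-space model
of the lazy switch as Matt describes it.  It is only a sketch under my
own assumptions: the struct layouts, the field names and the functions
fp_trap, fp_write and fp_read are invented for illustration and are
not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NFPR 32

struct lwp {
    int    fp_enabled;      /* models the FP bit in the lwp's saved MSR */
    double fpsave[NFPR];    /* per-lwp FP state area */
    int    traps;           /* FP traps taken by this lwp */
};

struct cpu {
    struct lwp *fp_owner;   /* last lwp that used the FP unit, or NULL */
    double      fpr[NFPR];  /* the live FP registers */
    int         saves;      /* register-set saves performed */
};

/* Models the trap taken when an lwp without FP enabled uses the FP unit. */
void fp_trap(struct cpu *c, struct lwp *l)
{
    l->traps++;
    if (c->fp_owner != NULL && c->fp_owner != l) {
        /* Save the live state into the old owner's area and revoke it. */
        memcpy(c->fp_owner->fpsave, c->fpr, sizeof c->fpr);
        c->fp_owner->fp_enabled = 0;
        c->saves++;
    }
    if (c->fp_owner != l)
        memcpy(c->fpr, l->fpsave, sizeof c->fpr);  /* load new owner */
    c->fp_owner = l;
    l->fp_enabled = 1;      /* later FP use by l no longer traps */
}

/* Models an FP instruction in lwp l that writes register r. */
void fp_write(struct cpu *c, struct lwp *l, int r, double v)
{
    assert(r >= 0 && r < NFPR);
    if (!l->fp_enabled)
        fp_trap(c, l);
    c->fpr[r] = v;
}

/* Models an FP instruction in lwp l that reads register r. */
double fp_read(struct cpu *c, struct lwp *l, int r)
{
    assert(r >= 0 && r < NFPR);
    if (!l->fp_enabled)
        fp_trap(c, l);
    return c->fpr[r];
}
```

In this model nothing is saved or restored at a context switch as
such; the only transfers happen inside fp_trap, when an lwp that does
not own the FP unit touches it.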

I own up to taking a lot of "poetic license" with my description.  It
was not my intent to precisely describe the true train of events in
the kernel, but rather to sketch them for purposes of discussion.
Your "but will trap in the old lwp" is what I meant to imply by my "or
giving the float reg set up when some other process needs it", and my
bit about a process having a claim on float registers was merely that,
from the using process's point of view, the process has a right to
use the CPU's floating point resources, and has an expectation that
its data in FP regs will not be corrupted during context switches.
And my bit about the kernel never "issues" a floating point save area
was meant to emphasize that the kernel is unconcerned about saving a
process's FP regs until the process lays a claim to the FP register
set by actually using the FP unit.  I did not mean to deny that memory
for the FP register context is reserved in advance in the process's
state area when the process is created, regardless of whether that
memory is ever used.

But I think it's important to keep our terms defined, and our
discussion limited.  For instance, what I mean by process is the thing
that's identified by a PID in the left-hand column of a top or ps
display.  I take it that you mean roughly the same when you say "lwp",
although the nearest translation I can make of this is "light-weight
process," which confuses me a bit. And I didn't think it necessary to
complicate matters by broadening the discussion to include the
requirements presented by multi-threading processes (which, to be
fair, you didn't do), or controlling multiple processors. But I'm very
likely to be wrong about that -- please set me straight if I am.
That's the very, and only, reason I bring all this up -- so I can be
set straight.

>>2.  As soon as a process uses a float instruction for any reason (say
>>     a gratuitous lfd/stfd instruction sequence solely for a data move),
>>     then the process is saddled with a claim on float registers to
>>     save/restore on context switches -- although the same mechanism
>>     used to trap the float use event in the first place might bar the
>>     necessity to save/restore the registers, unless a float register
>>     is used again in a context switch return.
>
> Not true.

Here I'm truly puzzled (which is, of course, my usual state). Which
isn't true? There are really three assertions here.

1.  The kernel is obligated to preserve a process's FP registers once
    the process has used the FP unit.  If this is not true, then the
    process must assume that data that it puts in any given FP
    register will become corrupt at any random time.

2.  On first use of the FP unit, the process is tagged as using the FP
    unit.  Since you necessarily imply this point in your discussion
    above, can this be untrue?

3.  Given that a process is tagged as using the FP unit, then on a
    context switch, the kernel might theoretically do one of two things:

    a.  restore a process's FP register contents to the FP registers,

    or,

    b.  reset the FPU fault trip bit in the MSR, so that the kernel
        would only restore the FP register set if and when the process
        faulted on a subsequent FP instruction.

    Is there another realistic possibility?
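
The difference between 3a and 3b can be sketched with a toy counting
model.  Everything in it is an assumption made for illustration: the
slice structure, the two functions, and the simplification that 3a
costs one restore plus one save for every time slice of a tagged
process, while 3b moves register sets only when a process that does
not own the FP unit actually uses it.  Trap overhead is not counted.

```c
#include <assert.h>
#include <stddef.h>

#define NPROC 8

/* One scheduling time slice: which process ran, and did it use FP? */
struct slice {
    int pid;
    int uses_fp;
};

/* 3a (eager): once tagged, a process pays a restore and a save per slice. */
int count_eager(const struct slice *s, size_t n)
{
    int tagged[NPROC] = { 0 };
    int transfers = 0;

    for (size_t i = 0; i < n; i++) {
        assert(s[i].pid >= 0 && s[i].pid < NPROC);
        if (s[i].uses_fp)
            tagged[s[i].pid] = 1;   /* first FP use tags the process */
        if (tagged[s[i].pid])
            transfers += 2;         /* load on the way in, save on the way out */
    }
    return transfers;
}

/* 3b (lazy): register sets move only when FP use changes the owner. */
int count_lazy(const struct slice *s, size_t n)
{
    int owner = -1;                 /* no process owns the FP unit yet */
    int transfers = 0;

    for (size_t i = 0; i < n; i++) {
        assert(s[i].pid >= 0 && s[i].pid < NPROC);
        if (!s[i].uses_fp || s[i].pid == owner)
            continue;               /* no FP fault in this slice */
        if (owner != -1)
            transfers++;            /* save the old owner's set */
        transfers++;                /* load the new owner's set */
        owner = s[i].pid;
    }
    return transfers;
}
```

In this model a process that touches the FP unit once and never again
keeps paying on every slice under 3a but pays only once under 3b,
while two processes that both use FP in every slice cost about the
same either way.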

But I take it that you're probably saying that one of points 3a and
3b is false, and the other is true.  I'm curious as to the
answer, but not enough at present to consult the code -- even though
it might have some bearing on the matter at hand.  I'm guessing that
in general 3a would be the better strategy under the theory that
having need of the FP, the user is going to use the heck out of it.
3b would be a better choice for a user not wanting FP resources but
unaware that gcc was getting them for him anyway.  And of course, you
could mix the two by profiling, if you reasoned that the effort is
worth it.  Well, those are practical matters best decided wrt the
facts as you find them.  But my money says you, for good and
sufficient reason, decided on 3a.

But my point for the moment was merely that, one way or the other,
resources would be consumed -- and one of those ways could be
triggered by as little as a single float load/store caused by a gcc
data movement. If that results in a complete FP register set
save/restore on every context switch, so much the worse for arguing
the efficacy of doing floating point load/stores for data
movement(s).  But I really
wasn't interested in parsing it that far at the time.  I would have
preferred to leave it to you, as the obviously expert witness, to point
out the facts, whichever way they lie.

>>3.  If most processes unmeaningly hit a gratuitous float event on most
>>     context switches, then the kernel will be forced to spend a
>>     significant amount of time saving/restoring float register sets.
>
> Not true.

Again, which isn't true?  Is your point that the kernel doesn't do
this, or that it does do this, but the resources needed to
save/restore the float registers are really insignificant?  If the
required resources are insignificant, what are they insignificant with
regard to?  After all, a 64 word write or read to or from main memory
is not an event that can pass unnoticed: it's more than a minor blurp
to the cpu.  And, these are floating point registers we're talking
about here -- they may or may not be optimized to access the cache
line -- it depends on the implementation. If it happens a lot, it will
cause talk.
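
The 64 word figure is just arithmetic on the FP register file: 32
registers of 8 bytes each, counted in 4-byte words.  A small sketch of
that arithmetic follows.  The 32-byte cache line mentioned in the
usage note is an assumption about a typical part, not a statement
about any particular CPU, and the FPSCR is left out of the count.

```c
#include <assert.h>

enum {
    NUM_FPRS   = 32,    /* FP registers in the PowerPC register file */
    FPR_BYTES  = 8,     /* each FP register holds a 64-bit double */
    WORD_BYTES = 4      /* a 32-bit word */
};

/* Bytes moved by one full FP register set save or restore. */
unsigned fp_set_bytes(void)
{
    return NUM_FPRS * FPR_BYTES;
}

/* The same amount expressed in 32-bit words. */
unsigned fp_set_words(void)
{
    return fp_set_bytes() / WORD_BYTES;
}

/* Cache lines touched, for a caller-supplied (assumed) line size. */
unsigned fp_set_cache_lines(unsigned line_bytes)
{
    assert(line_bytes > 0);
    return (fp_set_bytes() + line_bytes - 1) / line_bytes;
}
```

With an assumed 32-byte line, fp_set_cache_lines(32) gives the number
of lines that one save or one restore has to touch.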

>>4.  A process that did not use float at all might nevertheless be
>>     affected by some (increased) unpredictability in its context
>>     switch-in or switch-out, or in some other way.  (Hmmm, I really
>>     haven't thought that one through, but it occurs to me as a
>>     possibility.
>
> Not true

Hmmm, here we get to the nubbin of things, and where I'm intrigued.
I'm quite prepared to accept your answer, but I wish you would amplify
it with a discussion.  After all, this is a system where you can
generally rely on the notion that "_everything_ affects everything
else".

>>5.  And, (not specifically following from your comments, but I'd like
>>     to point out anyway), the trap to the kernel on first float
>>     instruction use is itself a non-trivial event, which arguably
>>     should happen only if the process is sincere in its intent to use
>>     floating point arithmetic.  Presumably, this alone would
>>     disqualify the use of float resources for trivial memory to memory
>>     data movements.
>
>This I agree with.

Well, you and I are in hearty agreement here, and, obviously, this
point speaks loudly to me. I'm sure it's worthy of its own discussion.
However, I can't see that it's relevant to Jason's points, and as
such, I rather back-handedly introduced it into this discussion.  I
could wish that someone with more knowledge would show a relation.

Again, please take my comments in the spirit that they are offered.  I
mean no offense, so if they seem to give any, just put it down to my
lack of skill as a diplomat.

Yours,

-Mike