Subject: Re: I want to rid ugly float load/stores used only for data
To: M L Riechers <mlr@rse.com>
From: Matt Thomas <matt@3am-software.com>
List: port-powerpc
Date: 02/23/2003 19:03:07
At 03:13 PM 2/23/2003, M L Riechers wrote:
>Thank you, Matt, for responding to my post.
>
>First, I'd like to make sure that my comments are taken in the right
>vein. Other than my obvious bias and loathing wrt the use of float
>load/stores for data movement, I'm not pushing any agenda here. I
>truly am curious about the issues Jason referred to. I'd just like to
>know.
>
>On Sat, 22 Feb 2003 23:33:20 -0800 Matt Thomas <matt@3am-software.com>
>responded (taking last things first):
>
> > BTW, do you have a small C snippet that shows this happening?
>
>Errr, no, I don't think so, but it's more than likely that I'm
>misunderstanding. I'd be glad to be obliged. Could you be a bit more
>specific about what's happening that you'd like to know?
OK. I now have a handle on why gcc emits FP instructions for non-FP
uses. One case is a function that uses stdarg/varargs. The other is
block loads.
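
To make those two cases concrete, here is a minimal sketch in C. The
names sum_ints, struct block, and copy_block are made up for
illustration. The assumption is a 32-bit PowerPC target: there a
varargs function may spill the FP argument registers in its prologue,
and an 8-byte-aligned structure copy may be done with lfd/stfd. Whether
a particular gcc version and set of options really emits them has to be
checked with objdump -d.

```c
#include <stdarg.h>

/*
 * Case 1: a varargs function that never touches a double.
 * Assumption: on 32-bit PowerPC the prologue may still store the FP
 * argument registers to the register save area, because the callee
 * cannot know whether the caller passed floating point arguments.
 */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);   /* integer arguments only */
    va_end(ap);
    return total;
}

/*
 * Case 2: a plain structure copy with 8-byte alignment.
 * Assumption: the compiler may move each 8-byte chunk through an FP
 * register (lfd/stfd) rather than using two lwz/stw pairs.
 */
struct block {
    long long words[2];             /* pure integer data */
};

void copy_block(struct block *dst, const struct block *src)
{
    *dst = *src;                    /* block move, no arithmetic */
}
```

Compile this with optimization for a PowerPC target and disassemble
the object file; any stfd or lfd in either function is an FP
load/store used purely for data movement.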
>All of my surmises were based on what I know of the powerpc kernel
>code. My statistics concerning float load/stores are based on a
>counter I put into the kernel floating emulation trap and the emulator
>itself. My conclusions concerning the fact of gcc using float
>load/stores are based on my using objdump -d to disassemble some
>number of programs and visually inspect the program for what it's
>doing.
>
> > Floating point registers are never saved or restored on a context switch.
> > Each processor knows which was the last lwp that used FP (which "owns"
> > the FP unit). If another lwp requests use of the FP unit, then the FP lazy
> > switch code will save the current FP state into the owner's state area,
> > revoke the owner, set the owner to the new lwp, and then enable FP in
> > the saved-MSR of the lwp. Now FP instructions won't trap in the new
> > owner but will trap in the old lwp.
>
>Yes, precisely, and thank you for the amplification.
>
>I own up to taking a lot of "poetic license" with my description. It
>was not my intent to precisely describe the true train of events in
>the kernel, but rather to sketch them for purposes of discussion.
>Your "but will trap in the old lwp" is what I meant to imply by my "or
>giving the float reg set up when some other process needs it", and my
>bit about process having a claim on float registers was merely that,
>from the using processes' point of view, the process has a right to
>use the CPU's floating point resources, and has an expectation that
>its data in FP regs will not be corrupted during context switches.
>And my bit about the kernel never "issues" a floating point save area
>was an emphasis on the fact that the kernel is unconcerned about
>saving a process's FP regs until the process lays a claim to the FP
>register set by actually using the FP unit, rather than say, noting
>that in fact, it is true that memory for the FP register context is
>prospectively reserved in the process's state area when the process
>is created, regardless of whether that memory is ever used.
All processes have an FP save area. It's just not initialized until
an FP instruction is issued.
>But I think it's important to keep our terms defined, and our
>discussion limited. For instance, what I mean by process is the thing
>that's identified by a PID in the left-hand column of a top or ps
>display. I take it that you mean roughly the same when you say "lwp",
>although the nearest translation I can make of this is "light-weight
>process," which confuses me a bit. And I didn't think it necessary to
>complicate matters by broadening the discussion to include the
>requirements presented by multi-threading processes (which, to be
>fair, you didn't do), or controlling multiple processors. But I'm very
>likely to be wrong about that -- please set me straight if I am.
>That's the very, and only, reason I bring all this up -- so I can be
>set straight.
FP state is associated with a lightweight process (thread) these days.
A Unix process has one or more lwps. In top or ps, you see a Unix
process by default. BTW, you can use "vmstat -e" to see the number of
FP faults.
> >>2. As soon as a process uses a float instruction for any reason (say
> >> a gratuitous lfd/stfd instruction sequence solely for a data move),
> >> then the process is saddled with a claim on float registers to
> >> save/restore on context switches -- although the same mechanism
> >> used to trap the float use event in the first place might bar the
> >> necessity to save/restore the registers, unless a float register
> >> is used again in a context switch return.
> >
> > Not true.
>
>Here I'm truly puzzled (which is, of course my usual state). Which
>isn't true? There are really three assertions here.
>
>1. The kernel is obligated to preserve a process's FP registers once
> the process has used the FP unit. If this is not true, then the
> process must assume that data that it puts in any given FP
> register will become corrupt at any random time.
Yes, the kernel is obligated to do that.
>2. On first use of the FP unit, the process is tagged as using the FP
> unit. Since you necessarily imply this point in your discussion
> above, can this be untrue?
It is true.
>3. Given that a process is tagged as using the FP unit, then on a
> context switch, the kernel, one or the other, might theoretically:
>
> a. restore a process's FP register contents to the FP registers,
>
> or,
>
> b. reset the fpu fault trip bit in the msr, so that the kernel
> would only restore the FP register set if and when the process
> faulted on a subsequent FP instruction.
>
> Is there another realistic possibility?
Not really. In reality, NetBSD/powerpc does 3b.
>But I take it that you're probably saying that one or the other point
>3a or 3b is false, and the other is true. I'm curious as to the
>answer, but not enough at present to consult the code -- even though
>it might have some bearing on the matter at hand. I'm guessing that
>in general 3a would be the better strategy under the theory that
>having need of the FP, the user is going to use the heck out of it.
>3b would be a better choice for a user not wanting FP resources but
>unaware that gcc was getting them for him anyway. And of course, you
>could mix the two by profiling, if you reasoned that the effort is
>worth it. Well, those are practical matters best decided wrt the
>facts as you find them. But my money says you, for good and
>sufficient reason, decided on 3a.
Nope. 3b.
>But my point for the moment was merely that, one way or the other,
>resources would be consumed -- and one of those ways could be
>triggered by as little as a single float load/store caused by a gcc data
>movement. If that results in a complete FP register set save/restore
>on every context switch, so much the worse for arguing the efficacy of
>doing floating point load/stores for data movement(s). But I really
>wasn't interested in parsing it that far at the time. I would have
>preferred to leave it to you, as the obviously expert witness, to point
>out the facts, be they lie one way or the other.
But fortunately that doesn't happen. It does increase the number of
unneeded FP context switches, but at least on the machine I'm looking
at, not significantly.
> >>3. If most processes unmeaningly hit a gratuitous float event on most
> >> context switches, then the kernel will be forced to spend a
> >> significant amount of time saving/restoring float register sets.
> >
> > Not true.
>
>Again, which isn't true? Is your point that the kernel doesn't do
>this, or that it does do this, but the resources needed to
>save/restore the float registers is really insignificant? If the
>required resources are insignificant, what are they insignificant with
>regard to? After all, a 64 word write or read to or from main memory
>is not an event that can pass unnoticed: it's more than a minor blurp
>to the cpu. And, these are floating point registers we're talking
>about here -- they may or may not be optimized to access the cache
>line -- it depends on the implementation. If it happens a lot, it will
>cause talk.
I don't believe most processes incur FP traps. Here are the counters
from my G4, which runs seti@home in the background:
6:20PM up 2 days, 11:06, 1 user, load averages: 1.21, 1.14, 1.09
cpu0 FPU unavailable traps 413002 1 trap
cpu0 FPU context switches 413001 1 trap
That's not a lot of FPU switches considering the FP unit is being
heavily used by seti@home.
> >>4. A process that did not use float at all might nevertheless be
> >> affected by some (increased) unpredictability in its context
> >> switch-in or switch-out, or in some other way. (Hmmm, I really
> >> haven't thought that one through, but it occurs to me as a
> >> possibility.
> >
> > Not true
>
>Hmmm, here we get to the nubbin of things, and where I'm intrigued.
>I'm quite prepared to accept your answer, but I wish you would amplify
>it with a discussion. After all, this is a system where you can
>generally rely on the notion that "_everything_ affects everything
>else".
Because FP switching is lazy, not using FP neither helps nor hinders
an lwp: for it, it's as if the processor didn't have FP.
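
The lazy scheme described in this thread can be modelled in a few
lines of C. This is only a sketch for discussion, not the
NetBSD/powerpc code: struct lwp, struct cpu, fpu_unavailable_trap,
fp_write, and fp_read are all hypothetical names, and the fp_enabled
flag stands in for the FP bit in the lwp's saved MSR.

```c
#include <stddef.h>
#include <string.h>

#define NFPREGS 32

struct lwp {
    double fpsave[NFPREGS];   /* FP save area, present in every lwp */
    int fp_used;              /* save area has been initialized */
    int fp_enabled;           /* stand-in for FP bit in saved MSR */
};

struct cpu {
    double fpregs[NFPREGS];   /* live FP register file */
    struct lwp *fp_owner;     /* last lwp to use FP here, or NULL */
    unsigned long fp_traps;   /* FPU unavailable traps taken */
};

/* Taken when an lwp without FP enabled issues an FP instruction. */
void fpu_unavailable_trap(struct cpu *c, struct lwp *l)
{
    struct lwp *old = c->fp_owner;

    c->fp_traps++;
    if (old != NULL) {
        /* Save the current owner's state and revoke its access. */
        memcpy(old->fpsave, c->fpregs, sizeof old->fpsave);
        old->fp_enabled = 0;
    }
    if (!l->fp_used) {
        /* First FP instruction ever: initialize the save area. */
        memset(l->fpsave, 0, sizeof l->fpsave);
        l->fp_used = 1;
    }
    memcpy(c->fpregs, l->fpsave, sizeof c->fpregs);
    c->fp_owner = l;
    l->fp_enabled = 1;        /* later FP instructions won't trap */
}

/* Model of lwp l executing an FP instruction that writes a register. */
void fp_write(struct cpu *c, struct lwp *l, int reg, double v)
{
    if (!l->fp_enabled)
        fpu_unavailable_trap(c, l);
    c->fpregs[reg] = v;
}

/* Model of lwp l executing an FP instruction that reads a register. */
double fp_read(struct cpu *c, struct lwp *l, int reg)
{
    if (!l->fp_enabled)
        fpu_unavailable_trap(c, l);
    return c->fpregs[reg];
}
```

In this model nothing FP-related happens on an ordinary context
switch. A trap, and with it a save/restore of the register set, occurs
only when an lwp that does not own the FP unit issues an FP
instruction, and an lwp that never issues one never shows up in the
counters at all.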
> >>5. And, (not specifically following from your comments, but I'd like
> >> to point out anyway), the trap to the kernel on first float
> >> instruction use is itself a non-trivial event, which arguably
> >> should happen only if the process is sincere in its intent to use
> >> floating point arithmetic. Presumably, this alone would
> >> disqualify the use of float resources for trivial memory to memory
> >> data movements.
> >
> >This I agree with.
>
>Well, you and I are in hearty agreement here, and, obviously, this
>point speaks loudly to me. I'm sure it's worthy of its own discussion.
>However, I can't see that it's relevant to Jason's points, and as
>such, I rather back-handedly introduced it into this discussion. I
>could wish that someone with more knowledge would show a relation.
--
Matt Thomas Internet: matt@3am-software.com
3am Software Foundry WWW URL: http://www.3am-software.com/bio/matt/
Cupertino, CA Disclaimer: I avow all knowledge of this message