port-i386: Floating point hardware vs. software

Subject: Floating point hardware vs. software
To: None <port-i386@NetBSD.ORG>
From: Michael Graff <explorer@flame.org>
List: port-i386
Date: 02/02/1996 03:24:44
Has anyone looked at the ``gnu'' version of the FPE?

The README file is contained below, and it appears to be a Berkeley
style copyright for use in NetBSD, among others.

I am getting sick of having to edit /usr/src/lib/libm/Makefile to tell
it to use the FP hardware.  This is really a bad spot for the i386
port, imho.

--Michael

--
Michael Graff <explorer@flame.org>        NetBSD is the way to go!
PGP key on a key-server near you!         Netshade the world!

/*
 *  wm-FPU-emu   an FPU emulator for 80386 and 80486SX microprocessors.
 *
 *
 * Copyright (C) 1992,1993,1994
 *                       W. Metzenthen, 22 Parker St, Ormond, Vic 3163,
 *                       Australia.  E-mail   billm@vaxc.cc.monash.edu.au
 * All rights reserved.
 *
 * This copyright notice covers the redistribution and use of the
 * FPU emulator developed by W. Metzenthen. It covers only its use
 * in the 386BSD, FreeBSD and NetBSD operating systems. Any other
 * use is not permitted under this copyright.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must include information specifying
 *    that source code for the emulator is freely available and include
 *    either:
 *      a) an offer to provide the source code for a nominal distribution
 *         fee, or
 *      b) list at least two alternative methods whereby the source
 *         can be obtained, e.g. a publically accessible bulletin board
 *         and an anonymous ftp site from which the software can be
 *         downloaded.
 * 3. All advertising materials specifically mentioning features or use of
 *    this emulator must acknowledge that it was developed by W. Metzenthen.
 * 4. The name of W. Metzenthen may not be used to endorse or promote
 *    products derived from this software without specific prior written
 *    permission.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL
 * W. METZENTHEN BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 *
 * The purpose of this copyright, based upon the Berkeley copyright, is to
 * ensure that the covered software remains freely available to everyone.
 *
 * The software (with necessary differences) is also available, but under
 * the terms of the GNU copyleft, for the Linux operating system and for
 * the djgpp ms-dos extender.
 *
 * W. Metzenthen   June 1994.
 *
 */

wm-FPU-emu is an FPU emulator for Linux. It is derived from wm-emu387
which is my 80387 emulator for djgpp (gcc under msdos); wm-emu387 was
in turn based upon emu387 which was written by DJ Delorie for djgpp.
The interface to the Linux kernel is based upon the original Linux
math emulator by Linus Torvalds.

My target FPU for wm-FPU-emu is that described in the Intel486
Programmer's Reference Manual (1992 edition). Numerous facets of the
functioning of the FPU are not well covered in the Reference Manual;
in the absence of clear details I have made guesses about the most
reasonable behaviour. Recently, this situation has improved because
I now have some access to the results produced by a real 80486 FPU.

wm-FPU-emu does not implement all of the behaviour of the 80486 FPU. 
See "Limitations" later in this file for a partial list of some
differences.  I believe that the missing features are never used by
normal C or FORTRAN programs. 


Please report bugs, etc to me at:
       apm233m@vaxc.cc.monash.edu.au


--Bill Metzenthen
  May 1993


----------------------- Internals of wm-FPU-emu -----------------------

Numeric algorithms:
(1) Add, subtract, and multiply. Nothing remarkable in these.
(2) Divide has been tuned to get reasonable performance. The algorithm
    is not the obvious one which most people seem to use, but is designed
    to take advantage of the characteristics of the 80386. I expect that
    it has been invented many times before I discovered it, but I have not
    seen it. It is based upon one of those ideas which one carries around
    for years without ever bothering to check it out.
(3) The sqrt function has been tuned to get good performance. It is based
    upon Newton's classic method. Performance was improved by capitalizing
    upon the properties of Newton's method, and the code is once again
    structured taking account of the 80386 characteristics.
(4) The trig, log, and exp functions are based in each case upon quasi-
    "optimal" polynomial approximations. My definition of "optimal" was
    based upon getting good accuracy with reasonable speed.

The code of the emulator is complicated slightly by the need to
account for a limited form of re-entrancy. Normally, the emulator will
emulate each FPU instruction to completion without interruption.
However, it may happen that when the emulator is accessing the user
memory space, swapping may be needed. In this case the emulator may be
temporarily suspended while disk i/o takes place. During this time
another process may use the emulator, thereby changing some static
variables (eg FPU_st0_ptr, etc). The code which accesses user memory
is confined to five files:
    fpu_entry.c
    reg_ld_str.c
    load_store.c
    get_address.c
    errors.c

----------------------- Limitations of wm-FPU-emu -----------------------

There are a number of differences between the current wm-FPU-emu
(version beta 1.4) and the 80486 FPU (apart from bugs). Some of the
more important differences are listed below:

All internal computations are performed at 64 bit or higher precision
and rounded etc as required by the PC bits of the FPU control word.
Under the crt0 version for Linux current at March 1993, the FPU PC
bits specify 53 bits precision.

The precision flag (PE of the FPU status word) and the Roundup flag
(C1 of the status word) are now partially implemented. Does anyone
write code which uses these features?

The functions which load/store the FPU state are partially implemented,
but the implementation should be sufficient for handling FPU errors etc
in 32 bit protected mode.

The implementation of the exception mechanism is flawed for unmasked
interrupts.

Detection of certain conditions, such as denormal operands, is not yet
complete.

----------------------- Performance of wm-FPU-emu -----------------------

Speed.
-----

The speed of floating point computation with the emulator will depend
upon instruction mix. Relative performance is best for the instructions
which require most computation. The simple instructions are adversely
affected by the fpu instruction trap overhead.


Timing: Some simple timing tests have been made on the emulator functions.
The times include load/store instructions. All times are in microseconds
measured on a 33MHz 386 with 64k cache. The Turbo C tests were under
ms-dos, the next two columns are for emulators running with the djgpp
ms-dos extender. The final column is for wm-FPU-emu in Linux 0.97,
using libm4.0 (hard).

function      Turbo C        djgpp 1.06        WM-emu387     wm-FPU-emu

   +          60.5           154.8              76.5          139.4
   -          61.1-65.5      157.3-160.8        76.2-79.5     142.9-144.7
   *          71.0           190.8              79.6          146.6
   /          61.2-75.0      261.4-266.9        75.3-91.6     142.2-158.1

 sin()        310.8          4692.0            319.0          398.5
 cos()        284.4          4855.2            308.0          388.7
 tan()        495.0          8807.1            394.9          504.7
 atan()       328.9          4866.4            601.1          419.5-491.9

 sqrt()       128.7          crashed           145.2          227.0
 log()        413.1-419.1    5103.4-5354.21    254.7-282.2    409.4-437.1
 exp()        479.1          6619.2            469.1          850.8


The performance under Linux is improved by the use of look-ahead code.
The following results show the improvement which is obtained under
Linux due to the look-ahead code. Also given are the times for the
original Linux emulator with the 4.1 'soft' lib.

 [ Linus' note: I changed look-ahead to be the default under linux, as
   there was no reason not to use it after I had edited it to be
   disabled during tracing ]

            wm-FPU-emu w     original w
            look-ahead       'soft' lib
   +         106.4             190.2
   -         108.6-111.6      192.4-216.2
   *         113.4             193.1
   /         108.8-124.4      700.1-706.2

 sin()       390.5            2642.0
 cos()       381.5            2767.4
 tan()       496.5            3153.3
 atan()      367.2-435.5     2439.4-3396.8

 sqrt()      195.1            4732.5
 log()       358.0-387.5     3359.2-3390.3
 exp()       619.3            4046.4


These figures are now somewhat out-of-date. The emulator has become
progressively slower for most functions as more of the 80486 features
have been implemented.


----------------------- Accuracy of wm-FPU-emu -----------------------


Accuracy: The following table gives the accuracy of the sqrt(), trig
and log functions. Each function was tested at about 400 points. Ideal
results would be 64 bits. The reduced accuracy of cos() and tan() for
arguments greater than pi/4 can be thought of as being due to the
precision of the argument x; e.g. an argument of pi/2-(1e-10) which is
accurate to 64 bits can result in a relative accuracy in cos() of about
64 + log2(cos(x)) = 31 bits. Results for the Turbo C emulator are given
in the last column.


Function      Tested x range            Worst result (bits)         Turbo C

sqrt(x)       1 .. 2                    64.1                         63.2
atan(x)       1e-10 .. 200              62.6                         62.8
cos(x)        0 .. pi/2-(1e-10)         63.2 (x <= pi/4)             62.4
                                        35.2 (x = pi/2-(1e-10))      31.9
sin(x)        1e-10 .. pi/2             63.0                         62.8
tan(x)        1e-10 .. pi/2-(1e-10)     62.4 (x <= pi/4)             62.1
                                        35.2 (x = pi/2-(1e-10))      31.9
exp(x)        0 .. 1                    63.1                         62.9
log(x)        1+1e-6 .. 2               62.4                         62.1


As of version 1.3 of the emulator, the accuracy of the basic
arithmetic has been improved (by a small fraction of a bit). Care has
been taken to ensure full accuracy of the rounding of the basic
arithmetic functions (+,-,*,/,and fsqrt), and they all now produce
results which are exact to the 64th bit (unless there are any bugs
left). To ensure this, it was necessary to effectively get information
of up to about 128 bits precision. The emulator now passes the
"paranoia" tests (compiled with gcc 2.3.3) for 'float' variables (24
bit precision numbers) when precision control is set to 24, 53 or 64
bits, and for 'double' variables (53 bit precision numbers) when
precision control is set to 53 bits (a properly performing FPU cannot
pass the 'paranoia' tests for 'double' variables when precision
control is set to 64 bits).

------------------------- Contributors -------------------------------

A number of people have contributed to the development of the
emulator, often by just reporting bugs, sometimes with a suggested
fix, and a few kind people have provided me with access in one way or
another to an 80486 machine. Contributors include (to those people who
I have forgotten, please excuse me):

Linus Torvalds
Tommy.Thorn@daimi.aau.dk
Andrew.Tridgell@anu.edu.au
Nick Holloway alfie@dcs.warwick.ac.uk
Hermano Moura moura@dcs.gla.ac.uk
Jon Jagger J.Jagger@scp.ac.uk
Lennart Benschop
Brian Gallew geek+@CMU.EDU
Thomas Staniszewski ts3v+@andrew.cmu.edu
Martin Howell mph@plasma.apana.org.au
M Saggaf alsaggaf@athena.mit.edu
Peter Barker PETER@socpsy.sci.fau.edu
tom@vlsivie.tuwien.ac.at
Dan Russel russed@rpi.edu
Daniel Carosone danielce@ee.mu.oz.au
cae@jpmorgan.com
Hamish Coleman t933093@minyos.xx.rmit.oz.au

...and numerous others who responded to my request for help with
a real 80486.