port-i386: FPU context switching

Subject: FPU context switching
To: None <port-i386@netbsd.org>
From: Andreas Gustafsson <gson@araneus.fi>
List: port-i386
Date: 01/17/1999 17:44:03

I have been studying the FPU context switching code of NetBSD/i386
(sys/arch/i386/isa/npx.c) trying to figure out the best way to add
support for using the FPU in device drivers.

I think I'm finally beginning to understand how it works, but the code
is sufficiently convoluted and confusing that I would like to run my
findings by the port-i386 list to check whether my understanding is
correct.

The basic procedure for speculative deferred FPU context switching
is outlined in "Intel Architecture Software Developer's Manual
Volume 1: Basic Architecture" (Intel order no. 243190), section D.3.6.
According to this document, the procedure involves setting the TS bit
in CR0 at every process context switch.  Therefore, there will always
be a DNA (device not available) trap on the first attempt to use the
FPU after a process context switch, even if the process trying to use
the FPU is the same one that last used it (in NetBSD kernel parlance,
when curproc==npxproc).
When this happens, the DNA trap handler will notice that no FPU
context switch is needed and return immediately.  This short-circuit
path is visible in figure D-5 of the Intel document.
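
To make the Intel-style scheme concrete, here is a minimal user-space
sketch of it as I understand it.  This is a toy model, not kernel code:
the names intel_proc, intel_cpu, intel_switch_to() and intel_use_fpu()
are all made up for illustration, the TS bit is modelled as a plain
boolean, and the actual save/restore of FPU state is reduced to a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a process; not the NetBSD struct proc. */
struct intel_proc {
    int pid;
};

/* Toy model of the CPU state relevant to deferred FPU switching. */
struct intel_cpu {
    bool ts;                         /* models the TS bit in CR0 */
    const struct intel_proc *cur;    /* models curproc */
    const struct intel_proc *owner;  /* models npxproc (FPU state owner) */
    unsigned dna_traps;              /* number of DNA traps taken */
    unsigned state_swaps;            /* number of real FPU state switches */
};

/* Process context switch: TS is set unconditionally. */
static void intel_switch_to(struct intel_cpu *cpu, const struct intel_proc *p)
{
    cpu->cur = p;
    cpu->ts = true;
}

/* The current process executes an FPU instruction. */
static void intel_use_fpu(struct intel_cpu *cpu)
{
    assert(cpu->cur != NULL);

    if (!cpu->ts)
        return;                      /* TS clear: no trap */

    /* DNA trap handler starts here. */
    cpu->dna_traps++;
    cpu->ts = false;                 /* must clear TS, or we would trap again */

    if (cpu->owner == cpu->cur)
        return;                      /* short-circuit path: state already loaded */

    /* Save the old owner's state and load ours (modelled as a counter). */
    cpu->owner = cpu->cur;
    cpu->state_swaps++;
}
```

In this model, if process A uses the FPU, the kernel switches to B
(which does not touch the FPU) and then back to A, A's next FPU
instruction takes a second DNA trap even though no state needs to be
switched.  That is the trap the short-circuit path exists to handle.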

The way NetBSD/i386 works is actually subtly different.  NetBSD does
not set TS on every process context switch; instead, it keeps a
per-process copy of CR0 in the machine-dependent part of the proc
structure, and loads CR0 from this copy at each process context
switch.  Furthermore, NetBSD maintains the invariant that only the
process currently owning the FPU state (i.e., npxproc) has the TS bit
cleared in the saved CR0 in its struct proc; all other processes have
TS set in their saved CR0.  The net effect is that the TS bit is set
on every process context switch as in the Intel case, except when
switching to npxproc.  When switching to npxproc, the TS bit will
instead be cleared, and the unnecessary DNA trap that would occur in
the Intel-style case is thereby avoided.
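
The NetBSD variant can be sketched the same way.  Again this is only a
toy model under my own assumptions, not code from npx.c: nb_proc,
nb_cpu and the nb_*() functions are hypothetical names, and I assume
the trap handler is the place where the saved CR0 copies get updated.
The only hardware fact used is that TS is bit 3 of CR0.

```c
#include <assert.h>
#include <stddef.h>

#define NB_CR0_TS 0x8u               /* TS is bit 3 of CR0 */

/* Hypothetical stand-in for the per-process saved copy of CR0. */
struct nb_proc {
    unsigned saved_cr0;
};

/* Toy model of the CPU state relevant to deferred FPU switching. */
struct nb_cpu {
    unsigned cr0;                    /* models the live CR0 register */
    struct nb_proc *cur;             /* models curproc */
    struct nb_proc *owner;           /* models npxproc */
    unsigned dna_traps;              /* number of DNA traps taken */
};

/* A new process does not own the FPU, so its saved CR0 has TS set. */
static void nb_proc_init(struct nb_proc *p)
{
    p->saved_cr0 = NB_CR0_TS;
}

/* Process context switch: CR0 is loaded from the per-process copy. */
static void nb_switch_to(struct nb_cpu *cpu, struct nb_proc *p)
{
    cpu->cur = p;
    cpu->cr0 = p->saved_cr0;
}

/* The current process executes an FPU instruction. */
static void nb_use_fpu(struct nb_cpu *cpu)
{
    if (!(cpu->cr0 & NB_CR0_TS))
        return;                      /* TS clear: no trap */

    /* DNA trap handler starts here. */
    cpu->dna_traps++;

    /* Given the invariant, the owner never gets here. */
    assert(cpu->cur != cpu->owner);

    /* Preserve the invariant: only the new owner has TS clear. */
    if (cpu->owner != NULL)
        cpu->owner->saved_cr0 |= NB_CR0_TS;
    cpu->owner = cpu->cur;
    cpu->cur->saved_cr0 &= ~NB_CR0_TS;
    cpu->cr0 = cpu->cur->saved_cr0;
}
```

With the same sequence as before (A uses the FPU, switch to B, switch
back to A, A uses the FPU again), this model takes only one DNA trap:
switching back to A reloads a CR0 with TS already clear.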

At the beginning of the npxdna() trap handler in npx.c, there is the 
following comment:

 * If the we were the last process to use the FPU, we can simply return.

This comment matches the Intel flow chart, but unfortunately it does
not match the code that follows.  If npxdna() is ever invoked with
curproc==npxproc, it will not simply return like the comment says.
Instead, it will first reinitialize the FPU and then load the FPU
state from the proc structure.  That looks like a strange thing to do.
Things do work anyway, but only because npxdna() is in fact
never called when curproc==npxproc.

Is this description consistent with reality, or am I hopelessly confused?
-- 
Andreas Gustafsson, gson@araneus.fi