AMDGPU: Floating Point traps in Display Core code

To: tech-kern%netbsd.org@localhost
Subject: AMDGPU: Floating Point traps in Display Core code
From: Jeff Frasca <thatguy%jeff-frasca.name@localhost>
Date: Fri, 24 Feb 2023 23:21:35 -0800

Ok, first off, the FP code I've run into is in the Display
Core code, specifically in:
  sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c
It's all SIMD code operating on xmmN registers.  To get to
this codepath, I needed to have CONFIG_DRM_AMD_DC set during
compilation.  I've attached a diff that adds this to files.amdgpu.

A typical backtrace printed out by ddb is:
breakpoint()
vpanic()
panic()
fputrap()
Xtrap16()
dcn10_create_resource_pool()
dc_create()
dm_hw_init()
amdgpu_device_init()
amdgpu_driver_load_kms()
drm_dev_register()
amdgpu_attach_real()
config_mountroot_thread()

(I had to type it in manually from a picture snapped on my
phone, so there are no offsets.  If any of those are of
interest, let me know.)

There's a call missing from the backtrace that (I think)
gets eaten by the trap jump: dcn_bw_update_from_pplib().
(It's in amdgpu_dcn_calcs.c.)

The actual trap number that's getting generated is 19
rather than the 16 implied by the call to Xtrap16 (but
I suspect y'all understand that quirk better than I do.)

dcn_bw_update_from_pplib() dutifully calls the macro
DC_FP_START(), which I believe Taylor wired up to call
fpu_kern_enter(), which seems like it should do the right
thing.  However, the x86 fpu_kern_enter() only appears to
save registers and clear the TS flag in CR0 (via clts()),
leaving MXCSR alone.

The instruction that's causing the trap in this case is
the very first FP instruction in the function, and it's
tripping the precision exception (MXCSR is set to 0x20
when printed out in fputrap() by a debug printf I added
in my local build; this is also where I'm getting the
trap number 19 rather than 16).
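
For reference, a minimal sketch of how that 0x20 decodes,
assuming the MXCSR flag layout given in the Intel SDM.  The
helper names mxcsr_pending() and mxcsr_mask_for() are made up
for illustration and are not part of the NetBSD tree:

```c
#include <stdint.h>

/* MXCSR exception status flags (bits 0-5), per the Intel SDM. */
#define MXCSR_IE	0x0001u	/* invalid operation */
#define MXCSR_DE	0x0002u	/* denormal operand */
#define MXCSR_ZE	0x0004u	/* divide by zero */
#define MXCSR_OE	0x0008u	/* overflow */
#define MXCSR_UE	0x0010u	/* underflow */
#define MXCSR_PE	0x0020u	/* precision (inexact result) */
#define MXCSR_FLAGS	0x003fu	/* all six status flags */

/*
 * Hypothetical helper: return only the exception status flags
 * that are set in an MXCSR value.
 */
static uint32_t
mxcsr_pending(uint32_t mxcsr)
{
	return mxcsr & MXCSR_FLAGS;
}

/*
 * Hypothetical helper: the mask bit for a status flag sits seven
 * bits above the flag itself (PE is bit 5, PM is bit 12).
 */
static uint32_t
mxcsr_mask_for(uint32_t flag)
{
	return (flag & MXCSR_FLAGS) << 7;
}
```

With the value printed above, mxcsr_pending(0x20) is MXCSR_PE,
and mxcsr_mask_for(MXCSR_PE) is 0x1000, the bit that would have
to be set to keep the precision exception from trapping.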

The kicker is, sometimes it gets further.  I think there
are three different functions that do floating point math
in the init code path for my GPU, and I have seen it fail
in all three of them on repeated tries.  But it always
fails in one of them.

Unless I add another function call to DC_FP_START() that
masks all the non-fatal FP traps in MXCSR.  I tried
setting it to 0x00001d00 and 0x00009d40.  The former just
masks the non-fatal traps and the latter tries to set "do
sane things with edge cases" flags.  (If I try to mask
MXCSR in fpu_kern_enter(), then some of the crypto code
breaks.)
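
The two constants can be spelled out bit by bit.  This is only a
sketch, assuming the MXCSR layout in the Intel SDM; the macro
names MASK_MINIMAL and MASK_RELAXED and the helper
mxcsr_apply_mask() are invented here and do not exist in the
NetBSD tree:

```c
#include <stdint.h>

/* MXCSR control bits, per the Intel SDM. */
#define MXCSR_DAZ	0x0040u	/* denormals are zero */
#define MXCSR_IM	0x0080u	/* invalid operation mask */
#define MXCSR_DM	0x0100u	/* denormal operand mask */
#define MXCSR_ZM	0x0200u	/* divide by zero mask */
#define MXCSR_OM	0x0400u	/* overflow mask */
#define MXCSR_UM	0x0800u	/* underflow mask */
#define MXCSR_PM	0x1000u	/* precision mask */
#define MXCSR_FTZ	0x8000u	/* flush to zero */

/* 0x1d00: mask only the non-fatal exceptions. */
#define MASK_MINIMAL	(MXCSR_DM | MXCSR_OM | MXCSR_UM | MXCSR_PM)

/* 0x9d40: the same masks, plus DAZ and FTZ for denormal handling. */
#define MASK_RELAXED	(MASK_MINIMAL | MXCSR_DAZ | MXCSR_FTZ)

/*
 * Hypothetical helper: OR the chosen bits into an MXCSR value,
 * leaving every other bit (rounding mode, status flags) untouched.
 */
static uint32_t
mxcsr_apply_mask(uint32_t mxcsr, uint32_t mask)
{
	return mxcsr | mask;
}
```

Neither constant sets MXCSR_IM or MXCSR_ZM, so invalid operation
and divide by zero still trap in both cases.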

With the mask, it makes it through all the FP code, but the
screen gets blanked at a later failure point.  (Right now I
only have my laptop, which has the problem GPU in question,
and I won't have access to a second machine until next
weekend.  Once I do, I'm planning to divert kernel messages
and ddb I/O to a USB tty and try this again.  If anyone
has any tips on how to do that, or knows a way to do it
that I haven't seen, I'd love to hear it.)

Anyway, just so you can see what I did, I attached the fpu
patch.  I am not wedded to this code at all, and there may
be a much better way to do it.  I left the minimal constant
in the patch (0x1d00).  Nothing needs to happen to
fpu_kern_leave() or DC_FP_END(), because the
fpu_kern_enter() call should save MXCSR and
fpu_kern_leave() should restore over our changes (if I
have read the code and docs correctly).

Any thoughts?  Is there something that set me up for a
later failure?  Should the FPU code be rewritten?

Jeff

diff -u a/src/sys/external/bsd/drm2/amdgpu/files.amdgpu b/src/sys/external/bsd/drm2/amdgpu/files.amdgpu
--- a/src/sys/external/bsd/drm2/amdgpu/files.amdgpu	2022-07-24 13:05:00.000000000 -0700
+++ b/src/sys/external/bsd/drm2/amdgpu/files.amdgpu	2023-01-16 10:04:25.104684173 -0800
@@ -33,6 +33,7 @@
 makeoptions	amdgpu	"CPPFLAGS.amdgpu"+="-I$S/external/bsd/drm2/dist/drm/amd/display/dmub/inc"
 
 makeoptions	amdgpu	"CPPFLAGS.amdgpu"+="-DCONFIG_DRM_AMD_ACP=1"
+makeoptions	amdgpu	"CPPFLAGS.amdgpu"+="-DCONFIG_DRM_AMD_DC=1"
 makeoptions	amdgpu	"CPPFLAGS.amdgpu"+="-DCONFIG_DRM_AMD_DC_DCN=1"
 makeoptions	amdgpu	"CPPFLAGS.amdgpu"+="-DCONFIG_DRM_AMD_DC_HDCP=1"
 makeoptions	amdgpu	"CPPFLAGS.amdgpu"+="-DCONFIG_PERF_EVENTS=0"

diff -u a/src/sys/arch/x86/include/fpu.h b/src/sys/arch/x86/include/fpu.h
--- a/src/sys/arch/x86/include/fpu.h	2020-10-24 00:14:29.000000000 -0700
+++ b/src/sys/arch/x86/include/fpu.h	2023-01-18 14:05:19.119602523 -0800
@@ -35,6 +35,7 @@
 
 void fpu_kern_enter(void);
 void fpu_kern_leave(void);
+void fpu_kern_mask_mx_ex(void);
 
 void process_write_fpregs_xmm(struct lwp *, const struct fxsave *);
 void process_write_fpregs_s87(struct lwp *, const struct save87 *);
diff -u a/src/sys/arch/x86/x86/fpu.c b/src/sys/arch/x86/x86/fpu.c
--- a/src/sys/arch/x86/x86/fpu.c	2022-08-20 04:34:08.000000000 -0700
+++ b/src/sys/arch/x86/x86/fpu.c	2023-02-24 22:39:57.399550404 -0800
@@ -409,6 +409,20 @@
 	clts();
 }
 
+void
+fpu_kern_mask_mx_ex(void)
+{
+	uint32_t mxcsr;
+
+	x86_stmxcsr(&mxcsr);
+	/* Mask off Precision, Underflow, Overflow and denormalized
+	 * operand exceptions, but still blow up on divide by zero and
+	 * invalid operations
+	 */
+	mxcsr |= 0x00001d00;
+	x86_ldmxcsr(&mxcsr);
+}
+
 /*
  * fpu_kern_leave()
  *
diff -u a/src/sys/external/bsd/drm2/dist/drm/amd/display/dc/os_types.h b/src/sys/external/bsd/drm2/dist/drm/amd/display/dc/os_types.h
--- a/src/sys/external/bsd/drm2/dist/drm/amd/display/dc/os_types.h	2022-07-24 13:05:08.000000000 -0700
+++ b/src/sys/external/bsd/drm2/dist/drm/amd/display/dc/os_types.h	2023-02-24 22:38:23.054679729 -0800
@@ -56,7 +56,7 @@
 #ifdef __NetBSD__
 #if defined(__i386__) || defined(__x86_64__)
 #include <x86/fpu.h>
-#define	DC_FP_START()	fpu_kern_enter()
+#define	DC_FP_START()	do { fpu_kern_enter(); fpu_kern_mask_mx_ex(); } while (0)
 #define	DC_FP_END()	fpu_kern_leave()
 #elif defined(__arm__) || defined(__aarch64__)
 #include <arm/fpu.h>

Follow-Ups:
- Re: AMDGPU: Floating Point traps in Display Core code
  - From: Taylor R Campbell
