Re: AMDGPU: Floating Point traps in Display Core code

I'm replying on my phone. My apologies if it does dumb things to the formatting.

On Sat, Feb 25, 2023, 06:18 Taylor R Campbell <campbell+netbsd-tech-kern%mumble.net@localhost> wrote:

> Date: Fri, 24 Feb 2023 23:21:35 -0800
> From: Jeff Frasca <thatguy%jeff-frasca.name@localhost>
>
> Ok, first off, the FP code I've run into is in the Display
> Core code, specifically in:
> sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c
> It's all SIMD code operating on xmmN registers. To get to
> this codepath, I needed to have CONFIG_DRM_AMD_DC set during
> compilation. I've attached a diff that adds this to files.amdgpu.
>
> A typical backtrace printed out by ddb is:
> breakpoint()
> vpanic()
> panic()
> fputrap()
> Xtrap16()
> dcn10_create_resource_pool()
> [...]
>
> (I had to type it manually from a picture snapped on my
> phone, so, no offsets, if any of those are of interest,
> let me know.)

Probably not, except perhaps the one in dcn10_create_resource_pool to
confirm that it is where you think it is, in dcn_bw_update_from_pplib.

> There's a missing call from the backtrace that (I think)
> gets eaten by the trap jump: dcn_bw_update_from_pplib().
> (It's in amdgpu_dcn_calcs.c)

That's via dcn10_resource_construct, I assume? (which is a single-use
static that presumably gets compiled away)

The missing call is dcn_bw_update_from_pplib(), and it's definitely where the exception is thrown. I confirmed this with debug printfs in my local build.

> [...]
The logic looks like this:
[...]
So the return address of fputrap will always live in the Xtrap16
symbol, not the Xtrap19 one, even if it gets there by trap 19.

Neat, and thanks.

> [...]

This looks like a mistake on my part. It's possible that we never
noticed with the crypto code because it largely doesn't deal in
floating-point exceptions, and that's all that we've use the FP/SIMD
unit for in the kernel so far.

But we should have set the MXCSR (and FPSW/FPCW, if that matters) to a
reliable state. And we need to do that anyway for crypto on CPUs with
the MCDT bug (https://gnats.netbsd.org/57230).

Since we're definitely not in a position to handle floating-point
exception traps in the kernel, I just committed a change to set MXCSR
to 0x1fbf (all exception status bits set, denormals-are-zero disabled,
all exception trap mask bits set, round-to-nearest/ties-to-even,
flush-to-zero disabled).

Cool. Because this looked like the right way to do it.

Does that change anything?

> Unless I add another function call to DC_FP_START() that
> masks all the non-fatal FP traps in MXCSR. I tried
> setting it to 0x00001d00 and 0x00009d40. The former just
> masks the non-fatal traps and the latter tries to set "do
> sane things with edge cases" flags. (If I try to mask
> MXCSR in fpu_kern_enter(), then some of the crypto code
> breaks.)

Can you be more specific about the crypto code breaking? Does it
still break with the change I just committed? (I verified all the
self-tests at boot run under qemu before committing, of course, but
it's possible I broke something on real hardware.)

https://mail-index.netbsd.org/source-changes/2023/02/25/msg143547.html

I will try this as soon as I can. Probably tomorrow evening. I think I may have gotten the mask wrong when modifying fpu_kern_enter() the first time, so there's a good chance this will work. (I switched to the separate function call just to isolate my changes to the AMDGPU driver and figured we'd have this discussion at this time :-) Hopefully I don't reproduce the crash I saw in the crypto code.