Current-Users archive


Re: AMDGPU Driver patches/bugs



On Tue, Feb 21, 2023 at 3:40 AM Taylor R Campbell <riastradh%netbsd.org@localhost> wrote:
> [...]
> By the way, you may be able to just add `load amdgpu' to boot.cfg
> instead of compiling a custom kernel with amdgpu.
>
> Loading amdgpu as a module also has the advantage that it doesn't
> break dtrace (due to annoying technical restrictions in CTF which the
> kernel violates when amdgpu is statically linked because it is so
> large).
> [...]

I'll take a look at this.  I'm probably going to continue to build a
custom kernel for the fun of it (but I'll also regularly test things
with the generic kernel).

> > The problem with the doorbell code is that the Linux code uses
> > adev->doorbell.ptr + index to get the address to write to.  ptr is
> > ultimately a pointer to a 32 bit wide value (rather than the 64 bit
> > wide value it actually is :-/ ), so the compiler's pointer math
> > multiplies index by 4 instead of 8, as the NetBSD dev who wrote the
> > code would have expected.
>
> Amazing!  I must have stared at that code for hours trying to track
> down the ring test failures, without realizing that the pointer was
> typed 32-bit instead of 64-bit.

It was dumb.  I think the main thing that let me see it was the left
shifts giving me an unsettled feeling, plus cross-checking against how
Linux works (because that's the functioning version).
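
For anyone following along, here is a minimal sketch of the pointer-math
issue.  It is not the driver code: struct fake_doorbell and both helper
names are made up for illustration, and the only thing it assumes is
what is described above, namely that ptr is typed as a pointer to a
32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the doorbell aperture; not the real struct. */
struct fake_doorbell {
	uint32_t *ptr;	/* typed 32-bit, as in the Linux code */
};

/*
 * Byte offset reached by "ptr + index".  The compiler scales index by
 * sizeof(*ptr), which is 4 here.  The caller must keep index within
 * the buffer that ptr points into.
 */
static size_t
offset_via_u32_ptr(const struct fake_doorbell *db, size_t index)
{
	assert(db->ptr != NULL);
	return (size_t)((const char *)(db->ptr + index) -
	    (const char *)db->ptr);
}

/* Byte offset if you assume 64-bit slots and scale index by 8 by hand. */
static size_t
offset_assuming_u64_slots(size_t index)
{
	return index * sizeof(uint64_t);
}
```

With index 3, the first helper gives 12 bytes and the second gives 24,
so code that scales by 8 ends up writing to a different doorbell than
the Linux code would.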

> ...I don't suppose you have another trick up your sleeve for the
> radeon driver, do you?  We've also been seeing intermittent ring test
> failures at boot, but it doesn't use any 64-bit doorbells, so this
> trick doesn't work, alas.

Mostly my trick was running the code many (many) times, adding lots of
debug printfs, and reading it carefully.  And having it block my laptop
from even booting (well, it works with genfb just fine).  I'll see about
getting my desktop running against both the radeon and AMDGPU drivers
and see if I can reproduce it.

I was going to try the radeon driver again, because I want to see if my wayland
compositor works better against it than the AMDGPU driver (I'm getting some
weird corruption problems with my compositor that do not happen under Linux,
but that's probably my code).

> > (The driver blows up spectacularly shortly thereafter by causing a
> > floating point exception in kernel mode.  I don't have a full fix for
> > that yet.  The thing I did try that seems to get further causes the
> > screen to go blank.  I have a plan for debugging this, but I haven't
> > gotten there yet.)
>
> If you have a stack trace or crash dump I might be able to help.  The
> amdgpu driver apparently uses FP/SIMD instructions in the kernel, and
> I wired it up to NetBSD's mechanism for allowing it to do that, but I
> don't know if I've ever seen those parts of the code get hit and
> perhaps I missed something.

Hopefully I'll have time tonight to re-run that and write up the
details.  I would definitely love to have your thoughts on what's going
on, so I'll get to it shortly.

You probably haven't seen the SIMD instructions get hit because I don't think
they were getting built (or if they were, the code path to running them wasn't
built due to an #ifdef).  I'll double check on my laptop, and I might have more
patches to send in.

> > I've attached patches.  Should I open a bug?  Send these to the kernel
> > mailing list?
>
> Patches applied, thanks!  I tweaked them a little bit, including to
> fix an arithmetic overflow bug that you had copied & pasted from one
> Taylor R Campbell, riastradh%NetBSD.org@localhost, in kern_ksyms.c...oops.  (Fix
> also applied in kern_ksyms.c now.)
>
> Feel free to file PRs with patches and/or cc me and tech-kern -- I
> don't always follow current-users.

Awesome!  I see them in the source tree.  Thank you for fixing the
overflow; I skipped it because I was juggling too many other things in
my head and I realized it wasn't likely to actually trip given current
hardware (but no sense in setting traps for future generations).
(rdoorbell64 was just an oversight: I was only dealing with code that
wrote to the doorbells and hadn't tripped over reading from them.
Hopefully this clears that rake from the yard.)

I'll send the FP stuff to tech-kern and CC you.  For the PRs, that's
the sendpr.cgi on netbsd.org, right?

