tech-kern archive
Re: [RFC] Design considerations for XSAVE Extended Area getters/setters
On Tue, 2019-05-28 at 18:08 +0200, Kamil Rytarowski wrote:
> On 28.05.2019 15:20, Michał Górny wrote:
> > Hi,
> >
> > After implementing most of PT_GETXSTATE/PT_SETXSTATE and getting some
> > comments requiring major changes anyway, I'm starting to wonder whether
> > the approach I've followed is actually the best one. This is especially
> > important now that I'm pretty sure that we can't rely on fixed offsets
> > in XSAVE area (courtesy of CPUID data from AVX512 CPU). I'd like to
> > query your opinion on what would be the best API to provide.
> >
> > I see three main options:
> >
> > a. 'raw' XSAVE data,
> >
> > b. 'normalized' XSAVE data,
> >
> > c. split calls for individual components.
> >
> > So far I've followed idea a. I'll describe each one briefly below.
> >
> >
> > The problem
> > -----------
> > I'm trying to expose additional register types supported by newer x86
> > processors to the debugger. Currently we're limited to SSE registers;
> > however, the extended XSAVE area provides support for AVX, AVX-512
> > and possible future extensions.
> >
> > The XSAVE area is dynamically sized, with control over which extensions
> > are included. We need to perform a bunch of CPUID calls to determine
> > the appropriate buffer size and the offsets/sizes of the individual data
> > components inside it. The exact offsets depend on the processor
> > in question, and may differ depending on which intermediate extensions
> > are implemented.
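> >
> > For reference, those offsets and sizes come from CPUID leaf 0x0d.
> > A rough userland sketch of the enumeration (standard, non-compacted
> > format; using the compiler's <cpuid.h>) could look like this:
> >
> >     #include <cpuid.h>
> >     #include <stdint.h>
> >     #include <stdio.h>
> >
> >     int
> >     main(void)
> >     {
> >         unsigned int eax, ebx, ecx, edx;
> >         uint64_t xcr0_mask;
> >
> >         /* Sub-leaf 0: supported XCR0 bits, plus the size of the XSAVE
> >          * area for the currently enabled components (EBX). */
> >         __get_cpuid_count(0x0d, 0, &eax, &ebx, &ecx, &edx);
> >         xcr0_mask = (uint64_t)edx << 32 | eax;
> >         printf("XSAVE area size: %u bytes\n", ebx);
> >
> >         /* Sub-leaves 2+: size (EAX) and offset (EBX) of every
> >          * extended component the CPU supports. */
> >         for (unsigned int i = 2; i < 63; i++) {
> >             if ((xcr0_mask & ((uint64_t)1 << i)) == 0)
> >                 continue;
> >             __get_cpuid_count(0x0d, i, &eax, &ebx, &ecx, &edx);
> >             printf("component %u: size %u, offset %u\n", i, eax, ebx);
> >         }
> >         return 0;
> >     }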
> >
> >
> > a. 'raw' XSAVE data
> > -------------------
> > This is the solution used by Linux and FreeBSD. It adds new PT_*XSTATE
> > requests that expose and operate on raw data used by XSAVE/XRSTOR
> > instructions.
> >
> > Since it uses the processor's data format, userland can reuse the code
> > for other platforms. In fact, if new XSAVE components are added,
> > the support in userland can be developed in parallel to kernel
> > development.
> >
> > On the minus side, it shifts the whole burden of implementation
> > onto userland. This didn't go well for Linux and FreeBSD -- neither
> > implementation accounts for providing component offsets via the API,
> > so userland has to establish them manually via CPUID. This isn't
> > going to work well for storing XSAVE data in coredumps.
> >
> > In my opinion, to implement this correctly we'd need another
> > dynamically-sized PT_* request exposing component offsets (and sizes?)
> > from CPUID. However, at this point I start doubting whether the added
> > complexity is worth it, given that sticking to the pure machine format
> > has no clear advantage. Hence, option b.
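> >
> > To make that concrete, such a request could return something along
> > these lines (names made up purely for illustration):
> >
> >     #include <stdint.h>
> >
> >     struct xstate_component {
> >         uint32_t    xc_offset;    /* offset into the raw XSAVE area */
> >         uint32_t    xc_size;      /* size of the component in bytes */
> >     };
> >
> >     struct xstate_layout {
> >         uint64_t    xl_xcr0;        /* components enabled by the kernel */
> >         uint32_t    xl_size;        /* total size of the XSAVE area */
> >         uint32_t    xl_ncomponents;
> >         struct xstate_component xl_comp[];  /* indexed by XCR0 bit */
> >     };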
> >
> >
> > b. 'normalized' XSAVE data
> > --------------------------
> > This is similar to the previous option, except that rather than exposing
> > the processor's raw XSAVE structure we use a fixed 'struct xstate'.
> > The kernel is responsible for establishing the correct offsets and moving data
> > to and from 'struct xstate'. Userland can trivially access the struct
> > fields, and does not need to be concerned about internals. Core dumps
> > use fixed offsets.
> >
> > Similarly to the previous option, I'd keep the dynamic size for
> > the structure -- make it possible to add more fields in the future,
> > and expose them to userland when it provides a large enough buffer.
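> >
> > As a usage sketch (a fragment; the calling convention, with the data
> > argument carrying the caller's buffer size, is just an assumption here,
> > and 'child' stands for the traced process):
> >
> >     struct xstate xst;
> >
> >     if (ptrace(PT_GETXSTATE, child, &xst, sizeof(xst)) == -1)
> >         err(EXIT_FAILURE, "PT_GETXSTATE");
> >     /* An old binary passing a smaller size still gets the fields it
> >      * knows about; a newer kernel simply fills in what fits. */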
> >
> > I don't see any significant disadvantages compared to option a.,
> > and the small performance cost does not seem to matter here. We lose
> > the ability to add support for new XSAVE types in userland ahead of
> > the kernel, but that's probably a moot point anyway.
> >
> >
> > c. Split calls for individual components
> > ----------------------------------------
> > This is a completely different option which will probably make
> > the implementation a bit simpler at the cost of more repeated code.
> > In this model, we add PT_* requests for each interesting XSAVE
> > component. At the moment, this would mean 4 new requests:
> >
> > - PT_*AVXREGS, for the AVX component,
> >
> > - PT_*AVX512REGS for the 3 AVX-512 components.
> >
> > Unlike the other options, we don't use an extensible buffer, just fixed
> > structs. We need to add new requests whenever new components are added.
> > Userland applications need to issue multiple PT_* calls if they need
> > specific data. For example, to get the value of zmm0, you'd need to
> > call PT_GETFPREGS (or PT_GETXMMREGS on i386), PT_GETAVXREGS
> > and PT_GETAVX512REGS to get all the split parts of it.
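> >
> > In other words, reassembling zmm0 would look roughly like this
> > (struct and field names made up for illustration):
> >
> >     struct fpreg fpr;            /* PT_GETFPREGS payload */
> >     struct avxregs avx;          /* proposed PT_*AVXREGS payload */
> >     struct avx512regs avx512;    /* proposed PT_*AVX512REGS payload */
> >     uint8_t zmm0[64];
> >
> >     ptrace(PT_GETFPREGS, child, &fpr, 0);
> >     ptrace(PT_GETAVXREGS, child, &avx, 0);
> >     ptrace(PT_GETAVX512REGS, child, &avx512, 0);
> >
> >     memcpy(&zmm0[0],  &fpr.xmm[0],          16);  /* bits 0..127   */
> >     memcpy(&zmm0[16], &avx.ymm_hi128[0],    16);  /* bits 128..255 */
> >     memcpy(&zmm0[32], &avx512.zmm_hi256[0], 32);  /* bits 256..511 */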
> >
> >
> > Side point: non-aligned PT_*FPREGS on i386/amd64
> > ------------------------------------------------
> > As a side point, the current interfaces for getting FPU regs on i386
> > and amd64 are not aligned. PT_*FPREGS uses the old FSAVE struct on i386,
> > while the (newer) FXSAVE struct is used on amd64. The former lacks SSE
> > registers.
> >
> > This is somewhat worked around by PT_GETXMMREGS on i386, which uses
> > the FXSAVE format and so provides the missing SSE registers. However, our
> > current implementation of compat32 does not allow it to work without
> > either hacks or major design changes.
> >
> > Both options a. and b. use a struct compatible with FXSAVE, including
> > both x87 FPU and SSE registers. Backwards compatibility with FSAVE
> > and FXSAVE is easy to do with the existing kernel code. Therefore,
> > PT_*XSTATE can easily replace PT_*FPREGS/PT_*XMMREGS, providing
> > an aligned i386/amd64 API for getting FPU registers.
> >
> >
> > Summary
> > -------
> > In my opinion, the best solution here is b. It's extensible to future
> > register types, easy to use from the userland perspective, and reduces
> > the number of ptrace() calls a debugger needs to issue. It also
> > circumvents compat32 limitations.
> >
> > What do you think?
> >
>
> What's the format of fixed xsave? Is it already used/defined somewhere?
No. The idea was to just put all the interesting components into a bigger
struct, so it would consist of a series of sub-structures whose internal
format matches the one used by XSAVE.
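Something along these lines, roughly (a sketch with made-up field names,
not a final layout):

    struct xstate {
        uint8_t     xs_fxsave[512];     /* FXSAVE-compatible: x87 + SSE */
        uint64_t    xs_xstate_bv;       /* which components are valid */
        /* AVX: high 128 bits of ymm0..ymm15 */
        uint8_t     xs_ymm_hi128[16][16];
        /* AVX-512: opmask registers, high 256 bits of zmm0..zmm15,
         * and the full zmm16..zmm31 */
        uint64_t    xs_opmask[8];
        uint8_t     xs_zmm_hi256[16][32];
        uint8_t     xs_hi16_zmm[16][64];
        /* future components get appended here */
    };
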
> Personally I'm in favor of (c) as long as it can be implemented and we
> won't run into different layouts of PT_*AVXREGS and similar depending
> on the CPU.
>
> Rationale for (c).
>
> 1. This is the current approach (ppc has: *VECREGS, i386 has *XMMREGS)
> in existing ports.
>
> 2. Querying the FPU state is not a performance-critical operation in
> debuggers. It's statistically a rare operation, as we usually just
> ask for the GP registers.
Sure. However, I'd find it weird to have a design that relies on three
separate requests to obtain the data for a single register.
> 3. We can skip extra logic in the kernel: no attempt to normalize
> the registers, just output them as they are, and leave that logic to
> the debugger.
>
> 4. On future CPUs the normalized layout of the FPU registers may no
> longer match what's needed and we might have to reformat it.
There is no difference in internal layout or logic between b. and c.
In either case, we need to perform XSAVE, process it and copy the data
into an internal structure. The only difference is that in b. we handle it
all in one request, and in c. we do three requests copying different
parts of XSAVE to three different buffers.
> 5. Using XSAVE calls can be misleading on older x86 CPUs without XSAVE
> support.
It's called XSTATE, so there should not be much confusion about
the naming.
> 6. Adding just 2 extra calls will be more compatible with existing
> software and core(5) files. Existing software won't stop working due
> to e.g. the lack of a PT_FPREGS dump in core files.
There is no reason not to include both in core files, at least for
the time being.
> 7. Personally I don't like that there is an overlap/duplication between
> proposed XSAVE and other existing operations.
There is overlap/duplication between FPREGS and XMMREGS already.
>
> The compat32 case for XMMREGS can be handled by defining XMMREGS inside
> #if _KERNEL ... #endif and mapping it onto i386. It won't be pretty, but
> it will do the job.
That's what I said.
--
Best regards,
Michał Górny