[RFC] Design considerations for XSAVE Extended Area getters/setters

To: tech-kern <tech-kern%NetBSD.org@localhost>
Subject: [RFC] Design considerations for XSAVE Extended Area getters/setters
From: Michał Górny <mgorny%NetBSD.org@localhost>
Date: Tue, 28 May 2019 15:20:00 +0200

Hi,

After implementing most of PT_GETXSTATE/PT_SETXSTATE and getting some
comments requiring major changes anyway, I'm starting to wonder whether
the approach I've followed is actually the best one.  This is especially
important now that I'm pretty sure that we can't rely on fixed offsets
in the XSAVE area (courtesy of CPUID data from an AVX-512 CPU).  I'd
like to ask for your opinion on what would be the best API to provide.

I see three main options:

a. 'raw' XSAVE data,

b. 'normalized' XSAVE data,

c. split calls for individual components.

So far I've followed option a.  I'll describe each one briefly.


The problem
-----------
I'm trying to expose additional register types supported by newer x86
processors to the debugger.  Currently we're limited to SSE registers;
however, the XSAVE extended area provides support for AVX, AVX-512
and possible future extensions.

The XSAVE area is dynamically sized, and the set of components included
in it is configurable.  We need to perform a bunch of CPUID calls to
determine the appropriate buffer size and the offsets/sizes of
individual data components inside it.  The exact offsets depend on
the processor in question, and may differ depending on which
intermediate extensions it implements.
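
To make the CPUID part concrete, here is a minimal sketch of querying
the size and offset of one extended component via CPUID leaf 0DH.
The names struct xsave_comp and xsave_comp_query() are made up for
this example, and it assumes the <cpuid.h> header shipped with GCC
and Clang; on non-x86 hosts it simply reports failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>	/* GCC/Clang helper header (assumption) */
#endif

/* Hypothetical helper type: layout of one XSAVE state component. */
struct xsave_comp {
	uint32_t size;		/* CPUID.(EAX=0DH,ECX=i):EAX, in bytes */
	uint32_t offset;	/* CPUID.(EAX=0DH,ECX=i):EBX, standard format */
};

/*
 * Query size and offset of extended component i (i >= 2; components
 * 0 and 1 are x87 and SSE, which live in the legacy FXSAVE region).
 * Returns 0 on success, -1 if CPUID leaf 0DH is not available.
 * A component the CPU does not support is reported with size == 0.
 */
static int
xsave_comp_query(unsigned int i, struct xsave_comp *out)
{
	assert(out != NULL && i >= 2 && i < 63);
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0x0d, i, &eax, &ebx, &ecx, &edx))
		return -1;
	out->size = eax;
	out->offset = ebx;
	return 0;
#else
	(void)out;
	return -1;
#endif
}
```

For example, xsave_comp_query(2, &c) asks about the AVX component
(upper halves of the ymm registers).  The returned offset is relative
to the start of the XSAVE area and only applies to the standard
(non-compacted) format.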


a. 'raw' XSAVE data
-------------------
This is the solution used by Linux and FreeBSD.  It adds new PT_*XSTATE
requests that expose and operate on raw data used by XSAVE/XRSTOR
instructions.

Since it uses the processor's data format, userland can reuse code
written for other platforms.  In fact, if new XSAVE components are
added, userland support can be developed in parallel with kernel
development.

On the minus side, it shifts the whole burden of implementation
onto userland.  This didn't go well for Linux and FreeBSD -- both of
those implementations did not account for providing component offsets
via the API, and so require userland to establish them manually via
CPUID.  This isn't going to work well for storing XSAVE data in core
dumps, which may be inspected on a machine other than the one that
produced them.
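
As an illustration of what userland has to do with raw data, here is
a rough sketch of pulling the upper half of a ymm register out of
a raw, standard-format XSAVE buffer.  The function name and its
parameters are hypothetical; the important part is that avx_off has
to come from somewhere outside the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XSAVE_HDR_OFFSET	512	/* XSAVE header follows the legacy area */
#define XSTATE_BV_AVX		(UINT64_C(1) << 2)	/* component bit 2: AVX */

/* Decode a little-endian 64-bit value regardless of host byte order. */
static uint64_t
load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Copy the upper 128 bits of ymm<regno> out of a raw XSAVE buffer.
 * avx_off is the AVX component offset, which the caller must have
 * obtained separately (e.g. from CPUID); the buffer does not carry it.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int
raw_xsave_get_ymm_hi(const uint8_t *buf, size_t len, size_t avx_off,
    unsigned int regno, uint8_t out[16])
{
	assert(buf != NULL && out != NULL && regno < 16);
	if (len < XSAVE_HDR_OFFSET + 8)
		return -1;

	/* XSTATE_BV is the first 8 bytes of the XSAVE header. */
	if ((load_le64(buf + XSAVE_HDR_OFFSET) & XSTATE_BV_AVX) == 0) {
		/* Component is in its initial state: reads as zero. */
		memset(out, 0, 16);
		return 0;
	}
	if (avx_off > len || len - avx_off < ((size_t)regno + 1) * 16)
		return -1;
	memcpy(out, buf + avx_off + (size_t)regno * 16, 16);
	return 0;
}
```

A debugger reading a core dump would have no CPUID to ask for avx_off,
which is why the offsets would have to be stored or exposed somehow.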

In my opinion, to implement this correctly we'd need another
dynamically-sized PT_* request exposing component offsets (and sizes?)
from CPUID.  However, at this point I start doubting whether the added
complexity is worth it given no clear advantage of sticking to pure
machine format.  Hence, option b.


b. 'normalized' XSAVE data
--------------------------
This is similar to the previous option, except that rather than exposing
the processor's raw XSAVE structure we use a fixed 'struct xstate'.
The kernel is responsible for establishing the correct offsets, and for
moving data to and from 'struct xstate'.  Userland can trivially access
the struct fields, and does not need to be concerned about internals.
Core dumps use fixed offsets.

Similarly to the previous option, I'd keep the dynamic size for
the structure -- make it possible to add more fields in the future,
and expose them to userland when it provides a large enough buffer.
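
One possible shape for this, purely as a sketch: the field names,
the XS_VALID_* bits and xstate_copyout() below are invented for
illustration and are not an existing interface.  The idea is that
new components are only ever appended, and a component is reported
as valid only if the caller's buffer is large enough to hold it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XS_VALID_FPU	(UINT64_C(1) << 0)	/* x87 + SSE (FXSAVE part) */
#define XS_VALID_AVX	(UINT64_C(1) << 1)	/* upper halves of ymm regs */

/* Hypothetical normalized layout; offsets are fixed on every CPU. */
struct xstate {
	uint64_t xs_valid;		/* which fields below hold data */
	uint8_t  xs_fxsave[512];	/* FXSAVE-compatible area */
	uint8_t  xs_ymm_hi128[16][16];	/* bits 128..255 of ymm0..ymm15 */
	/* future components get appended here, never inserted */
};

/*
 * Copy as much of *src as fits into the caller's buffer.  Components
 * that do not fit completely have their valid bit cleared.  Returns
 * the number of bytes written, or 0 if even xs_valid does not fit.
 */
static size_t
xstate_copyout(const struct xstate *src, void *buf, size_t buflen)
{
	struct xstate tmp;
	size_t n;

	assert(src != NULL && buf != NULL);
	if (buflen < sizeof(tmp.xs_valid))
		return 0;

	tmp = *src;
	if (buflen < offsetof(struct xstate, xs_fxsave) +
	    sizeof(tmp.xs_fxsave))
		tmp.xs_valid &= ~XS_VALID_FPU;
	if (buflen < offsetof(struct xstate, xs_ymm_hi128) +
	    sizeof(tmp.xs_ymm_hi128))
		tmp.xs_valid &= ~XS_VALID_AVX;

	n = buflen < sizeof(tmp) ? buflen : sizeof(tmp);
	memcpy(buf, &tmp, n);
	return n;
}
```

An older debugger built against a shorter struct would keep working
unchanged, since it never sees valid bits for fields it did not make
room for.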

I don't see any significant disadvantages compared to option a., and
the small performance loss from moving the data does not seem to
matter here.  We lose the option to add support for new XSAVE types
in userland before the kernel, but it's probably a moot point anyway.


c. Split calls for individual components
----------------------------------------
This is a completely different option which will probably make
the implementation a bit simpler at the cost of more repeated code.
In this model, we add PT_* requests for each interesting XSAVE
component.  At the moment, this would mean four new requests
(a get/set pair each):

- PT_*AVXREGS, for the AVX component,

- PT_*AVX512REGS for the 3 AVX-512 components.

Unlike the other options, we don't use an extensible buffer but just
fixed structs.  We need to add new requests when new components are
added.  Userland applications need to issue multiple PT_* calls if they
need specific data.  For example, to get the value of the zmm0
register, you'd need to call PT_GETFPREGS (or PT_GETXMMREGS on i386),
PT_GETAVXREGS and PT_GETAVX512REGS to get all the split parts of it.
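
The reassembly step on the userland side could look roughly like
this.  zmm_assemble() is a hypothetical helper, and the three inputs
are assumed to have been fetched already by the three requests named
above (two of which are only proposed here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Combine the three pieces of one ZMM register (zmm0..zmm15) into
 * a single 64-byte value, least significant byte first:
 *   xmm        bits   0..127  (PT_GETFPREGS or PT_GETXMMREGS)
 *   ymm_hi128  bits 128..255  (proposed PT_GETAVXREGS)
 *   zmm_hi256  bits 256..511  (proposed PT_GETAVX512REGS)
 */
static void
zmm_assemble(const uint8_t xmm[16], const uint8_t ymm_hi128[16],
    const uint8_t zmm_hi256[32], uint8_t zmm[64])
{
	assert(xmm != NULL && ymm_hi128 != NULL);
	assert(zmm_hi256 != NULL && zmm != NULL);

	memcpy(zmm, xmm, 16);
	memcpy(zmm + 16, ymm_hi128, 16);
	memcpy(zmm + 32, zmm_hi256, 32);
}
```

Setting a register would need the reverse split plus three PT_SET*
calls, so a single logical register write is no longer one request.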


Side point: non-aligned PT_*FPREGS on i386/amd64
------------------------------------------------
As a side point, the current interfaces for getting FPU regs on i386
and amd64 are not aligned.  PT_*FPREGS uses the old FSAVE struct on
i386, while the (newer) FXSAVE struct is used on amd64.  The former
lacks SSE registers.

This is somewhat circumvented by adding PT_GETXMMREGS on i386 which uses
the FXSAVE format and adds missing registers to i386.  However, our
current implementation of compat32 does not allow it to work without
either hacks or major design changes.

Both options a. and b. use a struct that is compatible with FXSAVE and
includes both x87 FPU and SSE registers.  Backwards compatibility with
FSAVE and FXSAVE is easy to provide with the existing kernel code.
Therefore, PT_*XSTATE can easily replace PT_*FPREGS/PT_*XMMREGS,
providing an aligned i386/amd64 API for getting FPU registers.


Summary
-------
In my opinion, the best solution here is b.  It's extensible to future
register types, easy to use from the userland perspective, and reduces
the number of ptrace() calls a debugger needs to issue.  It also
circumvents compat32 limitations.

What do you think?

-- 
Best regards,
Michał Górny
