tech-kern archive


Re: Straw proposal: MI kthread vector/fp unit API



On Mon, 22 Jun 2020, Taylor R Campbell wrote:

> > Date: Mon, 22 Jun 2020 18:45:47 +0000 (UTC)
> > From: Eduardo Horvath <eeh%NetBSD.org@localhost>
> > 
> > I think this is sort of a half-measure since it restricts
> > coprocessor usage to a few threads.  If you want to, say, implement
> > the kernel memcopy using vector registers (the way sparc64 does)
> > this doesn't help and may end up getting in the way.
> 
> Why do you think this restricts it to a few threads or gets in the way
> of anything?
> 
> As I wrote in my original message:
> 
>    That way, for example, you can use (say) an AES encryption routine
>    aes_enc as a subroutine anywhere in the kernel, and an MD definition
>    of aes_enc can internally use AES-NI with the appropriate MD
>    fpu_kern_enter -- but it's a little cheaper to use aes_enc in an
>    FPU-enabled kthread.  This gave a modest measurable boost to cgd(4)
>    throughput in my preliminary experiments.
> 
> Note that the subroutine (here aes_enc, but it could in principle be
> memcpy too) works `anywhere in the kernel', not just restricted to a
> few threads.
> 
> The definition of aes_enc with AES-NI CPU instructions on x86 already
> works (https://mail-index.netbsd.org/tech-kern/2020/06/18/msg026505.html
> for details); just putting kthread_fpu_enter/exit around cgd_process
> in cgd.c improved throughput on a RAM-backed disk by about 20%
> (presumably mostly because it avoids zeroing the fpu registers on
> every aes_* call in that thread).

It sounded to me as if you set a flag in the kthread indicating that 
thread is allowed to use FPU instructions.  Maybe I'm missing something 
but from the description I assumed you created a kthread, set the flag, 
and now you can start using the FPU.

I suppose I could be mistaken and the flag is being controlled by 
kthread_fpu_enter()/_exit(), but in that case you have issues if you ever 
need to nest coprocessor usage.  
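
To make the nesting concern concrete, here is a toy userland model.  It
is only a sketch: toy_thread, naive_enter/naive_exit and
nest_enter/nest_exit are hypothetical names invented for illustration,
not the proposed kthread_fpu_enter()/_exit() API, and whether the actual
proposal behaves like either variant is not something this model
claims.  It contrasts a plain on/off flag with a variant whose enter
call returns the previous value so that the exit call can restore it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a per-thread "may use the FPU" flag (hypothetical). */
struct toy_thread {
	bool fpu_ok;
};

/* Naive pair: exit always clears the flag. */
void naive_enter(struct toy_thread *t) { t->fpu_ok = true; }
void naive_exit(struct toy_thread *t)  { t->fpu_ok = false; }

/* Nest-safe pair: enter hands back the old value, exit restores it. */
bool nest_enter(struct toy_thread *t)
{
	bool prev = t->fpu_ok;

	t->fpu_ok = true;
	return prev;
}

void nest_exit(struct toy_thread *t, bool prev) { t->fpu_ok = prev; }

/* Flag as seen by the outer section after an inner section has ended. */
bool outer_sees_fpu_naive(void)
{
	struct toy_thread t = { false };

	naive_enter(&t);	/* outer section begins */
	naive_enter(&t);	/* inner section begins */
	naive_exit(&t);		/* inner section ends and clears the flag */
	return t.fpu_ok;	/* outer section is still running */
}

bool outer_sees_fpu_nested(void)
{
	struct toy_thread t = { false };
	bool outer, inner;

	outer = nest_enter(&t);	/* outer section begins, outer == false */
	inner = nest_enter(&t);	/* inner section begins, inner == true */
	nest_exit(&t, inner);	/* inner section ends, flag stays set */
	(void)outer;		/* a real caller would pass this to nest_exit */
	return t.fpu_ok;	/* outer section is still running */
}
```

In the naive variant the inner exit turns the flag off underneath the
outer section; passing the previous value back to exit avoids that
without needing a counter.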

> > I'd do something simpler such as adding a MI routine to allocate or 
> > activate a temporary or permanent register save area that can be used by 
> > kernel threads.  
> > 
> > Then, if you want, in the coprocessor trap handler, if you are in 
> > kernel state you can check whether a kernel save area has been 
> > allocated and panic if not.
> 
> This sounds like a plausible alternative to disabling kpreemption in
> some cases, but it is also orthogonal to my proposal -- in an
> FPU-enabled kthread there is simply no need to allocate an extra save
> area at all because it's already allocated in the lwp pcb, so if a
> subroutine does use the FPU then it's cheaper to call that subroutine
> in an FPU-enabled kthread than otherwise.
> 
> You say it would be simpler -- can you elaborate on how it would
> simplify the implementations that already work on x86 and aarch64 by
> just adding and testing a new flag in a couple places, and enabling or
> disabling the CPU's FPU-enable bit?
> 
> https://anonhg.netbsd.org/src-all/rev/e83ef87e4f53
> https://anonhg.netbsd.org/src-all/rev/7ec4225df101

Frankly, I have not looked at either the x86 or aarch64 implementations, 
and it's been a very long time since I last looked at the sparc64 
implementation.

The SPARC has always had lazy FPU save logic.  The fpstate structure is 
not part of the pcb and is allocated on first use.  

When I added the block mem*() routines I piggybacked on that 
implementation.  When a kthread is created the FPU starts out disabled and 
the pointer to the fpstate is NULL.  If userland decides to use the FPU, a 
kernel trap is generated, an fpstate is allocated and attached to both the 
kthread and the CPU structure, and the FPU is enabled.  On context 
switches the FPU is disabled.
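
A toy userland model of that lazy scheme, to make the sequence
concrete.  This is a sketch under stated assumptions, not the sparc64
source: toy_fpstate, toy_lwp, toy_cpu and the toy_* functions are all
hypothetical names, calloc() stands in for the kernel allocator, and a
bool stands in for the hardware FPU-enable bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical save area for the FPU registers. */
struct toy_fpstate {
	unsigned long regs[32];
};

/* Hypothetical thread: the save area is not embedded, only pointed to. */
struct toy_lwp {
	struct toy_fpstate *fpstate;	/* NULL until the first FPU use */
};

/* Hypothetical per-CPU state. */
struct toy_cpu {
	bool fpu_enabled;		/* stand-in for the FPU-enable bit */
	struct toy_fpstate *fpstate;	/* save area of the current FPU user */
};

/* Thread creation: no save area yet. */
void toy_lwp_init(struct toy_lwp *l)
{
	l->fpstate = NULL;
}

/*
 * "FPU disabled" trap: allocate the save area on first use, record it
 * in both the thread and the CPU structure, then enable the FPU.
 * Returns 0 on success, -1 if the allocation fails.
 */
int toy_fpu_trap(struct toy_cpu *c, struct toy_lwp *l)
{
	if (l->fpstate == NULL) {
		l->fpstate = calloc(1, sizeof(*l->fpstate));
		if (l->fpstate == NULL)
			return -1;
	}
	c->fpstate = l->fpstate;
	c->fpu_enabled = true;
	return 0;
}

/* Context switch: just disable the FPU; the next use traps again. */
void toy_ctx_switch(struct toy_cpu *c)
{
	c->fpu_enabled = false;
}

/* Thread teardown: release the save area if one was ever allocated. */
void toy_lwp_fini(struct toy_lwp *l)
{
	free(l->fpstate);
	l->fpstate = NULL;
}
```

A thread that never touches the FPU never pays for a save area, and a
second trap after a context switch reuses the area from the first one.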

In a very simplistic description of how I implemented the block copy 
operations, they:

1) Check whether the FPU is dirty; if it is, save the state to the 
fpstate in the CPU structure.

2) Allocate a new fpstate (usually on the stack) and store a pointer to it 
in the CPU structure.  Save the current kthread's fpstate pointer on the 
stack and replace it with a pointer to the new fpstate.

3) When the block operation is complete, clear the FPU dirty bits, disable 
the FPU, clear the pointer in the CPU structure, and restore the fpstate 
pointer in the kthread.
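
The three steps above can be sketched as a toy userland model.  Again
this is only an illustration under assumptions: blk_fpstate, blk_lwp,
blk_cpu, blk_enter, blk_exit and blk_op are hypothetical names, an
array stands in for the hardware registers, and enabling the FPU in
blk_enter is an assumption (the steps do not say where that happens).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLK_NREGS 32

/* Hypothetical save area. */
struct blk_fpstate {
	unsigned long regs[BLK_NREGS];
};

/* Hypothetical thread with a pointer to its save area. */
struct blk_lwp {
	struct blk_fpstate *fpstate;
};

/* Hypothetical per-CPU state. */
struct blk_cpu {
	unsigned long fpregs[BLK_NREGS]; /* stand-in for the FPU registers */
	bool fpu_enabled;
	bool fpu_dirty;
	struct blk_fpstate *fpstate;	 /* save area for what is in fpregs */
};

/*
 * Steps 1 and 2.  'tmp' is the caller's temporary save area, usually on
 * its stack.  Returns the thread's previous fpstate pointer, which the
 * caller keeps on its stack and hands to blk_exit().
 */
struct blk_fpstate *
blk_enter(struct blk_cpu *c, struct blk_lwp *l, struct blk_fpstate *tmp)
{
	struct blk_fpstate *saved;

	/* 1) Dirty FPU: flush the registers to the CPU's current save area. */
	if (c->fpu_dirty && c->fpstate != NULL)
		memcpy(c->fpstate->regs, c->fpregs, sizeof(c->fpregs));

	/* 2) Install the temporary area in the CPU and in the thread. */
	c->fpstate = tmp;
	saved = l->fpstate;
	l->fpstate = tmp;
	c->fpu_enabled = true;	/* assumption: needed for the block op */
	return saved;
}

/* Step 3: drop the temporary state and restore the thread's pointer. */
void
blk_exit(struct blk_cpu *c, struct blk_lwp *l, struct blk_fpstate *saved)
{
	c->fpu_dirty = false;
	c->fpu_enabled = false;
	c->fpstate = NULL;
	l->fpstate = saved;
}

/* A pretend block operation that clobbers every FPU register. */
void
blk_op(struct blk_cpu *c, struct blk_lwp *l)
{
	struct blk_fpstate tmp;
	struct blk_fpstate *saved = blk_enter(c, l, &tmp);

	memset(c->fpregs, 0xff, sizeof(c->fpregs));
	c->fpu_dirty = true;
	blk_exit(c, l, saved);
}
```

Because the FPU is left disabled on the way out, the previous user's
state stays in its own save area and is only reloaded if that user
traps on the FPU again.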

Remembering all this stuff is making my brain hurt.

Eduardo


