NetBSD-Bugs archive


kern/57230: set DIT/DOITM bit on arm/x86



>Number:         57230
>Category:       kern
>Synopsis:       set DIT/DOITM bit on arm/x86
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Feb 14 09:55:00 +0000 2023
>Originator:     Taylor R Campbell
>Release:        current
>Organization:
The DoitBSD Foundation
>Environment:
gotta roast it a wee tiny bit more for security
>Description:
Cryptographic secrets can leak through side channels based on timing when cryptographic operations on them take variable time that depends on the secrets.

Some CPU instructions, such as addition and bitwise XOR, traditionally run in constant time independent of their operands -- there's no temptation to make bitwise XOR take a different number of cycles depending on the inputs.  Others, such as division, conditional branches, or loads and stores, typically run in variable time for various reasons (division algorithms, branch prediction, cache hit or miss depending on load/store address).  Modern cryptographic software is therefore often restricted to instructions that traditionally run in constant time.
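
To illustrate the distinction, here is a minimal sketch of a comparison that leaks through an early-exit branch next to one built only from XOR, OR, subtraction, and shifts.  The names leaky_equal and ct_equal are made up for this example; NetBSD already provides consttime_memequal(3) for real use.  Nothing here stops a compiler from introducing a branch, and without DIT/DOITM nothing guarantees the hardware runs these instructions in constant time either.

```c
#include <stddef.h>

/*
 * Variable time: returns at the first mismatching byte, so the running
 * time reveals how long the matching prefix is.
 */
int
leaky_equal(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a, *q = b;
	size_t i;

	for (i = 0; i < n; i++) {
		if (p[i] != q[i])
			return 0;
	}
	return 1;
}

/*
 * Traditionally constant time for a given n: always touches every byte
 * and uses no secret-dependent branch or memory address.
 */
int
ct_equal(const void *a, const void *b, size_t n)
{
	const volatile unsigned char *p = a, *q = b;
	unsigned int diff = 0;
	size_t i;

	for (i = 0; i < n; i++)
		diff |= p[i] ^ q[i];

	/*
	 * diff is in [0, 255].  If diff == 0, diff - 1 wraps to UINT_MAX
	 * and bit 8 is set; otherwise diff - 1 is in [0, 254] and bit 8
	 * is clear.
	 */
	return 1 & ((diff - 1) >> 8);
}
```

Both functions return 1 for equal buffers and 0 otherwise; only the second avoids a data-dependent branch at the source level.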

However, this behaviour is merely _traditional_, a consequence of the obvious way to implement these instructions in logic gates.  It has, until recently, never been _guaranteed_.  Arm and Intel recently added some architectural state bits to enable a guarantee:

- ARMv8.4-DIT (mandatory in Armv8.4) adds PSTATE.DIT (aarch64) and CPSR.DIT (aarch32) bits, for Data Independent Timing.  When this bit is set, certain instructions are guaranteed to run in time independent of the values of any register operands, and loads and stores are guaranteed to run in time independent of the values being loaded or stored (but not independent of the address).  Details: https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/DIT--Data-Independent-Timing

- Newer Intel CPUs have an MSR with a DOITM bit, for Data Operand Independent Timing Mode, similar to the Arm DIT bit.  Details: https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html

- Newer Intel CPUs also appear to have a bug where the DOITM bit isn't quite enough for some instructions that were previously advertised to have data-operand independent timing, such as PMULDQ -- when the floating-point exception status bits are unset in the MXCSR, these instructions sometimes have data-dependent timing.  Setting all the floating-point exception status bits in the MXCSR in advance avoids this leak.  Details: https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/mcdt-data-operand-independent-timing-instructions.html

(Unfortunately, the page on instructions with MXCSR Dependent Timing (MCDT) is resistant to archiving in the Internet Archive for some reason.  Currently the list is: PMADDUBSW PMADDWD PMULDQ PMULHRSW PMULHUW PMULHW PMULLD PMULLW PMULUDQ VPLZCNTD VPLZCNTQ VPMADD52HUQ VPMADD52LUQ VPMADDUBSW VPMADDWD VPMULDQ VPMULHRSW VPMULHUW VPMULHW VPMULLD VPMULLQ VPMULLW VPMULUDQ)
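
As a rough sketch of the bit manipulation involved on x86: the MSR name and number (IA32_UARCH_MISC_CTL, 0x1b01) and the DOITM bit position (bit 0) below are my reading of Intel's guidance and should be checked against the current documentation.  The helper names are made up for this example.  The MXCSR exception status flags are bits 0..5 (IE, DE, ZE, OE, UE, PE).  Writing the MSR needs ring 0 (wrmsr), so only the value computation is shown for it; the MXCSR part can be tried from userland with the standard SSE intrinsics.

```c
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>		/* _mm_getcsr, _mm_setcsr */
#endif

/* Assumed from Intel's DOITM guidance; verify before relying on it. */
#define MSR_IA32_UARCH_MISC_CTL		0x1b01
#define IA32_UARCH_MISC_CTL_DOITM	(UINT64_C(1) << 0)

/* MXCSR bits 0..5: IE, DE, ZE, OE, UE, PE exception status flags. */
#define MXCSR_EXCEPTION_FLAGS		UINT32_C(0x3f)

/* Value to write back to the MSR, given its current contents. */
uint64_t
doitm_enable(uint64_t misc_ctl)
{
	return misc_ctl | IA32_UARCH_MISC_CTL_DOITM;
}

/* MXCSR value with every exception status flag already raised. */
uint32_t
mxcsr_set_exception_flags(uint32_t mxcsr)
{
	return mxcsr | MXCSR_EXCEPTION_FLAGS;
}

#if defined(__SSE__)
/* Apply the MXCSR part to the calling thread. */
void
mxcsr_apply_exception_flags(void)
{
	_mm_setcsr(mxcsr_set_exception_flags(_mm_getcsr()));
}
#endif
```

Starting from the power-on default MXCSR of 0x1f80, mxcsr_set_exception_flags yields 0x1fbf.  In a kernel, the MSR half would be a read of MSR_IA32_UARCH_MISC_CTL, doitm_enable on the result, and a write back, done only on CPUs that advertise DOITM support.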

Some options:

1. Set DIT on Arm and DOITM/MXCSR on Intel in the kernel unconditionally, 100% of the time.
2. Set DIT on Arm and DOITM on Intel in the kernel unconditionally, 100% of the time.  Set the MXCSR exception status bits in fpu_kern_enter.
3. Set DIT on Arm and DOITM/MXCSR on Intel in the kernel in fpu_kern_enter, and restore them in fpu_kern_leave.
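
The difference between options (1) and (3) can be sketched against a simulated CPU.  Everything here is hypothetical: struct sim_cpu stands in for the real PSTATE.DIT bit, DOITM MSR, and MXCSR, and the sim_* functions are not the actual fpu_kern_enter/fpu_kern_leave.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated per-CPU state standing in for DIT/DOITM and the MXCSR. */
struct sim_cpu {
	bool		dit;	/* DIT (Arm) or DOITM (Intel) */
	uint32_t	mxcsr;	/* Intel only */
};

/* What option (3) has to remember between enter and leave. */
struct sim_saved {
	bool		dit;
	uint32_t	mxcsr;
};

/* Option (1): set once at CPU bring-up and never clear. */
void
sim_cpu_init(struct sim_cpu *cpu)
{
	cpu->dit = true;
	cpu->mxcsr |= 0x3f;	/* all exception status flags */
}

/* Option (3): set on entry to a bracketed region... */
struct sim_saved
sim_enter(struct sim_cpu *cpu)
{
	struct sim_saved saved = { cpu->dit, cpu->mxcsr };

	cpu->dit = true;
	cpu->mxcsr |= 0x3f;
	return saved;
}

/* ...and put the previous state back on exit. */
void
sim_leave(struct sim_cpu *cpu, struct sim_saved saved)
{
	cpu->dit = saved.dit;
	cpu->mxcsr = saved.mxcsr;
}
```

With option (3), any code that runs outside a sim_enter/sim_leave pair sees the bits clear; with option (1) there is no such window.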

Exactly what performance impact to expect is unclear -- maybe Arm and Intel will do something bonkers and make XOR take longer with the DIT/DOITM bit set, but that seems unlikely because you'd have to go out of your way to design an XOR instruction that takes variable time anyway.

More likely, I think, some of the fancier vectorized operations, which already have a long latency that is tempting to make slightly variable -- e.g., maybe taking one cycle longer to set a condition code at the end -- might be altered to always take the maximum latency.  Of course, many instructions which currently run in variable time anyway, such as division, are unaffected by the DIT/DOITM bit and must still be avoided for handling secrets.

Only some cryptographic code in the kernel is bracketed by fpu_kern_enter/leave -- just the code with MD vectorized implementations.  None of the portable C implementations of cryptographic primitives use this; they run in the normal mode of the kernel.

Further, it would be bad if, for example, copyin and copyout had data-dependent timing when transferring secrets through a pipe, or if reading or writing data in swap took time that depends on the bits of the data.

So I think we should do option (1).

Some further discussion:
https://seclists.org/oss-sec/2023/q1/52
https://lkml.org/lkml/2023/1/24/1393

Note: There are also timing side channels based on dynamic voltage and frequency scaling (see, e.g., https://www.hertzbleed.com/hertzbleed.pdf and https://arxiv.org/pdf/2206.13660v1.pdf).  This is not about that -- this is only about the new architectural DIT/DOITM bits.
>How-To-Repeat:
ask CPU designers to codify microarchitectural guarantees like https://xkcd.com/1172/
>Fix:
Yes please!


