Source-Changes-HG archive


[src/trunk]: src/sys/arch/x86/x86 Document SVS. Also, remove an entry from th...



details:   https://anonhg.NetBSD.org/src/rev/19041d756197
branches:  trunk
changeset: 830051:19041d756197
user:      maxv <maxv@NetBSD.org>
date:      Sat Feb 24 10:31:30 2018 +0000

description:
Document SVS. Also, remove an entry from the todo list.

diffstat:

 sys/arch/x86/x86/svs.c |  202 ++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 167 insertions(+), 35 deletions(-)

diffs (248 lines):

diff -r 971523079250 -r 19041d756197 sys/arch/x86/x86/svs.c
--- a/sys/arch/x86/x86/svs.c    Sat Feb 24 07:53:15 2018 +0000
+++ b/sys/arch/x86/x86/svs.c    Sat Feb 24 10:31:30 2018 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: svs.c,v 1.9 2018/02/23 19:39:27 maxv Exp $     */
+/*     $NetBSD: svs.c,v 1.10 2018/02/24 10:31:30 maxv Exp $    */
 
 /*
  * Copyright (c) 2018 The NetBSD Foundation, Inc.
@@ -30,7 +30,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: svs.c,v 1.9 2018/02/23 19:39:27 maxv Exp $");
+__KERNEL_RCSID(0, "$NetBSD: svs.c,v 1.10 2018/02/24 10:31:30 maxv Exp $");
 
 #include "opt_svs.h"
 
@@ -52,48 +52,179 @@
  * Separate Virtual Space
  *
  * A per-cpu L4 page is maintained in ci_svs_updirpa. During each context
- * switch to a user pmap, updirpa is populated with the entries of the new
- * pmap, minus what we don't want to have mapped in userland.
+ * switch to a user pmap, the lower half of updirpa is populated with the
+ * entries containing the userland pages.
+ *
+ * ~~~~~~~~~~ The UTLS Page ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * 
+ * We use a special per-cpu page that we call UTLS, for User Thread Local
+ * Storage. Each CPU has one UTLS page. This page has two VAs:
+ *
+ *  o When the user page tables are loaded in CR3, the VA to access this
+ *    page is &pcpuarea->utls, defined as SVS_UTLS+UTLS_KPDIRPA in assembly.
+ *    This VA is _constant_ across CPUs, but in the user page tables this
+ *    VA points to the physical page of the UTLS that is _local_ to the CPU.
+ *
+ *  o When the kernel page tables are loaded in CR3, the VA to access this
+ *    page is ci->ci_svs_utls.
  *
- * Note on locking/synchronization here:
+ * +----------------------------------------------------------------------+
+ * | CPU0 Local Data                                      (Physical Page) |
+ * | +------------------+                                 +-------------+ |
+ * | | User Page Tables | SVS_UTLS+UTLS_KPDIRPA --------> | cpu0's UTLS | |
+ * | +------------------+                                 +-------------+ |
+ * +-------------------------------------------------------------^--------+
+ *                                                               |
+ *                                                               +----------+
+ *                                                                          |
+ * +----------------------------------------------------------------------+ |
+ * | CPU1 Local Data                                      (Physical Page) | |
+ * | +------------------+                                 +-------------+ | |
+ * | | User Page Tables | SVS_UTLS+UTLS_KPDIRPA --------> | cpu1's UTLS | | |
+ * | +------------------+                                 +-------------+ | |
+ * +-------------------------------------------------------------^--------+ |
+ *                                                               |          |
+ *   +------------------+                 /----------------------+          |
+ *   | Kern Page Tables | ci->ci_svs_utls                                   |
+ *   +------------------+                 \---------------------------------+
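+ *
+ * For instance, reading UTLS_KPDIRPA from C would look like the following
+ * (illustrative casts; the real readers live in assembly):
+ *
+ *     // user page tables loaded: fixed VA, backed by the local CPU's page
+ *     kpdirpa = *(uint64_t *)(SVS_UTLS + UTLS_KPDIRPA);
+ *     // kernel page tables loaded: per-cpu kernel VA
+ *     kpdirpa = *(uint64_t *)(ci->ci_svs_utls + UTLS_KPDIRPA);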
  *
- * (a) Touching ci_svs_updir without holding ci_svs_mtx first is *not*
- *     allowed.
+ * The goal of the UTLS page is to provide an area where we can store whatever
+ * we want, in a way that is accessible whether the kernel or the user page
+ * tables are loaded in CR3.
+ *
+ * We store three 64-bit values in the UTLS page, sketched after this list:
  *
- * (b) pm_kernel_cpus contains the set of CPUs that have the pmap loaded
- *     in their CR3 register. It must *not* be replaced by pm_cpus.
+ *  o UTLS_KPDIRPA: the value we must put in CR3 in order to load the kernel
+ *    page tables.
+ *
+ *  o UTLS_SCRATCH: a dummy place where we temporarily store a value during
+ *    the syscall entry procedure.
+ *
+ *  o UTLS_RSP0: the value we must put in RSP in order to have a stack where
+ *    we can push the register states. This is used only during the syscall
+ *    entry procedure, because there the CPU does not automatically switch
+ *    RSP (it does not use the TSS.rsp0 mechanism described below).
+ *
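+ * As an illustration, the UTLS page can be pictured as the following C
+ * structure. The struct is hypothetical: the real code reads and writes
+ * the three values via hardcoded byte offsets, mostly from assembly, and
+ * the field order shown here just assumes the offsets follow the order of
+ * the list above:
+ *
+ *     struct utls {
+ *             uint64_t kpdirpa;       <- UTLS_KPDIRPA
+ *             uint64_t scratch;       <- UTLS_SCRATCH
+ *             uint64_t rsp0;          <- UTLS_RSP0
+ *     };
+ *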
+ * ~~~~~~~~~~ The Stack Switching Mechanism Without SVS ~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * The kernel stack is per-lwp (pcb_rsp0). When doing a context switch between
+ * two user LWPs, the kernel updates TSS.rsp0 (which is per-cpu) to point to
+ * the stack of the new LWP. Then the execution continues. At some point, the
+ * user LWP we context-switched to will perform a syscall or will receive an
+ * interrupt. There, the CPU will automatically read TSS.rsp0 and use it as a
+ * stack. The kernel then pushes the register states on this stack, and
+ * executes in kernel mode normally.
  *
- * (c) When a context switch on the current CPU is made from a user LWP
- *     towards a kernel LWP, CR3 is not updated. Therefore, the pmap's
- *     pm_kernel_cpus still contains the current CPU. It implies that the
- *     remote CPUs that execute other threads of the user process we just
- *     left will keep synchronizing us against their changes.
+ * TSS.rsp0 is used by the CPU only during ring3->ring0 transitions. Therefore,
+ * when an interrupt is received while we were in kernel mode, the CPU does not
+ * read TSS.rsp0. Instead, it just uses the current stack.
+ *
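+ * In C-like pseudocode, and with illustrative field names (the real update
+ * is done in the machine-dependent context switch path), the non-SVS stack
+ * switch amounts to:
+ *
+ *     // make the next ring3->ring0 transition use the new lwp's stack
+ *     struct pcb *pcb = lwp_getpcb(newlwp);
+ *     ci->ci_tss->tss_rsp0 = pcb->pcb_rsp0;
+ *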
+ * ~~~~~~~~~~ The Stack Switching Mechanism With SVS ~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * In the pcpu_area structure, pointed to by the "pcpuarea" variable, each CPU
+ * has a two-page rsp0 entry (pcpuarea->ent[cid].rsp0). These two pages do
+ * _not_ have associated physical addresses. They are only two VAs.
+ *
+ * The first page is unmapped and acts as a redzone. The second page is
+ * dynamically kentered into the highest page of the real per-lwp kernel stack;
+ * but pay close attention, it is kentered _only_ in the user page tables.
+ * That is to say, the VA of this second page is mapped when the user page
+ * tables are loaded, but not mapped when the kernel page tables are loaded.
+ *
+ * During a context switch, svs_lwp_switch() gets called first. This function
+ * does the kenter job described above, not in the kernel page tables (which
+ * are currently loaded), but in the user page tables (which are not).
+ * 
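+ * In outline, with hypothetical names (ci_svs_rsp0_pte standing for a
+ * pointer to the relevant L1 slot of the user page tables), the kenter
+ * looks like:
+ *
+ *     // top page of the new lwp's kernel stack, as a kernel VA
+ *     va = rounddown(pcb->pcb_rsp0, PAGE_SIZE);
+ *     (void)pmap_extract(pmap_kernel(), va, &pa);
+ *     // map it at pcpuarea->ent[cid].rsp0 (page 1), user tables only
+ *     *ci->ci_svs_rsp0_pte = pa | PG_RW | PG_V | pmap_pg_nx;
+ *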
+ *           VIRTUAL ADDRESSES                     PHYSICAL ADDRESSES
  *
- * List of areas that are removed from userland:
- *     PTE Space         [OK]
- *     Direct Map        [OK]
- *     Remote PCPU Areas [OK]
- *     Kernel Heap       [OK]
- *     Kernel Image      [OK]
+ * +-----------------------------+
+ * |      KERNEL PAGE TABLES     |
+ * |    +-------------------+    |                +-------------------+
+ * |    | pcb_rsp0 (page 0) | ------------------> | pcb_rsp0 (page 0) |
+ * |    +-------------------+    |                +-------------------+
+ * |    | pcb_rsp0 (page 1) | ------------------> | pcb_rsp0 (page 1) |
+ * |    +-------------------+    |                +-------------------+
+ * |    | pcb_rsp0 (page 2) | ------------------> | pcb_rsp0 (page 2) |
+ * |    +-------------------+    |                +-------------------+
+ * |    | pcb_rsp0 (page 3) | ------------------> | pcb_rsp0 (page 3) |
+ * |    +-------------------+    |            +-> +-------------------+
+ * +-----------------------------+            |
+ *                                            |
+ * +---------------------------------------+  |
+ * |           USER PAGE TABLES            |  |
+ * | +----------------------------------+  |  |
+ * | | pcpuarea->ent[cid].rsp0 (page 0) |  |  |
+ * | +----------------------------------+  |  |
+ * | | pcpuarea->ent[cid].rsp0 (page 1) | ----+
+ * | +----------------------------------+  |
+ * +---------------------------------------+
  *
- * TODO:
+ * After svs_lwp_switch() gets called, we set pcpuarea->ent[cid].rsp0 (page 1)
+ * in TSS.rsp0. Later, when returning to userland on the lwp we context-
+ * switched to, we will load the user page tables and execute in userland
+ * normally.
+ *
+ * The next time an interrupt or syscall is received, the CPU will
+ * automatically use TSS.rsp0 as a stack. At this point the user page tables
+ * are loaded, and therefore the VA in TSS.rsp0 is _mapped_.
+ *
+ * As part of the kernel entry procedure, we now switch CR3 to load the kernel
+ * page tables. Here, we are still using the stack pointer we set in TSS.rsp0.
+ *
+ * Remember that only one page of the stack was mapped, and only in the user
+ * page tables. We just switched to the kernel page tables, so we must now
+ * update RSP to point to the real per-lwp kernel stack (pcb_rsp0). And we do
+ * so without touching the stack, since it is now unmapped and touching it
+ * would fault.
  *
- * (a) The NMI stack is not double-entered. Therefore if we ever receive
- *     an NMI and leave it, the content of the stack will be visible to
- *     userland (via Meltdown). Normally we never leave NMIs, unless a
- *     privileged user launched PMCs. That's unlikely to happen, our PMC
- *     support is pretty minimal.
+ * Once RSP is updated, we can continue execution exactly as in the non-SVS
+ * case. We don't need to copy the values the CPU pushed on TSS.rsp0: even
+ * though we updated RSP to a totally different VA, this VA points to the
+ * same physical page as TSS.rsp0. So in the end, the values the CPU pushed
+ * are still there even with the new RSP.
+ *
+ * Thanks to this double-kenter optimization, we don't need to copy the
+ * trapframe during each user<->kernel transition.
+ *
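+ * Putting the pieces together, the syscall entry path in effect performs
+ * the following steps. This is only a C-like rendition of the description
+ * above, not the actual code, which lives in the assembly entry points:
+ *
+ *     utls->scratch = reg;        // park a register (UTLS_SCRATCH)
+ *     rsp = utls->rsp0;           // stack page mapped in user tables only
+ *     push(register state);       // trapframe lands on that page
+ *     cr3 = utls->kpdirpa;        // switch to the kernel page tables
+ *     rsp = <kernel VA of rsp0>;  // same physical page, different VA
+ *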
+ * ~~~~~~~~~~ Notes On Locking And Synchronization ~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ *  o Touching ci_svs_updir without holding ci_svs_mtx first is *not*
+ *    allowed.
+ *
+ *  o pm_kernel_cpus contains the set of CPUs that have the pmap loaded
+ *    in their CR3 register. It must *not* be replaced by pm_cpus.
+ *
+ *  o When a context switch on the current CPU is made from a user LWP
+ *    towards a kernel LWP, CR3 is not updated. Therefore, the pmap's
+ *    pm_kernel_cpus still contains the current CPU. It implies that the
+ *    remote CPUs that execute other threads of the user process we just
+ *    left will keep synchronizing us against their changes.
  *
- * (b) Enable SVS depending on the CPU model, and add a sysctl to disable
- *     it dynamically.
+ * ~~~~~~~~~~ List Of Areas That Are Removed From Userland ~~~~~~~~~~~~~~~~~~~
+ *
+ *  o PTE Space
+ *  o Direct Map
+ *  o Remote PCPU Areas
+ *  o Kernel Heap
+ *  o Kernel Image
+ *
+ * ~~~~~~~~~~ Todo List ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * Ordered from highest priority to lowest:
  *
- * (c) Narrow down the entry points: hide the 'jmp handler' instructions.
- *     This makes sense on GENERIC_KASLR kernels.
+ *  o Handle segment register faults properly.
  *
- * (d) Right now there is only one global LDT, and that's not compatible
- *     with USER_LDT.
+ *  o The NMI stack is not double-entered. Therefore if we ever receive an NMI
+ *    and leave it, the content of the stack will be visible to userland (via
+ *    Meltdown). Normally we never leave NMIs, unless a privileged user
+ *    launched PMCs. That's unlikely to happen, our PMC support is pretty
+ *    minimal, and privileged only.
  *
- * (e) Handle segment register faults properly.
+ *  o Narrow down the entry points: hide the 'jmp handler' instructions. This
+ *    makes sense on GENERIC_KASLR kernels.
+ *
+ *  o Right now there is only one global LDT, and that's not compatible with
+ *    USER_LDT.
  */
 
 bool svs_enabled __read_mostly = false;
@@ -225,7 +356,7 @@
        paddr_t pa;
        vaddr_t va;
 
-       /* Create levels L4, L3 and L2. */
+       /* Create levels L4, L3 and L2 of the UTLS page. */
        pd = svs_tree_add(ci, utlsva);
 
        /* Allocate L1. */
@@ -247,7 +378,8 @@
 
        /*
         * Now, allocate a VA in the kernel map, that points to the UTLS
-        * page.
+        * page. After that, the UTLS page will be accessible in kernel
+        * mode via ci_svs_utls.
         */
        va = uvm_km_alloc(kernel_map, PAGE_SIZE, 0,
            UVM_KMF_VAONLY|UVM_KMF_NOWAIT);
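
For context, the code that follows this hunk in svs.c presumably enters the
UTLS physical page at the freshly allocated VA with the usual pmap calls,
along these lines (a sketch, not part of this diff):

	pmap_kenter_pa(va, pa, VM_PROT_READ|VM_PROT_WRITE, 0);
	pmap_update(pmap_kernel());
	ci->ci_svs_utls = va;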


