.. SPDX-License-Identifier: GPL-2.0

===========================================
Intel(R) Memory Protection Extensions (MPX)
===========================================

Intel(R) MPX Overview
=====================

Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
introduced into Intel Architecture. Intel MPX provides hardware features
that can be used in conjunction with compiler changes to check memory
references whose normal compile-time intentions are usurped at runtime
by a buffer overflow or underflow.

You can tell if your CPU supports MPX by looking in /proc/cpuinfo::

        grep ' mpx ' /proc/cpuinfo
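
As a rough illustration of the same check from C, the sketch below looks
for the exact ``mpx`` token in a flags string. It assumes the caller has
already read the ``flags`` line from /proc/cpuinfo; ``has_mpx_flag()`` is a
hypothetical helper name, not an existing interface. Unlike the grep
pattern, it also matches when ``mpx`` is the first or last flag on the line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the whitespace-separated flag list contains the exact
 * token "mpx". flags_line is assumed to be the flag list taken from the
 * "flags" line of /proc/cpuinfo (hypothetical helper, for illustration).
 */
static bool has_mpx_flag(const char *flags_line)
{
        const char *p = flags_line;

        while (*p) {
                size_t len;

                /* Skip separators, then measure the next token. */
                p += strspn(p, " \t\n");
                len = strcspn(p, " \t\n");

                if (len == 3 && strncmp(p, "mpx", 3) == 0)
                        return true;
                p += len;
        }
        return false;
}
```
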

For more information, please refer to Intel(R) Architecture Instruction
Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
Extensions.

Note: As of December 2014, no hardware with MPX is available, but it is
possible to use SDE (Intel(R) Software Development Emulator) instead, which
can be downloaded from
http://software.intel.com/en-us/articles/intel-software-development-emulator


How to get the advantage of MPX
===============================

For MPX to work, changes are required in the kernel, binutils and compiler.
No source changes are required for applications, just a recompile.

There are a lot of moving parts that all have to work together. The
following is how we expect the compiler, application and kernel to
cooperate.

1) Application developer compiles with -fmpx. The compiler will add the
   instrumentation as well as some setup code called early after the app
   starts. New instruction prefixes are no-ops for old CPUs.
2) That setup code allocates (virtual) space for the "bounds directory",
   points the "bndcfgu" register to the directory (must also set the valid
   bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT))
   that the app will be using MPX.  The app must be careful not to access
   the bounds tables between the time when it populates "bndcfgu" and
   when it calls the prctl().  This might be hard to guarantee if the app
   is compiled with MPX.  You can add "__attribute__((bnd_legacy))" to
   the function to disable MPX instrumentation to help guarantee this.
   Also be careful not to call out to any other code which might be
   MPX-instrumented.
3) The kernel detects that the CPU has MPX, allows the new prctl() to
   succeed, and notes the location of the bounds directory. Userspace is
   expected to keep the bounds directory at that location. We note it
   instead of reading it each time because the 'xsave' operation needed
   to access the bounds directory register is an expensive operation.
4) If the application needs to spill bounds out of the 4 registers, it
   issues a bndstx instruction. Since the bounds directory is empty at
   this point, a bounds fault (#BR) is raised, the kernel allocates a
   bounds table (in the user address space) and makes the relevant entry
   in the bounds directory point to the new table.
5) If the application violates the bounds specified in the bounds registers,
   a separate kind of #BR is raised which will deliver a signal with
   information about the violation in the 'struct siginfo'.
6) Whenever memory is freed, we know that it can no longer contain valid
   pointers, and we attempt to free the associated space in the bounds
   tables. If an entire table becomes unused, we will attempt to free
   the table and remove the entry in the directory.

To summarize, there are essentially three things interacting here:

GCC with -fmpx:
 * enables annotation of code with MPX instructions and prefixes
 * inserts code early in the application to call into the "gcc runtime"
GCC MPX Runtime:
 * checks for hardware MPX support in cpuid leaf
 * allocates virtual space for the bounds directory (malloc() essentially)
 * points the hardware BNDCFGU register at the directory
 * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
   start managing the bounds directories
Kernel MPX Code:
 * checks for hardware MPX support in cpuid leaf
 * handles #BR exceptions and sends SIGSEGV to the app when it violates
   bounds, like during a buffer overflow
 * when bounds are spilled into an unallocated bounds table, the kernel
   notices in the #BR exception, allocates the virtual space, then
   updates the bounds directory to point to the new table. It keeps
   special track of the memory with a VM_MPX flag.
 * frees unused bounds tables at the time that the memory they described
   is unmapped


How does MPX kernel code work
=============================

Handling #BR faults caused by MPX
---------------------------------

When MPX is enabled, there are 2 new situations that can generate
#BR faults.

  * new bounds tables (BT) need to be allocated to save bounds.
  * bounds violation caused by MPX instructions.

We hook the #BR handler to handle these two new situations.

On-demand kernel allocation of bounds tables
--------------------------------------------

MPX only has 4 hardware registers for storing bounds information. If
MPX-enabled code needs more than these 4 registers, it needs to spill
them somewhere. MPX has two special instructions for this (bndstx and
bndldx), which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

MPX adds new causes for #BR (bound range exceeded) exceptions. They are
conceptually similar to a page fault and are raised by the MPX hardware
either on a bounds violation or when the bounds tables are not present.
The kernel handles #BR exceptions for not-present tables by carving the
space out of the normal process address space and then pointing the
bounds directory at it.

The tables need to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register points to
memory. Any direct kernel involvement (like a syscall) to access the
tables would obviously destroy performance.
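
To make the directory/table relationship concrete, here is a minimal
sketch of how the address of a stored pointer could select a bounds
directory entry and a bounds table entry. The bit widths are an assumption
taken from the 64-bit layout in the Intel architecture documentation, not
from this file, and the macro and function names are purely illustrative.

```c
#include <stdint.h>

/*
 * Assumed 64-bit MPX layout (not stated in this document): 48-bit virtual
 * addresses, a 2^28-entry bounds directory with 8-byte entries, and
 * 2^17-entry bounds tables with 32-byte entries. Names are illustrative.
 */
#define BD_INDEX_BITS   28
#define BT_INDEX_BITS   17
#define PTR_SLOT_BITS    3      /* one bounds entry per 8-byte pointer slot */

/* Index of the bounds directory entry covering the pointer stored at addr. */
static uint64_t bd_index(uint64_t addr)
{
        return (addr >> (BT_INDEX_BITS + PTR_SLOT_BITS)) &
               ((UINT64_C(1) << BD_INDEX_BITS) - 1);
}

/* Index of the entry inside the bounds table selected by bd_index(). */
static uint64_t bt_index(uint64_t addr)
{
        return (addr >> PTR_SLOT_BITS) &
               ((UINT64_C(1) << BT_INDEX_BITS) - 1);
}
```

With these assumed sizes the directory occupies 2^28 * 8 bytes = 2GB and
each 4MB table describes 1MB of address space, which is consistent with
the 2GB and 4x figures quoted in the Q&A below.
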

Why not do this in userspace? MPX does not strictly require anything in
the kernel. It can theoretically be done completely from userspace. Here
are a few ways this could be done. We don't think any of them are practical
in the real world, but here they are.

:Q: Can virtual space simply be reserved for the bounds tables so that we
    never have to allocate them?
:A: An MPX-enabled application may create a lot of bounds tables in its
    process address space to save bounds information. These tables can take
    up huge swaths of memory (as much as 80% of the memory on the system)
    even if we clean them up aggressively. In the worst-case scenario, the
    tables can be 4x the size of the data structure being tracked. IOW, a
    1-page structure can require 4 bounds-table pages. An X-GB virtual
    area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
    If we were to preallocate them for the 128TB of user virtual address
    space, we would need to reserve 512TB+2GB, which is larger than the
    entire virtual address space today. This means they cannot be reserved
    ahead of time. Also, a single process's pre-populated bounds directory
    consumes 2GB of virtual *AND* physical memory. IOW, it's completely
    infeasible to prepopulate bounds directories.

:Q: Can we preallocate bounds table space at the same time memory is
    allocated which might contain pointers that might eventually need
    bounds tables?
:A: This would work if we could hook the site of each and every memory
    allocation syscall. This can be done for small, constrained applications.
    But, it isn't practical at a larger scale since a given app has no
    way of controlling how all the parts of the app might allocate memory
    (think libraries). The kernel is really the only place to intercept
    these calls.

:Q: Could a bounds fault be handed to userspace and the tables allocated
    there in a signal handler instead of in the kernel?
:A: mmap() is not on the list of async-signal-safe functions, and even
    if mmap() did work, it would still require locking or nasty tricks to
    keep track of the allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
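
The arithmetic behind the first answer above can be checked with a few
lines of C. This is only a sketch of the worst-case figures quoted in the
text (4x table overhead plus a fixed 2GB directory); the helper name and
the ``GiB``/``TiB`` macros are made up for the example.

```c
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)
#define TiB (UINT64_C(1) << 40)

/*
 * Worst case from the text: 4 bytes of bounds table for every byte of
 * tracked memory, plus a fixed 2GB bounds directory.
 */
static uint64_t mpx_worst_case_reservation(uint64_t tracked_bytes)
{
        return 4 * tracked_bytes + 2 * GiB;
}
```

For the 128TB user address space this yields 512TB+2GB, which exceeds the
256TB reachable with 48-bit virtual addresses.
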

Decoding MPX instructions
-------------------------

If a #BR is generated due to a bounds violation caused by MPX, we need
to decode the MPX instruction to get the violation address and store
that address in the extended struct siginfo.

The _sigfault field of struct siginfo is extended as follows::

  /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
  struct {
          void __user *_addr; /* faulting insn/memory ref. */
  #ifdef __ARCH_SI_TRAPNO
          int _trapno;    /* TRAP # which caused the signal */
  #endif
          short _addr_lsb; /* LSB of the reported address */
          struct {
                  void __user *_lower;
                  void __user *_upper;
          } _addr_bnd;
  } _sigfault;

The '_addr' field holds the violation address, and the new '_addr_bnd'
field holds the lower and upper bounds in effect when the #BR was raised.

Glibc will also be updated to support this new siginfo, so users
can get the violation address and bounds when bounds violations occur.
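
A userspace consumer of this information might look roughly like the
sketch below. It assumes a C library that exposes the new fields through
the ``si_lower`` and ``si_upper`` accessors and reports bounds violations
with ``si_code`` set to ``SEGV_BNDERR``; ``bounds_fault_side()`` is a
hypothetical helper, intended to be called from a SIGSEGV handler installed
with SA_SIGINFO.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdint.h>

#ifndef SEGV_BNDERR
#define SEGV_BNDERR 3   /* Linux si_code for a failed bounds check */
#endif

/*
 * Classify a delivered signal (hypothetical helper):
 *   -1  violation address is below the lower bound
 *   +1  violation address is above the upper bound
 *    0  not a bounds violation, or the address lies within the bounds
 */
static int bounds_fault_side(const siginfo_t *si)
{
        uintptr_t addr, lower, upper;

        if (si->si_signo != SIGSEGV || si->si_code != SEGV_BNDERR)
                return 0;

        addr  = (uintptr_t)si->si_addr;
        lower = (uintptr_t)si->si_lower;  /* assumed glibc accessor */
        upper = (uintptr_t)si->si_upper;  /* assumed glibc accessor */

        if (addr < lower)
                return -1;
        if (addr > upper)
                return 1;
        return 0;
}
```
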

Cleanup unused bounds tables
----------------------------

When a BNDSTX instruction attempts to save bounds to a bounds directory
entry marked as invalid, a #BR is generated. This is an indication that
no bounds table exists for this entry. In this case the fault handler
will allocate a new bounds table on demand.

Since the kernel allocated those tables on-demand without userspace
knowledge, it is also responsible for freeing them when the associated
mappings go away.

The solution is to hook do_munmap() and check whether the process is
MPX-enabled. If it is, the bounds tables covering the virtual address
region being unmapped are freed as well.
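
The core range test behind such a hook can be sketched as follows. This is
not the kernel's implementation: it assumes the 64-bit layout in which one
bounds table describes 1MB of address space, uses a hypothetical helper
name, and ignores the question of whether other mappings still use the
same table, which a real implementation would have to consider.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 64-bit layout: each bounds table describes 1MB of address space. */
#define BT_COVERAGE (UINT64_C(1) << 20)

/*
 * Return true if unmapping [start, start + len) removes every address
 * described by the bounds table behind directory entry bd_idx, so that
 * the whole table is a candidate for freeing. Hypothetical helper.
 */
static bool unmap_covers_whole_table(uint64_t start, uint64_t len,
                                     uint64_t bd_idx)
{
        uint64_t bt_start = bd_idx * BT_COVERAGE;
        uint64_t bt_end = bt_start + BT_COVERAGE;       /* exclusive */

        return start <= bt_start && start + len >= bt_end;
}
```

When the unmapped range covers only part of a table's region, only the
corresponding entries inside the table can be released, not the table
itself.
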

Adding new prctl commands
-------------------------

Two new prctl commands are added to enable and disable management of MPX
bounds tables in the kernel::

  #define PR_MPX_ENABLE_MANAGEMENT        43
  #define PR_MPX_DISABLE_MANAGEMENT       44

The runtime library in userspace is responsible for allocating the bounds
directory, so the kernel has to use the XSAVE instruction to get the base
of the bounds directory from the BNDCFGU register.

But XSAVE is expected to be very expensive. As a performance optimization,
the kernel gets the base of the bounds directory once, while executing the
PR_MPX_ENABLE_MANAGEMENT command, and saves it in struct mm_struct for
future use.
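
From the runtime library's side, issuing these commands could look like
the sketch below. The wrapper names are hypothetical, the command values
are the ones listed above, and passing 0 for the unused prctl() arguments
is an assumption. The enable call is only meaningful after BNDCFGU has
been pointed at a valid bounds directory, as described earlier, and it is
expected to fail on kernels or CPUs without MPX support.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT        43
#define PR_MPX_DISABLE_MANAGEMENT       44
#endif

/*
 * Ask the kernel to start managing bounds tables for this process.
 * Returns 0 on success or a negative errno value (hypothetical wrapper).
 */
static int mpx_enable_management(void)
{
        if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0UL, 0UL, 0UL, 0UL) == -1)
                return -errno;
        return 0;
}

/* Ask the kernel to stop managing bounds tables (hypothetical wrapper). */
static int mpx_disable_management(void)
{
        if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0UL, 0UL, 0UL, 0UL) == -1)
                return -errno;
        return 0;
}
```
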


Special rules
=============

1) If userspace is requesting help from the kernel to do the management
of bounds tables, it may not create or modify entries in the bounds directory.

Certainly users can allocate bounds tables and forcibly point the bounds
directory at them through the XSAVE instruction, and then set the valid bit
of the bounds entry to make the entry valid.  But the kernel will decline
to assist in managing these tables.

2) Userspace may not take multiple bounds directory entries and point
them at the same bounds table.

This is allowed architecturally.  For more information, see "Intel(R)
Architecture Instruction Set Extensions Programming Reference" (9.3.4).

However, if users did this, the kernel might be fooled into unmapping an
in-use bounds table since it does not recognize sharing.
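
A debugging aid for rule 2 could scan a copy of the bounds directory for
entries that share a table, as in the sketch below. It assumes that bit 0
of an entry is the valid bit and that all other low bits are zero, so two
valid entries referring to the same table compare equal; the helper name
is hypothetical and the quadratic scan is only suitable for small slices
of the directory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BD_ENTRY_VALID  UINT64_C(1)     /* assumed: bit 0 marks a valid entry */

/*
 * Return true if two valid entries in bd[0..nr_entries) point at the same
 * bounds table. Hypothetical helper for illustration only.
 */
static bool bd_has_shared_table(const uint64_t *bd, size_t nr_entries)
{
        for (size_t i = 0; i < nr_entries; i++) {
                /* Invalid entries do not refer to any table. */
                if (!(bd[i] & BD_ENTRY_VALID))
                        continue;
                for (size_t j = i + 1; j < nr_entries; j++) {
                        if (bd[j] == bd[i])
                                return true;
                }
        }
        return false;
}
```
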