Documentation/security/self-protection.rst

   1 ======================
   2 Kernel Self-Protection
   3 ======================
   4
   5 Kernel self-protection is the design and implementation of systems and
   6 structures within the Linux kernel to protect against security flaws in
   7 the kernel itself. This covers a wide range of issues, including removing
   8 entire classes of bugs, blocking security flaw exploitation methods,
   9 and actively detecting attack attempts. Not all topics are explored in
  10 this document, but it should serve as a reasonable starting point and
  11 answer any frequently asked questions. (Patches welcome, of course!)
  12
  13 In the worst-case scenario, we assume an unprivileged local attacker
  14 has arbitrary read and write access to the kernel's memory. In many
  15 cases, bugs being exploited will not provide this level of access,
  16 but with systems in place that defend against the worst case we'll
  17 cover the more limited cases as well. A higher bar, and one that should
  18 still be kept in mind, is protecting the kernel against a *privileged*
  19 local attacker, since the root user has access to a vastly increased
  20 attack surface. (Especially when they have the ability to load arbitrary
  21 kernel modules.)
  22
  23 The goals for successful self-protection systems would be that they
  24 are effective, on by default, require no opt-in by developers, have no
  25 performance impact, do not impede kernel debugging, and have tests. It
  26 is uncommon that all these goals can be met, but it is worth explicitly
  27 mentioning them, since these aspects need to be explored, dealt with,
  28 and/or accepted.
  29
  30
  31 Attack Surface Reduction
  32 ========================
  33
  34 The most fundamental defense against security exploits is to reduce the
  35 areas of the kernel that can be used to redirect execution. This ranges
  36 from limiting the exposed APIs available to userspace, to making in-kernel
  37 APIs hard to use incorrectly, to minimizing the areas of writable kernel
  38 memory.
  39
  40 Strict kernel memory permissions
  41 --------------------------------
  42
  43 When all of kernel memory is writable, it becomes trivial for attacks
  44 to redirect execution flow. To reduce the availability of these targets
  45 the kernel needs to protect its memory with a tight set of permissions.
  46
  47 Executable code and read-only data must not be writable
  48 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  49
  50 Any areas of the kernel with executable memory must not be writable.
  51 While this obviously includes the kernel text itself, we must consider
  52 all additional places too: kernel modules, JIT memory, etc. (There are
  53 temporary exceptions to this rule to support things like instruction
  54 alternatives, breakpoints, kprobes, etc. If these must exist in a
  55 kernel, they are implemented in a way where the memory is temporarily
  56 made writable during the update, and then returned to the original
  57 permissions.)
  58
  59 In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
  60 ``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
  61 writable, data is not executable, and read-only data is neither writable
  62 nor executable.
  63
  64 Most architectures enable these options by default and do not make them
  65 user selectable. For architectures like arm that wish to make them
  66 selectable, the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX``
  67 to enable a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT``
  68 determines the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.
  69
  70 Function pointers and sensitive variables must not be writable
  71 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  72
  73 Vast areas of kernel memory contain function pointers that are looked
  74 up by the kernel and used to continue execution (e.g. descriptor/vector
  75 tables, file/network/etc operation structures, etc). The number of these
  76 variables must be reduced to an absolute minimum.
  77
  78 Many such variables can be made read-only by setting them "const"
  79 so that they live in the .rodata section instead of the .data section
  80 of the kernel, gaining the protection of the kernel's strict memory
  81 permissions as described above.
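
As a minimal userspace sketch of the "const" approach (the structure and
function names below are hypothetical and not taken from any kernel
subsystem), an operations table declared ``static const`` becomes a
candidate for placement in read-only data:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modeled on kernel *_operations structs. */
struct demo_ops {
	int (*open)(int id);
	int (*release)(int id);
};

static int demo_open(int id)    { return id + 1; }
static int demo_release(int id) { return id - 1; }

/*
 * "static const" lets the toolchain place the table in a read-only
 * section, so the function pointers cannot be overwritten at runtime
 * without first changing the page permissions.
 */
static const struct demo_ops demo_fops = {
	.open    = demo_open,
	.release = demo_release,
};

const struct demo_ops *demo_get_ops(void)
{
	return &demo_fops;
}

/* Callers only ever see a pointer-to-const, so they cannot modify the table. */
int demo_call_open(const struct demo_ops *ops, int id)
{
	assert(ops != NULL);
	return ops->open ? ops->open(id) : -1;
}
```

Handing out only pointer-to-const also lets the compiler reject accidental
writes at build time, before the memory permissions ever come into play.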
  82
  83 Variables that are initialized once at ``__init`` time and never
  84 written again can be marked with the ``__ro_after_init``
  85 attribute.
  86
  87 What remains are variables that are updated rarely (e.g. GDT). These
  88 will need another infrastructure (similar to the temporary exceptions
  89 made to kernel code mentioned above) that allows them to spend the rest
  90 of their lifetime read-only. (For example, when being updated, only the
  91 CPU thread performing the update would be given uninterruptible write
  92 access to the memory.)
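
The idea of a variable that spends almost all of its lifetime read-only can
be approximated in userspace with ``mprotect()``. This is only an analogy
under stated assumptions: the ``rare_*`` names are hypothetical, and unlike
the per-CPU write access described above, the write window here is visible
to every thread in the process.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical "write rarely" variable living alone in its own page. */
static long *rare_value;
static size_t rare_len;

/* Allocate one page, store the initial value, then drop write permission. */
int rare_init(long initial)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return -1;
	rare_len = (size_t)page;
	p = mmap(NULL, rare_len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	rare_value = p;
	*rare_value = initial;
	return mprotect(rare_value, rare_len, PROT_READ);
}

/* Open a short write window, update, and restore read-only permissions. */
int rare_update(long v)
{
	assert(rare_value != NULL);
	if (mprotect(rare_value, rare_len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	*rare_value = v;
	return mprotect(rare_value, rare_len, PROT_READ);
}

long rare_read(void)
{
	assert(rare_value != NULL);
	return *rare_value;
}
```

Any stray write to ``*rare_value`` outside ``rare_update()`` faults instead
of silently succeeding, which is the property the text is after.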
  93
  94 Segregation of kernel memory from userspace memory
  95 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  96
  97 The kernel must never execute userspace memory. The kernel must also never
  98 access userspace memory without explicit expectation to do so. These
  99 rules can be enforced either by support of hardware-based restrictions
 100 (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
 101 By blocking userspace memory in this way, execution and data parsing
 102 cannot be passed to trivially-controlled userspace memory, forcing
 103 attacks to operate entirely in kernel memory.
 104
 105 Reduced access to syscalls
 106 --------------------------
 107
 108 One trivial way to eliminate many syscalls for 64-bit systems is building
 109 without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.
 110
 111 The "seccomp" system provides an opt-in feature made available to
 112 userspace, which provides a way to reduce the number of kernel entry
 113 points available to a running process. This limits the breadth of kernel
 114 code that can be reached, possibly reducing the availability of a given
 115 bug to an attack.
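
To illustrate the effect, the following Linux-only sketch puts a child
process into seccomp strict mode, the most restrictive setting, and
observes that a syscall outside the small permitted set terminates it. The
function name is hypothetical, and real users would more commonly install a
filter that permits the syscalls they need.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child in strict mode was killed for calling getpid(),
 * 0 if strict mode could not be enabled, and -1 on any other outcome.
 */
int strict_mode_blocks_getpid(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: after this call only read, write, exit and sigreturn remain. */
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
			_exit(2);
		syscall(SYS_getpid);	/* not on the allowed list: SIGKILL */
		syscall(SYS_exit, 1);	/* only reached if strict mode did not act */
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
		return 1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
		return 0;
	return -1;
}
```

The raw ``SYS_exit`` call is deliberate: the C library's ``_exit()`` may use
``exit_group``, which strict mode does not permit.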
 116
 117 An area of improvement would be creating viable ways to keep access to
 118 things like compat, user namespaces, BPF creation, and perf limited only
 119 to trusted processes. This would keep the scope of kernel entry points
 120 restricted to the more regular set of interfaces normally available to
 121 unprivileged userspace.
 122
 123 Restricting access to kernel modules
 124 ------------------------------------
 125
 126 The kernel should never allow an unprivileged user the ability to
 127 load specific kernel modules, since that would provide a facility to
 128 unexpectedly extend the available attack surface. (The on-demand loading
 129 of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
 130 considered "expected" here, though additional consideration should be
 131 given even to these.) For example, loading a filesystem module via an
 132 unprivileged socket API is nonsense: only the root or physically local
 133 user should trigger filesystem module loading. (And even this can be up
 134 for debate in some scenarios.)
 135
 136 To protect against even privileged users, systems may need to either
 137 disable module loading entirely (e.g. monolithic kernel builds or
 138 modules_disabled sysctl), or provide signed modules (e.g.
 139 ``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
 140 root load arbitrary kernel code via the module loader interface.
 141
 142
 143 Memory integrity
 144 ================
 145
 146 There are many memory structures in the kernel that are regularly abused
 147 to gain execution control during an attack. By far the most commonly
 148 understood is that of the stack buffer overflow in which the return
 149 address stored on the stack is overwritten. Many other examples of this
 150 kind of attack exist, and protections exist to defend against them.
 151
 152 Stack buffer overflow
 153 ---------------------
 154
 155 The classic stack buffer overflow involves writing past the expected end
 156 of a variable stored on the stack, ultimately writing a controlled value
 157 to the stack frame's stored return address. The most widely used defense
 158 is the presence of a stack canary between the stack variables and the
 159 return address (``CONFIG_CC_STACKPROTECTOR``), which is verified just before
 160 the function returns. Other defenses include things like shadow stacks.
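
The canary check can be modeled without touching a real stack. The sketch
below is an assumption-laden toy: ``struct frame_model`` and its helpers are
hypothetical, and the "frame" is just a byte array with a buffer followed by
the canary, so the overflow is simulated rather than real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 8

/* Hypothetical model of a stack frame: local buffer, then canary. */
struct frame_model {
	unsigned char bytes[BUF_SIZE + sizeof(uint64_t)];
};

/* Function entry: place the secret canary directly after the buffer. */
void frame_enter(struct frame_model *f, uint64_t canary)
{
	assert(f != NULL);
	memset(f->bytes, 0, sizeof(f->bytes));
	memcpy(f->bytes + BUF_SIZE, &canary, sizeof(canary));
}

/* Copy with no check against BUF_SIZE, standing in for the buggy code path. */
void frame_buggy_copy(struct frame_model *f, const void *src, size_t len)
{
	assert(f != NULL && len <= sizeof(f->bytes));
	memcpy(f->bytes, src, len);
}

/* Function exit: returns 1 if the canary is intact, 0 if it was clobbered. */
int frame_leave_ok(const struct frame_model *f, uint64_t canary)
{
	uint64_t found;

	assert(f != NULL);
	memcpy(&found, f->bytes + BUF_SIZE, sizeof(found));
	return found == canary;
}
```

A linear overflow cannot reach anything stored beyond the canary without
first overwriting the canary itself, which is what the exit check detects.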
 161
 162 Stack depth overflow
 163 --------------------
 164
 165 A less well understood attack is using a bug that triggers the
 166 kernel to consume stack memory with deep function calls or large stack
 167 allocations. With this attack it is possible to write beyond the end of
 168 the kernel's preallocated stack space and into sensitive structures. Two
 169 important changes need to be made for better protections: moving the
 170 sensitive thread_info structure elsewhere, and adding a faulting memory
 171 hole at the bottom of the stack to catch these overflows.
 172
 173 Heap memory integrity
 174 ---------------------
 175
 176 The structures used to track heap free lists can be sanity-checked during
 177 allocation and freeing to make sure they aren't being used to manipulate
 178 other memory areas.
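
One form such a sanity check can take is verifying, before unlinking an
entry from a doubly linked free list, that its neighbours still point back
at it. The following is a simplified sketch with hypothetical names, not the
kernel's allocator code:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked free-list node. */
struct node {
	struct node *next, *prev;
};

void list_init(struct node *head)
{
	assert(head != NULL);
	head->next = head->prev = head;
}

void list_add_head(struct node *head, struct node *n)
{
	assert(head != NULL && n != NULL);
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/*
 * Unlink with a sanity check: the neighbours must still point back at
 * the entry. Returns 0 on success, -1 if the links look corrupted, in
 * which case nothing is written.
 */
int list_del_checked(struct node *n)
{
	assert(n != NULL);
	if (n->prev == NULL || n->next == NULL)
		return -1;
	if (n->prev->next != n || n->next->prev != n)
		return -1;
	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->next = n->prev = NULL;
	return 0;
}
```

Without the check, an attacker who controls ``n->next`` and ``n->prev`` can
turn the two unlink stores into writes to memory of their choosing.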
 179
 180 Counter integrity
 181 -----------------
 182
 183 Many places in the kernel use atomic counters to track object references
 184 or perform similar lifetime management. When these counters can be made
 185 to wrap (over or under) this traditionally exposes a use-after-free
 186 flaw. By trapping atomic wrapping, this class of bug vanishes.
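
One way to keep a reference counter from wrapping is to saturate it. The
sketch below uses a hypothetical ``struct sat_ref`` and is deliberately not
atomic, to keep the logic visible; a real implementation would need atomic
operations and would likely also report the event.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical counter type for illustration only. */
struct sat_ref {
	unsigned int count;
};

#define SAT_REF_SATURATED UINT_MAX

/* Increment, but pin the counter at the maximum instead of wrapping to 0. */
void sat_ref_get(struct sat_ref *r)
{
	assert(r != NULL);
	if (r->count != SAT_REF_SATURATED)
		r->count++;
}

/*
 * Decrement. Returns 1 when the last reference was dropped and the
 * object may be freed. A saturated or already-zero counter is left
 * alone, so the object is leaked rather than freed while in use.
 */
int sat_ref_put(struct sat_ref *r)
{
	assert(r != NULL);
	if (r->count == SAT_REF_SATURATED || r->count == 0)
		return 0;
	return --r->count == 0;
}
```

Saturation trades a possible memory leak for the removal of the
use-after-free, which is usually the safer failure.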
 187
 188 Size calculation overflow detection
 189 -----------------------------------
 190
 191 Similar to counter overflow, integer overflows (usually size calculations)
 192 need to be detected at runtime to kill this class of bug, which
 193 traditionally leads to being able to write past the end of kernel buffers.
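
A checked multiplication for allocation sizes might look like the following
sketch. The helper names are hypothetical, and ``malloc()`` stands in for a
kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Store n * size in *out; returns 1 on success, 0 if the product would wrap. */
int size_mul_checked(size_t n, size_t size, size_t *out)
{
	assert(out != NULL);
	if (size != 0 && n > SIZE_MAX / size)
		return 0;
	*out = n * size;
	return 1;
}

/* Array allocation that refuses wrapped sizes instead of under-allocating. */
void *alloc_array_checked(size_t n, size_t size)
{
	size_t bytes;

	if (!size_mul_checked(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

The failure mode being avoided is a wrapped product yielding a small
allocation that later code indexes as if it held all ``n`` elements.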
 194
 195
 196 Probabilistic defenses
 197 ======================
 198
 199 While many protections can be considered deterministic (e.g. read-only
 200 memory cannot be written to), some protections provide only statistical
 201 defense, in that an attack must gather enough information about a
 202 running system to overcome the defense. While not perfect, these do
 203 provide meaningful defenses.
 204
 205 Canaries, blinding, and other secrets
 206 -------------------------------------
 207
 208 It should be noted that things like the stack canary discussed earlier
 209 are technically statistical defenses, since they rely on a secret value,
 210 and such values may become discoverable through an information exposure
 211 flaw.
 212
 213 Blinding literal values for things like JITs, where the executable
 214 contents may be partially under the control of userspace, needs a similar
 215 secret value.
 216
 217 It is critical that the secret values used are separate (e.g.
 218 different canary per stack) and high entropy (e.g. is the RNG actually
 219 working?) in order to maximize their effectiveness.
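
Constant blinding can be sketched as splitting an immediate into two values
that are recombined at run time. The types and function names here are
hypothetical, and the key is passed in as a parameter where a real JIT would
draw it from the kernel's random number generator.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of emitted operands standing in for JIT output. */
struct blinded_imm {
	uint32_t masked;	/* emitted in place of the original immediate */
	uint32_t key;		/* per-use random secret, emitted but not attacker-chosen */
};

/* Split an immediate so the userspace-chosen bits never appear verbatim. */
struct blinded_imm blind_imm(uint32_t imm, uint32_t key)
{
	struct blinded_imm b = { .masked = imm ^ key, .key = key };

	return b;
}

/* What the emitted code computes at run time to recover the operand. */
uint32_t unblind_imm(struct blinded_imm b)
{
	return b.masked ^ b.key;
}
```

Because the attacker cannot predict ``key``, they cannot choose an immediate
whose emitted bytes form a useful instruction sequence.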
 220
 221 Kernel Address Space Layout Randomization (KASLR)
 222 -------------------------------------------------
 223
 224 Since the location of kernel memory is almost always instrumental in
 225 mounting a successful attack, making the location non-deterministic
 226 raises the difficulty of an exploit. (Note that this in turn makes
 227 the value of information exposures higher, since they may be used to
 228 discover desired memory locations.)
 229
 230 Text and module base
 231 ~~~~~~~~~~~~~~~~~~~~
 232
 233 By relocating the physical and virtual base address of the kernel at
 234 boot-time (``CONFIG_RANDOMIZE_BASE``), attacks that need to know the
 235 location of kernel code will be frustrated. Additionally, offsetting the
 236 module loading base address means that even systems that load the same
 237 set of modules in the same order every boot will not share a common base
 238 address with the rest of the kernel text.
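
The arithmetic behind choosing a randomized base can be sketched as picking
one of the aligned slots in which the image still fits. This is not how any
particular architecture implements it: the function name, its parameters,
and the source of ``rnd`` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a load address in [min, max - image_size] that lies a multiple of
 * 'align' above 'min'. 'rnd' is assumed to come from a boot-time entropy
 * source. Returns 0 if the image does not fit.
 */
uint64_t pick_random_base(uint64_t min, uint64_t max, uint64_t image_size,
			  uint64_t align, uint64_t rnd)
{
	uint64_t slots;

	assert(align != 0);
	if (max < min || max - min < image_size)
		return 0;
	/* Number of aligned positions at which the image still fits. */
	slots = (max - min - image_size) / align + 1;
	return min + (rnd % slots) * align;
}
```

The number of slots bounds the entropy: a large alignment or a small window
leaves only a handful of possible bases for an attacker to guess.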
 239
 240 Stack base
 241 ~~~~~~~~~~
 242
 243 If the base address of the kernel stack is not the same between processes,
 244 or even not the same between syscalls, targets on or beyond the stack
 245 become more difficult to locate.
 246
 247 Dynamic memory base
 248 ~~~~~~~~~~~~~~~~~~~
 249
 250 Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
 251 being relatively deterministic in layout due to the order of early-boot
 252 initializations. If the base address of these areas is not the same
 253 between boots, targeting them is frustrated, requiring an information
 254 exposure specific to the region.
 255
 256 Structure layout
 257 ~~~~~~~~~~~~~~~~
 258
 259 By performing a per-build randomization of the layout of sensitive
 260 structures, attacks must either be tuned to known kernel builds or expose
 261 enough kernel memory to determine structure layouts before manipulating
 262 them.
 263
 264
 265 Preventing Information Exposures
 266 ================================
 267
 268 Since the locations of sensitive structures are the primary target for
 269 attacks, it is important to defend against exposure of both kernel memory
 270 addresses and kernel memory contents (since they may contain kernel
 271 addresses or other sensitive things like canary values).
 272
 273 Kernel addresses
 274 ----------------
 275
 276 Printing kernel addresses to userspace leaks sensitive information about
 277 the kernel memory layout. Care should be exercised when using any printk
 278 specifier that prints the raw address, currently %px, %p[ad] (and %p[sSb]
 279 in certain circumstances [*]). Any file written to using one of these
 280 specifiers should be readable only by privileged processes.
 281
 282 Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
 283 addresses printed with the specifier %p are hashed before printing.
 284
 285 [*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
 286 printed. If KALLSYMS is not enabled the raw address is printed.
 287
 288 Unique identifiers
 289 ------------------
 290
 291 Kernel memory addresses must never be used as identifiers exposed to
 292 userspace. Instead, use an atomic counter, an idr, or similar unique
 293 identifier.
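
A counter-based identifier might look like the following userspace sketch,
which uses C11 atomics in place of the kernel's own primitives. The
``demo_obj`` type and function name are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object that needs an identifier visible to userspace. */
struct demo_obj {
	uint64_t id;
};

static atomic_uint_fast64_t next_id = 1;

/* Hand out a monotonically increasing ID instead of exposing &obj. */
void demo_obj_init(struct demo_obj *obj)
{
	assert(obj != NULL);
	obj->id = atomic_fetch_add(&next_id, 1);
}
```

The identifier stays unique for the lifetime of the counter and carries no
information about where the object lives in memory.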
 294
 295 Memory initialization
 296 ---------------------
 297
 298 Memory copied to userspace must always be fully initialized. If it is
 299 not explicitly cleared with memset(), this will require changes to the
 300 compiler to make sure structure holes are cleared.
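
The explicit memset() pattern can be sketched as follows. The structure is
hypothetical, and ``memcpy()`` into ``dst`` stands in for a copy to
userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reply structure; most ABIs pad between 'type' and 'value'. */
struct demo_reply {
	char type;
	long value;
};

/*
 * Build the reply and copy it to 'dst', which stands in for a userspace
 * buffer. Clearing the whole object first means the padding bytes carry
 * zeros rather than stale stack contents.
 */
void demo_fill_reply(void *dst, char type, long value)
{
	struct demo_reply r;

	assert(dst != NULL);
	memset(&r, 0, sizeof(r));
	r.type = type;
	r.value = value;
	memcpy(dst, &r, sizeof(r));
}
```

Assigning every named member is not sufficient on its own, since that leaves
the padding between ``type`` and ``value`` untouched.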
 301
 302 Memory poisoning
 303 ----------------
 304
 305 When releasing memory, it is best to poison the contents (clear stack on
 306 syscall return, wipe heap memory on a free), to avoid reuse attacks that
 307 rely on the old contents of memory. This frustrates many uninitialized
 308 variable attacks, stack content exposures, heap content exposures, and
 309 use-after-free attacks.
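
Poisoning on release can be sketched as a wrapper that overwrites a region
before freeing it. The names and the fill byte below are illustrative
assumptions, and the caller is assumed to know the allocation size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POISON 0x6b	/* hypothetical fill byte for released memory */

/* Overwrite a region so stale contents cannot be observed or reused. */
void demo_poison(void *p, size_t len)
{
	/* volatile keeps the compiler from discarding the "dead" stores */
	volatile unsigned char *b = p;

	assert(p != NULL || len == 0);
	while (len--)
		*b++ = DEMO_POISON;
}

/* Free wrapper: wipe first, then release. The caller supplies the size. */
void demo_free_poisoned(void *p, size_t len)
{
	if (p == NULL)
		return;
	demo_poison(p, len);
	free(p);
}
```

A recognizable non-zero pattern has the side benefit that a stale pointer
read from freed memory is likely to fault when dereferenced.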
 310
 311 Destination tracking
 312 --------------------
 313
 314 To help kill classes of bugs that result in kernel addresses being
 315 written to userspace, the destination of writes needs to be tracked. If
 316 the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
 317 it should automatically censor sensitive values.