Documentation/admin-guide/mm/concepts.rst

   1 .. _mm_concepts:
   2
   3 =================
   4 Concepts overview
   5 =================
   6
   7 The memory management in Linux is a complex system that has evolved over the
   8 years to include more and more functionality to support a variety of
   9 systems from MMU-less microcontrollers to supercomputers. The memory
  10 management for systems without an MMU is called ``nommu`` and it
  11 definitely deserves a dedicated document, which will hopefully be
  12 written eventually. Yet, although some of the concepts are the same,
  13 here we assume that an MMU is available and a CPU can translate a virtual
  14 address to a physical address.
  15
  16 .. contents:: :local:
  17
  18 Virtual Memory Primer
  19 =====================
  20
  21 The physical memory in a computer system is a limited resource and
  22 even for systems that support memory hotplug there is a hard limit on
  23 the amount of memory that can be installed. The physical memory is not
  24 necessarily contiguous; it might be accessible as a set of distinct
  25 address ranges. Besides, different CPU architectures, and even
  26 different implementations of the same architecture have different views
  27 of how these address ranges are defined.
  28
  29 All this makes dealing directly with physical memory quite complex and
  30 to avoid this complexity a concept of virtual memory was developed.
  31
  32 The virtual memory abstracts the details of physical memory from the
  33 application software, allows keeping only the needed information in the
  34 physical memory (demand paging) and provides a mechanism for the
  35 protection and controlled sharing of data between processes.
  36
  37 With virtual memory, each and every memory access uses a virtual
  38 address. When the CPU decodes an instruction that reads (or
  39 writes) from (or to) the system memory, it translates the `virtual`
  40 address encoded in that instruction to a `physical` address that the
  41 memory controller can understand.
  42
  43 The physical system memory is divided into page frames, or pages. The
  44 size of each page is architecture specific. Some architectures allow
  45 selection of the page size from several supported values; this
  46 selection is performed at the kernel build time by setting an
  47 appropriate kernel configuration option.
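
The page size of the running kernel can be queried from user space. The
sketch below is only an illustration: ``page_size()`` and ``pages_needed()``
are hypothetical helper names, and the only system facility used is the
standard ``sysconf`` page size query.

```python
import mmap
import os


def page_size() -> int:
    """Return the page size, in bytes, reported by the running system."""
    try:
        # Available on Unix-like systems.
        return os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # Portable fallback provided by the Python standard library.
        return mmap.PAGESIZE


def pages_needed(nbytes: int, size: int) -> int:
    """Number of whole pages required to hold nbytes (rounds up)."""
    return -(-nbytes // size)


if __name__ == "__main__":
    size = page_size()
    print(f"page size: {size} bytes")
    print(f"pages needed for 1 MiB: {pages_needed(1 << 20, size)}")
```

The reported value depends on the architecture and on the kernel
configuration, so programs should query it rather than hard-code it.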
  48
  49 Each physical memory page can be mapped as one or more virtual
  50 pages. These mappings are described by page tables that allow
  51 translation from a virtual address used by programs to the physical
  52 memory address. The page tables are organized hierarchically.
  53
  54 The tables at the lowest level of the hierarchy contain physical
  55 addresses of actual pages used by the software. The tables at higher
  56 levels contain physical addresses of the pages belonging to the lower
  57 levels. The pointer to the top level page table resides in a
  58 register. When the CPU performs the address translation, it uses this
  59 register to access the top level page table. The high bits of the
  60 virtual address are used to index an entry in the top level page
  61 table. That entry is then used to access the next level in the
  62 hierarchy with the next bits of the virtual address as the index to
  63 that level page table. The lowest bits in the virtual address define
  64 the offset inside the actual page.
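
The index arithmetic described above can be sketched as follows. The layout
is an assumption chosen for illustration (four levels, 9 index bits per
level and 4 KiB pages, as in the common four-level x86-64 layout); other
architectures and configurations differ. Nested dictionaries stand in for
page tables, and all names are hypothetical.

```python
PAGE_SHIFT = 12      # assumption: 4 KiB pages, so 12 offset bits
BITS_PER_LEVEL = 9   # assumption: 512 entries per table
LEVELS = 4           # assumption: four-level hierarchy


def split_virtual_address(vaddr: int):
    """Return ([top-level index, ..., lowest-level index], page offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    # The highest bits index the top level table, the next bits the next level.
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def translate(vaddr: int, top_table: dict) -> int:
    """Walk the toy page tables and return the physical address."""
    indices, offset = split_virtual_address(vaddr)
    entry = top_table
    for index in indices:
        entry = entry[index]  # a KeyError models a missing mapping
    # The lowest-level entry holds the physical base address of the page.
    return entry + offset


if __name__ == "__main__":
    tables = {1: {2: {3: {4: 0x80000}}}}
    vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
    print(hex(translate(vaddr, tables)))
```

A real walk is performed by hardware and the entries also carry permission
and status bits, which this model leaves out.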
  65
  66 Huge Pages
  67 ==========
  68
  69 The address translation requires several memory accesses and memory
  70 accesses are slow relative to CPU speed. To avoid spending precious
  71 processor cycles on the address translation, CPUs maintain a cache of
  72 such translations called the Translation Lookaside Buffer (or
  73 TLB). Usually the TLB is a pretty scarce resource, and applications with
  74 a large memory working set will experience a performance hit because of
  75 TLB misses.
  76
  77 Many modern CPU architectures allow mapping of the memory pages
  78 directly by the higher levels in the page table. For instance, on x86,
  79 it is possible to map 2M and even 1G pages using entries in the second
  80 and the third level page tables. In Linux such pages are called
  81 `huge`. Usage of huge pages significantly reduces pressure on the TLB,
  82 improves TLB hit-rate and thus improves overall system performance.
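
A rough way to see why larger pages relieve the TLB is to compute how much
memory a fixed number of cached translations can cover. The entry count
below is a hypothetical figure, not a property of any particular CPU, and
the model ignores how real TLBs are organized.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

TLB_ENTRIES = 1024  # hypothetical number of cached translations


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every entry maps one page."""
    return entries * page_size


def translations_needed(working_set: int, page_size: int) -> int:
    """Translations required to map a working set (rounds up)."""
    return -(-working_set // page_size)


if __name__ == "__main__":
    for name, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
        reach = tlb_reach(TLB_ENTRIES, size)
        needed = translations_needed(8 * GIB, size)
        print(f"{name} pages: reach {reach // MIB} MiB, "
              f"{needed} translations for an 8 GiB working set")
```

With these assumed numbers the reach grows from 4 MiB with 4K pages to
2 GiB with 2M pages.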
  83
  84 There are two mechanisms in Linux that enable mapping of the physical
  85 memory with the huge pages. The first one is `HugeTLB filesystem`, or
  86 hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
  87 store. For the files created in this filesystem the data resides in
  88 memory and is mapped using huge pages. The hugetlbfs is described at
  89 :ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
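
The state of the huge page pool is reported in ``/proc/meminfo``. The sketch
below extracts the ``HugePages_Total``, ``HugePages_Free`` and
``Hugepagesize`` fields from text in that format. The helper name and the
sample values are made up for illustration.

```python
# Sample text in /proc/meminfo format; the numbers are made up.
SAMPLE = """\
MemTotal:       16314232 kB
HugePages_Total:      16
HugePages_Free:       12
Hugepagesize:       2048 kB
"""


def hugetlb_pool(meminfo_text: str) -> dict:
    """Return the huge page pool fields found in meminfo-style text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    pool = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Page counts are plain numbers; Hugepagesize is given in kB.
            pool[key] = int(rest.split()[0])
    return pool


if __name__ == "__main__":
    pool = hugetlb_pool(SAMPLE)
    in_use = pool["HugePages_Total"] - pool["HugePages_Free"]
    print(f"{in_use} huge pages of {pool['Hugepagesize']} kB are in use")
```

On a Linux system the same function can be given the contents of
``/proc/meminfo`` instead of the sample text.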
  90
  91 Another, more recent, mechanism that enables use of the huge pages is
  92 called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
  93 requires users and/or system administrators to configure what parts of
  94 the system memory should and can be mapped by the huge pages, THP
  95 manages such mappings transparently to the user and hence the
  96 name. See
  97 :ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
  98 for more details about THP.
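
The current THP mode is exposed in
``/sys/kernel/mm/transparent_hugepage/enabled``, where the active choice is
shown in square brackets. A minimal sketch for reading it follows; the
function names are hypothetical and the file is absent on kernels built
without THP support.

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text: str) -> str:
    """Return the bracketed word, e.g. 'madvise' for 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active mode marked with brackets")


def read_thp_mode(path: str = THP_ENABLED) -> str:
    """Read the sysfs file and return the active mode (Linux only)."""
    with open(path) as f:
        return active_thp_mode(f.read())


if __name__ == "__main__":
    print(active_thp_mode("always [madvise] never\n"))
```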
  99
 100 Zones
 101 =====
 102
 103 Often hardware poses restrictions on how different physical memory
 104 ranges can be accessed. In some cases, devices cannot perform DMA to
 105 all the addressable memory. In other cases, the size of the physical
 106 memory exceeds the maximal addressable size of virtual memory and
 107 special actions are required to access portions of the memory. Linux
 108 groups memory pages into `zones` according to their possible
 109 usage. For example, ZONE_DMA will contain memory that can be used by
 110 devices for DMA, ZONE_HIGHMEM will contain memory that is not
 111 permanently mapped into the kernel's address space, and ZONE_NORMAL will
 112 contain normally addressed pages.
 113
 114 The actual layout of the memory zones is hardware dependent as not all
 115 architectures define all zones, and requirements for DMA are different
 116 for different platforms.
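
The zones that exist on a given machine can be listed from
``/proc/zoneinfo``, whose sections start with lines such as
``Node 0, zone   Normal``. The sketch below collects those headers from text
in that format; the function name and the sample values are illustrative.

```python
import re
from collections import defaultdict

# Matches section headers such as "Node 0, zone      DMA".
ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)", re.MULTILINE)

# Shortened sample in /proc/zoneinfo format; the numbers are made up.
SAMPLE = """\
Node 0, zone      DMA
  pages free     3840
Node 0, zone   Normal
  pages free     51200
Node 1, zone   Normal
  pages free     60000
"""


def zones_by_node(zoneinfo_text: str) -> dict:
    """Map each node number to the list of zone names it contains."""
    zones = defaultdict(list)
    for node, zone in ZONE_HEADER.findall(zoneinfo_text):
        zones[int(node)].append(zone)
    return dict(zones)


if __name__ == "__main__":
    for node, names in sorted(zones_by_node(SAMPLE).items()):
        print(f"node {node}: {', '.join(names)}")
```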
 117
 118 Nodes
 119 =====
 120
 121 Many multi-processor machines are NUMA - Non-Uniform Memory Access -
 122 systems. In such systems the memory is arranged into banks that have
 123 different access latency depending on the "distance" from the
 124 processor. Each bank is referred to as a `node` and for each node Linux
 125 constructs an independent memory management subsystem. A node has its
 126 own set of zones, lists of free and used pages and various statistics
 127 counters. You can find more details about NUMA in
 128 :ref:`Documentation/vm/numa.rst <numa>` and in
 129 :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
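
On a running system the online nodes are listed in
``/sys/devices/system/node/online`` as a comma-separated list of numbers and
ranges, for example ``0-1``. A small parser for that list format might look
like this; the function name is hypothetical.

```python
def parse_node_list(text: str) -> list:
    """Expand a list such as '0-1,4' into [0, 1, 4]."""
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        first, sep, last = part.partition("-")
        end = int(last) if sep else int(first)
        nodes.extend(range(int(first), end + 1))
    return nodes


if __name__ == "__main__":
    print(parse_node_list("0-1\n"))
```

A machine without NUMA typically reports a single node.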
 130
 131 Page cache
 132 ==========
 133
 134 The physical memory is volatile and the common case for getting data
 135 into the memory is to read it from files. Whenever a file is read, the
 136 data is put into the `page cache` to avoid expensive disk access on
 137 the subsequent reads. Similarly, when one writes to a file, the data
 138 is placed in the page cache and eventually gets into the backing
 139 storage device. The written pages are marked as `dirty` and when Linux
 140 decides to reuse them for other purposes, it makes sure to synchronize
 141 the file contents on the device with the updated data.
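
From a program's point of view, a successful write usually means only that
the data has reached the page cache. When the data must be on the storage
device before continuing, the program can request it explicitly with
``fsync``. The sketch below shows the pattern; ``write_durably()`` is a
hypothetical helper name.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> int:
    """Write data to path and wait until the kernel has written it back."""
    with open(path, "wb") as f:
        written = f.write(data)
        f.flush()             # move Python's user space buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to write the dirty pages out
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")
        print(write_durably(target, b"hello page cache"), "bytes written")
```

Without the ``fsync`` call the written pages stay dirty in the page cache
until the kernel decides to write them back.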
 142
 143 Anonymous Memory
 144 ================
 145
 146 The `anonymous memory` or `anonymous mappings` represent memory that
 147 is not backed by a filesystem. Such mappings are implicitly created
 148 for a program's stack and heap or by explicit calls to the mmap(2) system
 149 call. Usually, the anonymous mappings only define virtual memory areas
 150 that the program is allowed to access. The read accesses will result
 151 in creation of a page table entry that references a special physical
 152 page filled with zeroes. When the program performs a write, a regular
 153 physical page will be allocated to hold the written data. The page
 154 will be marked dirty and if the kernel decides to repurpose it,
 155 the dirty page will be swapped out.
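
The behaviour visible to a program can be demonstrated with a private
anonymous mapping. This is a user space sketch for Unix-like systems only;
it shows that reads return zeroes and that writes are retained, but it
cannot observe the page table changes themselves. The function name is
hypothetical.

```python
import mmap


def anonymous_mapping_demo(length: int = 4 * mmap.PAGESIZE):
    """Create a private anonymous mapping, read it, write it, read it again."""
    # fileno -1 means the mapping is not backed by any file.
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    with mmap.mmap(-1, length, flags=flags) as m:
        before = m[:8]     # untouched anonymous memory reads as zeroes
        m[:5] = b"hello"   # the write needs a real page to hold the data
        after = m[:8]
    return before, after


if __name__ == "__main__":
    before, after = anonymous_mapping_demo()
    print(before, after)
```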
 156
 157 Reclaim
 158 =======
 159
 160 Throughout the system lifetime, a physical page can be used for storing
 161 different types of data. It can be kernel internal data structures,
 162 DMA-able buffers for use by device drivers, data read from a filesystem,
 163 memory allocated by user space processes, etc.
 164
 165 Depending on the page usage it is treated differently by the Linux
 166 memory management. The pages that can be freed at any time, either
 167 because they cache the data available elsewhere, for instance, on a
 168 hard disk, or because they can be swapped out, again, to the hard
 169 disk, are called `reclaimable`. The most notable categories of the
 170 reclaimable pages are page cache and anonymous memory.
 171
 172 In most cases, the pages holding internal kernel data and used as DMA
 173 buffers cannot be repurposed, and they remain pinned until freed by
 174 their user. Such pages are called `unreclaimable`. However, in certain
 175 circumstances, even pages occupied with kernel data structures can be
 176 reclaimed. For instance, in-memory caches of filesystem metadata can
 177 be re-read from the storage device and therefore it is possible to
 178 discard them from the main memory when the system is under memory
 179 pressure.
 180
 181 The process of freeing the reclaimable physical memory pages and
 182 repurposing them is called (surprise!) `reclaim`. Linux can reclaim
 183 pages either asynchronously or synchronously, depending on the state
 184 of the system. When the system is not loaded, most of the memory is free
 185 and allocation requests will be satisfied immediately from the free
 186 pages supply. As the load increases, the amount of the free pages goes
 187 down and when it reaches a certain threshold (low watermark), an
 188 allocation request will awaken the ``kswapd`` daemon. It will
 189 asynchronously scan memory pages and either just free them if the data
 190 they contain is available elsewhere, or evict to the backing storage
 191 device (remember those dirty pages?). As memory usage increases even
 192 more and reaches another threshold - min watermark - an allocation
 193 will trigger `direct reclaim`. In this case allocation is stalled
 194 until enough memory pages are reclaimed to satisfy the request.
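
The decision described above can be summarized as a comparison of the free
page count against two thresholds. The function below is a deliberately
simplified model with hypothetical names: the threshold at which ``kswapd``
is woken and the min watermark are plain parameters, whereas the kernel
tracks watermarks per zone and takes more factors into account.

```python
def reclaim_action(free_pages: int, wake_threshold: int, min_watermark: int) -> str:
    """Classify what an allocation request triggers in this toy model."""
    if min_watermark > wake_threshold:
        raise ValueError("min watermark must not exceed the kswapd threshold")
    if free_pages <= min_watermark:
        # The allocation is stalled until enough pages are reclaimed.
        return "direct reclaim"
    if free_pages <= wake_threshold:
        # Reclaim proceeds in the background; the allocation is not stalled.
        return "wake kswapd"
    return "allocate from free pages"


if __name__ == "__main__":
    for free in (10000, 900, 100):
        print(free, "->", reclaim_action(free, wake_threshold=1000, min_watermark=250))
```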
 195
 196 Compaction
 197 ==========
 198
 199 As the system runs, tasks allocate and free the memory and it becomes
 200 fragmented. Although with virtual memory it is possible to present
 201 scattered physical pages as a virtually contiguous range, sometimes it is
 202 necessary to allocate large physically contiguous memory areas. Such
 203 need may arise, for instance, when a device driver requires a large
 204 buffer for DMA, or when THP allocates a huge page. Memory `compaction`
 205 addresses the fragmentation issue. This mechanism moves occupied pages
 206 from the lower part of a memory zone to free pages in the upper part
 207 of the zone. When a compaction scan is finished, free pages are grouped
 208 together at the beginning of the zone and allocations of large
 209 physically contiguous areas become possible.
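
The movement of pages described above can be modelled with two scanners
that start at opposite ends of a zone and meet in the middle. The following
is a toy model, not the kernel's implementation: the zone is a plain list,
``None`` marks a free page, and the function name is hypothetical.

```python
def compact(zone: list) -> list:
    """Return a copy of zone with occupied pages moved to the upper part."""
    pages = list(zone)
    migrate = 0            # scans upward from the start for occupied pages
    free = len(pages) - 1  # scans downward from the end for free pages
    while True:
        while migrate < free and pages[migrate] is None:
            migrate += 1
        while migrate < free and pages[free] is not None:
            free -= 1
        if migrate >= free:
            break          # the scanners met: the scan is finished
        # Move the occupied page into the free page found in the upper part.
        pages[free], pages[migrate] = pages[migrate], None
    return pages


if __name__ == "__main__":
    before = ["A", None, "B", None, None, "C"]
    print(before)
    print(compact(before))
```

After the scan the free pages form one contiguous run at the beginning of
the list, which is what makes a large contiguous allocation possible.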
 210
 211 Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
 212 daemon or synchronously as a result of a memory allocation request.
 213
 214 OOM killer
 215 ==========
 216
 217 It is possible that on a loaded machine memory will be exhausted and the
 218 kernel will be unable to reclaim enough memory to continue to operate. In
 219 order to save the rest of the system, it invokes the `OOM killer`.
 220
 221 The `OOM killer` selects a task to sacrifice for the sake of the overall
 222 system health. The selected task is killed in the hope that after it exits
 223 enough memory will be freed to continue normal operation.
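
The kernel exposes a per-process score in ``/proc/<pid>/oom_score``; a
higher value means the process is more likely to be selected. The sketch
below reads those scores and reports the process with the highest one. It is
only a user space approximation with hypothetical function names, and it
does not reproduce the selection logic used by the kernel.

```python
import os


def read_oom_scores(proc: str = "/proc") -> dict:
    """Map each PID to its oom_score, skipping processes that vanish."""
    scores = {}
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "oom_score")) as f:
                scores[int(name)] = int(f.read())
        except (OSError, ValueError):
            continue  # the process exited or the file is unreadable
    return scores


def likely_victim(scores: dict) -> int:
    """Return the PID with the highest score."""
    if not scores:
        raise ValueError("no scores given")
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up scores; on Linux, read_oom_scores() supplies real ones.
    print(likely_victim({1: 0, 812: 35, 4077: 668}))
```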