Documentation/vm/hugetlbfs_reserv.txt

   1 Hugetlbfs Reservation Overview
   2 ------------------------------
   3 Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically
   4 preallocated for application use.  These huge pages are instantiated in a
   5 task's address space at page fault time if the VMA indicates huge pages are
   6 to be used.  If no huge page exists at page fault time, the task is sent
   7 a SIGBUS and often dies an unhappy death.  Shortly after huge page support
   8 was added, it was determined that it would be better to detect a shortage
   9 of huge pages at mmap() time.  The idea is that if there were not enough
  10 huge pages to cover the mapping, the mmap() would fail.  This was first
  11 done with a simple check in the code at mmap() time to determine if there
  12 were enough free huge pages to cover the mapping.  Like most things in the
  13 kernel, the code has evolved over time.  However, the basic idea was to
  14 'reserve' huge pages at mmap() time to ensure that huge pages would be
  15 available for page faults in that mapping.  The text below describes how
  16 huge page reserve processing is done in the v4.10 kernel.
  17
  18
  19 Audience
  20 --------
  21 This description is primarily targeted at kernel developers who are modifying
  22 hugetlbfs code.
  23
  24
  25 The Data Structures
  26 -------------------
  27 resv_huge_pages
  28         This is a global (per-hstate) count of reserved huge pages.  Reserved
  29         huge pages are only available to the task which reserved them.
  30         Therefore, the number of huge pages generally available is computed
  31         as (free_huge_pages - resv_huge_pages).
  32 Reserve Map
  33         A reserve map is described by the structure:
  34         struct resv_map {
  35                 struct kref refs;
  36                 spinlock_t lock;
  37                 struct list_head regions;
  38                 long adds_in_progress;
  39                 struct list_head region_cache;
  40                 long region_cache_count;
  41         };
  42         There is one reserve map for each huge page mapping in the system.
  43         The regions list within the resv_map describes the regions within
  44         the mapping.  A region is described as:
  45         struct file_region {
  46                 struct list_head link;
  47                 long from;
  48                 long to;
  49         };
  50         The 'from' and 'to' fields of the file region structure are huge page
  51         indices into the mapping.  Depending on the type of mapping, a
  52         region in the resv_map may indicate reservations exist for the
  53         range, or reservations do not exist.
  54 Flags for MAP_PRIVATE Reservations
  55         These are stored in the bottom bits of the reservation map pointer.
  56         #define HPAGE_RESV_OWNER    (1UL << 0) Indicates this task is the
  57                 owner of the reservations associated with the mapping.
  58         #define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally
  59                 mapping this range (and creating reserves) has unmapped a
  60                 page from this task (the child) due to a failed COW.
  61 Page Flags
  62         The PagePrivate page flag is used to indicate that a huge page
  63         reservation must be restored when the huge page is freed.  More
  64         details will be discussed in the "Freeing huge pages" section.
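
The encoding described under "Flags for MAP_PRIVATE Reservations" can be
illustrated with a small user-space sketch.  The helper names (resv_pack,
resv_map_of, resv_flags_of), the HPAGE_RESV_MASK macro as used here, and the
stand-in structure are hypothetical illustrations, not the kernel's own
helpers.  The sketch only assumes that the map structure is aligned to at
least 4 bytes, so the two low pointer bits are free to carry flags.

```c
#include <assert.h>
#include <stdint.h>

#define HPAGE_RESV_OWNER    (1UL << 0)
#define HPAGE_RESV_UNMAPPED (1UL << 1)
#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

/* Stand-in for struct resv_map; only its alignment matters here. */
struct toy_resv_map {
	long regions;
};

/* Combine a map pointer and flag bits into a single word. */
static uintptr_t resv_pack(struct toy_resv_map *map, unsigned long flags)
{
	/* The low bits of an aligned pointer must be zero to be reusable. */
	assert(((uintptr_t)map & HPAGE_RESV_MASK) == 0);
	return (uintptr_t)map | (flags & HPAGE_RESV_MASK);
}

/* Recover the map pointer by masking off the flag bits. */
static struct toy_resv_map *resv_map_of(uintptr_t word)
{
	return (struct toy_resv_map *)(word & ~(uintptr_t)HPAGE_RESV_MASK);
}

/* Recover only the flag bits. */
static unsigned long resv_flags_of(uintptr_t word)
{
	return (unsigned long)(word & HPAGE_RESV_MASK);
}
```

Because the flags travel with the pointer, no extra field is needed to record
ownership for each private mapping.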
  65
  66
  67 Reservation Map Location (Private or Shared)
  68 --------------------------------------------
  69 A huge page mapping or segment is either private or shared.  If private,
  70 it is typically only available to a single address space (task).  If shared,
  71 it can be mapped into multiple address spaces (tasks).  The location and
  72 semantics of the reservation map differ significantly between the two types
  73 of mappings.  Location differences are:
  74 - For private mappings, the reservation map hangs off the VMA structure.
  75   Specifically, vma->vm_private_data.  This reserve map is created at the
  76   time the mapping (mmap(MAP_PRIVATE)) is created.
  77 - For shared mappings, the reservation map hangs off the inode.  Specifically,
  78   inode->i_mapping->private_data.  Since shared mappings are always backed
  79   by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
  80   contains a reservation map.  As a result, the reservation map is allocated
  81   when the inode is created.
  82
  83
  84 Creating Reservations
  85 ---------------------
  86 Reservations are created when a huge page backed shared memory segment is
  87 created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
  88 These operations result in a call to the routine hugetlb_reserve_pages():
  89
  90 int hugetlb_reserve_pages(struct inode *inode,
  91                                         long from, long to,
  92                                         struct vm_area_struct *vma,
  93                                         vm_flags_t vm_flags)
  94
  95 The first thing hugetlb_reserve_pages() does is check whether the NORESERVE
  96 flag was specified in either the shmget() or mmap() call.  If NORESERVE
  97 was specified, then this routine returns immediately as no reservations
  98 are desired.
  99
 100 The arguments 'from' and 'to' are huge page indices into the mapping or
 101 underlying file.  For shmget(), 'from' is always 0 and 'to' corresponds to
 102 the length of the segment/mapping.  For mmap(), the offset argument could
 103 be used to specify the offset into the underlying file.  In such a case
 104 the 'from' and 'to' arguments have been adjusted by this offset.
 105
 106 One of the big differences between PRIVATE and SHARED mappings is the way
 107 in which reservations are represented in the reservation map.
 108 - For shared mappings, an entry in the reservation map indicates a reservation
 109   exists or did exist for the corresponding page.  As reservations are
 110   consumed, the reservation map is not modified.
 111 - For private mappings, the lack of an entry in the reservation map indicates
 112   a reservation exists for the corresponding page.  As reservations are
 113   consumed, entries are added to the reservation map.  Therefore, the
 114   reservation map can also be used to determine which reservations have
 115   been consumed.
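
The opposite meaning of a map entry for the two mapping types can be
summarized in a few lines.  This is a minimal sketch, not kernel code: the
enum and the function toy_needs_reservation() are hypothetical, and the
private case assumes the faulting task is the owner of the reservations.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_mapping_type { TOY_SHARED, TOY_PRIVATE };

/*
 * Return 1 if faulting in a page would need a new reservation, or 0 if
 * an existing reservation covers it.  'entry_present' says whether the
 * page index is covered by a file_region in the reservation map.
 */
static long toy_needs_reservation(enum toy_mapping_type type,
				  bool entry_present)
{
	assert(type == TOY_SHARED || type == TOY_PRIVATE);

	/* Shared: an entry means a reservation exists (or did exist). */
	if (type == TOY_SHARED)
		return entry_present ? 0 : 1;

	/* Private: an entry means the reservation was already consumed. */
	return entry_present ? 1 : 0;
}
```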
 116
 117 For private mappings, hugetlb_reserve_pages() creates the reservation map and
 118 hangs it off the VMA structure.  In addition, the HPAGE_RESV_OWNER flag is set
 119 to indicate this VMA owns the reservations.
 120
 121 The reservation map is consulted to determine how many huge page reservations
 122 are needed for the current mapping/segment.  For private mappings, this is
 123 always (to - from).  For shared mappings, some reservations may already exist
 124 in the range.  See the section "Reservation Map Modifications" for details.
 125
 126 The mapping may be associated with a subpool.  If so, the subpool is consulted
 127 to ensure there is sufficient space for the mapping.  It is possible that the
 128 subpool has set aside reservations that can be used for the mapping.  See the
 129 section "Subpool Reservations" for more details.
 130
 131 After consulting the reservation map and subpool, the number of needed new
 132 reservations is known.  The routine hugetlb_acct_memory() is called to check
 133 for and take the requested number of reservations.  hugetlb_acct_memory()
 134 calls into routines that potentially allocate and adjust surplus page counts.
 135 However, within those routines the code is simply checking to ensure there
 136 are enough free huge pages to accommodate the reservation.  If there are,
 137 the global reservation count resv_huge_pages is adjusted something like the
 138 following.
 139         if (resv_needed <= (free_huge_pages - resv_huge_pages))
 140                 resv_huge_pages += resv_needed;
 141 Note that the global lock hugetlb_lock is held when checking and adjusting
 142 these counters.
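
As a rough user-space model of this check, the sketch below treats the number
of generally available pages as (free_huge_pages - resv_huge_pages), matching
the definition given for resv_huge_pages earlier.  The structure and the
function toy_reserve() are hypothetical.  Locking with hugetlb_lock and
surplus page allocation are deliberately left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the two per-hstate counters of interest. */
struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

/*
 * Try to take 'needed' reservations.  Returns 0 on success, or -1 if
 * there are not enough unreserved free huge pages.
 */
static int toy_reserve(struct toy_hstate *h, long needed)
{
	long available = h->free_huge_pages - h->resv_huge_pages;

	assert(needed >= 0);
	if (needed > available)
		return -1;

	h->resv_huge_pages += needed;
	return 0;
}
```

With 10 free pages of which 4 are already reserved, a request for 6 succeeds
and any further request fails until pages are freed or reservations released.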
 143
 144 If there were enough free huge pages and the global count resv_huge_pages
 145 was adjusted, then the reservation map associated with the mapping is
 146 modified to reflect the reservations.  In the case of a shared mapping, a
 147 file_region will exist that includes the range 'from' - 'to'.  For private
 148 mappings, no modifications are made to the reservation map as lack of an
 149 entry indicates a reservation exists.
 150
 151 If hugetlb_reserve_pages() was successful, the global reservation count and
 152 reservation map associated with the mapping will be modified as required to
 153 ensure reservations exist for the range 'from' - 'to'.
 154
 155
 156 Consuming Reservations/Allocating a Huge Page
 157 ---------------------------------------------
 158 Reservations are consumed when huge pages associated with the reservations
 159 are allocated and instantiated in the corresponding mapping.  The allocation
 160 is performed within the routine alloc_huge_page().
 161 struct page *alloc_huge_page(struct vm_area_struct *vma,
 162                                     unsigned long addr, int avoid_reserve)
 163 alloc_huge_page is passed a VMA pointer and a virtual address, so it can
 164 consult the reservation map to determine if a reservation exists.  In addition,
 165 alloc_huge_page takes the argument avoid_reserve which indicates reserves
 166 should not be used even if it appears they have been set aside for the
 167 specified address.  The avoid_reserve argument is most often used in the case
 168 of Copy on Write and Page Migration where additional copies of an existing
 169 page are being allocated.
 170
 171 The helper routine vma_needs_reservation() is called to determine if a
 172 reservation exists for the address within the mapping (vma).  See the section
 173 "Reservation Map Helper Routines" for detailed information on what this
 174 routine does.  The value returned from vma_needs_reservation() is generally
 175 0 or 1.  0 if a reservation exists for the address, 1 if no reservation exists.
 176 If a reservation does not exist, and there is a subpool associated with the
 177 mapping, the subpool is consulted to determine if it contains reservations.
 178 If the subpool contains reservations, one can be used for this allocation.
 179 However, in every case the avoid_reserve argument overrides the use of
 180 a reservation for the allocation.  After determining whether a reservation
 181 exists and can be used for the allocation, the routine dequeue_huge_page_vma()
 182 is called.  This routine takes two arguments related to reservations:
 183 - avoid_reserve, this is the same value/argument passed to alloc_huge_page()
 184 - chg, even though this argument is of type long only the values 0 or 1 are
 185   passed to dequeue_huge_page_vma.  If the value is 0, it indicates a
 186   reservation exists (see the section "Reservations and Memory Policy" for
 187   possible issues).  If the value is 1, it indicates a reservation does not
 188   exist and the page must be taken from the global free pool if possible.
 189 The free lists associated with the memory policy of the VMA are searched for
 190 a free page.  If a page is found, the value free_huge_pages is decremented
 191 when the page is removed from the free list.  If there was a reservation
 192 associated with the page, the following adjustments are made:
 193         SetPagePrivate(page);   /* Indicates allocating this page consumed
 194                                  * a reservation, and if an error is
 195                                  * encountered such that the page must be
 196                                  * freed, the reservation will be restored. */
 197         resv_huge_pages--;      /* Decrement the global reservation count */
 198 Note, if no huge page can be found that satisfies the VMA's memory policy
 199 an attempt will be made to allocate one using the buddy allocator.  This
 200 brings up the issue of surplus huge pages and overcommit which is beyond
 201 the scope of reservations.  Even if a surplus page is allocated, the same
 202 reservation based adjustments as above will be made: SetPagePrivate(page) and
 203 resv_huge_pages--.
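
The counter adjustments made when a page is taken from the free lists can be
modelled as follows.  This is a simplified sketch: toy_dequeue() and both
structures are hypothetical, memory policy and the buddy allocator fallback
are ignored, and the rule that reserved pages are off limits to a caller
without a usable reservation is an assumption taken from the description of
resv_huge_pages.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

struct toy_page {
	bool private_flag;	/* models the PagePrivate flag */
};

/* Returns true if a page was taken from the free pool. */
static bool toy_dequeue(struct toy_hstate *h, struct toy_page *page,
			long chg, bool avoid_reserve)
{
	bool have_resv = (chg == 0) && !avoid_reserve;
	long available = h->free_huge_pages - h->resv_huge_pages;

	assert(chg == 0 || chg == 1);
	if (h->free_huge_pages == 0)
		return false;
	/* Without a usable reservation, reserved pages may not be taken. */
	if (!have_resv && available <= 0)
		return false;

	h->free_huge_pages--;
	page->private_flag = false;
	if (have_resv) {
		page->private_flag = true;	/* reservation consumed */
		h->resv_huge_pages--;
	}
	return true;
}
```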
 204
 205 After obtaining a new huge page, (page)->private is set to the value of
 206 the subpool associated with the page if it exists.  This will be used for
 207 subpool accounting when the page is freed.
 208
 209 The routine vma_commit_reservation() is then called to adjust the reserve
 210 map based on the consumption of the reservation.  In general, this involves
 211 ensuring the page is represented within a file_region structure of the reserve
 212 map.  For shared mappings where the reservation was present, an entry
 213 in the reserve map already existed so no change is made.  However, if there
 214 was no reservation in a shared mapping or this was a private mapping, a new
 215 entry must be created.
 215 entry must be created.
 216
 217 It is possible that the reserve map could have been changed between the call
 218 to vma_needs_reservation() at the beginning of alloc_huge_page() and the
 219 call to vma_commit_reservation() after the page was allocated.  This would
 220 be possible if hugetlb_reserve_pages was called for the same page in a shared
 221 mapping.  In such cases, the reservation count and subpool free page count
 222 will be off by one.  This rare condition can be identified by comparing the
 223 return value from vma_needs_reservation and vma_commit_reservation.  If such
 224 a race is detected, the subpool and global reserve counts are adjusted to
 225 compensate.  See the section "Reservation Map Helper Routines" for more
 226 information on these routines.
 227
 228
 229 Instantiate Huge Pages
 230 ----------------------
 231 After huge page allocation, the page is typically added to the page tables
 232 of the allocating task.  Before this, pages in a shared mapping are added
 233 to the page cache and pages in private mappings are added to an anonymous
 234 reverse mapping.  In both cases, the PagePrivate flag is cleared.  Therefore,
 235 when a huge page that has been instantiated is freed no adjustment is made
 236 to the global reservation count (resv_huge_pages).
 237
 238
 239 Freeing Huge Pages
 240 ------------------
 241 Huge page freeing is performed by the routine free_huge_page().  This routine
 242 is the destructor for hugetlbfs compound pages.  As a result, it is only
 243 passed a pointer to the page struct.  When a huge page is freed, reservation
 244 accounting may need to be performed.  This would be the case if the page was
 245 associated with a subpool that contained reserves, or the page is being freed
 246 on an error path where a global reserve count must be restored.
 247
 248 The page->private field points to any subpool associated with the page.
 249 If the PagePrivate flag is set, it indicates the global reserve count should
 250 be adjusted (see the section "Consuming Reservations/Allocating a Huge Page"
 251 for information on how these are set).
 252
 253 The routine first calls hugepage_subpool_put_pages() for the page.  If this
 254 routine returns a value of 0 (which does not equal the value passed, 1), it
 255 indicates reserves are associated with the subpool, and this newly free page
 256 must be used to keep the number of subpool reserves above the minimum size.
 257 Therefore, the global resv_huge_pages counter is incremented in this case.
 258
 259 If the PagePrivate flag was set in the page, the global resv_huge_pages counter
 260 will always be incremented.
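
The effect of the PagePrivate flag at free time can be sketched in user space
as shown here.  The names toy_instantiate() and toy_free() and both structures
are hypothetical.  Subpool accounting and surplus pages are left out, so only
the global counters are modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

struct toy_page {
	bool private_flag;	/* models the PagePrivate flag */
};

/* Instantiating a page clears the flag: the reservation stays consumed. */
static void toy_instantiate(struct toy_page *page)
{
	assert(page);
	page->private_flag = false;
}

/* Free a page, restoring the reservation if the flag is still set. */
static void toy_free(struct toy_hstate *h, struct toy_page *page)
{
	assert(h && page);
	if (page->private_flag) {
		h->resv_huge_pages++;
		page->private_flag = false;
	}
	h->free_huge_pages++;
}
```

A page freed on an error path before instantiation gives its reservation
back, while a page that was instantiated only returns to the free pool.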
 261
 262
 263 Subpool Reservations
 264 --------------------
 265 There is a struct hstate associated with each huge page size.  The hstate
 266 tracks all huge pages of the specified size.  A subpool represents a subset
 267 of pages within a hstate that is associated with a mounted hugetlbfs
 268 filesystem.
 269
 270 When a hugetlbfs filesystem is mounted a min_size option can be specified
 271 which indicates the minimum number of huge pages required by the filesystem.
 272 If this option is specified, the number of huge pages corresponding to
 273 min_size are reserved for use by the filesystem.  This number is tracked in
 274 the min_hpages field of a struct hugepage_subpool.  At mount time,
 275 hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
 276 huge pages.  If they can not be reserved, the mount fails.
 277
 278 The routines hugepage_subpool_get/put_pages() are called when pages are
 279 obtained from or released back to a subpool.  They perform all subpool
 280 accounting, and track any reservations associated with the subpool.
 281 hugepage_subpool_get/put_pages are passed the number of huge pages by which
 282 to adjust the subpool 'used page' count (up for get, down for put).  Normally,
 283 they return the same value that was passed or an error if not enough pages
 284 exist in the subpool.
 285
 286 However, if reserves are associated with the subpool a return value less
 287 than the passed value may be returned.  This return value indicates the
 288 number of additional global pool adjustments which must be made.  For example,
 289 suppose a subpool contains 3 reserved huge pages and someone asks for 5.
 290 The 3 reserved pages associated with the subpool can be used to satisfy part
 291 of the request.  But, 2 pages must be obtained from the global pools.  To
 292 relay this information to the caller, the value 2 is returned.  The caller
 293 is then responsible for attempting to obtain the additional two pages from
 294 the global pools.
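
The arithmetic of that example can be checked with a small sketch.  The
structure, its field and toy_subpool_get_pages() are hypothetical
simplifications: only the subpool's remaining reserves are tracked, and any
maximum size limit on the subpool is ignored.

```c
#include <assert.h>

/* Hypothetical subpool model holding only its remaining reserves. */
struct toy_subpool {
	long rsv_hpages;
};

/*
 * Take 'delta' pages from the subpool.  Returns the number of pages
 * that the caller must still obtain from the global pools.
 */
static long toy_subpool_get_pages(struct toy_subpool *sp, long delta)
{
	long from_reserve;

	assert(delta >= 0);
	/* Use as many subpool reserves as the request allows. */
	from_reserve = delta < sp->rsv_hpages ? delta : sp->rsv_hpages;
	sp->rsv_hpages -= from_reserve;

	return delta - from_reserve;
}
```

For a subpool holding 3 reserves, a request for 5 returns 2 and leaves no
reserves behind, while a request for 2 returns 0 and leaves 1 reserve.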
 295
 296
 297 COW and Reservations
 298 --------------------
 299 Since shared mappings all point to and use the same underlying pages, the
 300 biggest reservation concern for COW is private mappings.  In this case,
 301 two tasks can be pointing at the same previously allocated page.  One task
 302 attempts to write to the page, so a new page must be allocated so that each
 303 task points to its own page.
 304
 305 When the page was originally allocated, the reservation for that page was
 306 consumed.  When an attempt to allocate a new page is made as a result of
 307 COW, it is possible that no huge pages are free and the allocation
 308 will fail.
 309
 310 When the private mapping was originally created, the owner of the mapping
 311 was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
 312 map of the owner.  Since the owner created the mapping, the owner owns all
 313 the reservations associated with the mapping.  Therefore, when a write fault
 314 occurs and there is no page available, different action is taken for the owner
 315 and non-owner of the reservation.
 316
 317 In the case where the faulting task is not the owner, the fault will fail and
 318 the task will typically receive a SIGBUS.
 319
 320 If the owner is the faulting task, we want it to succeed since it owned the
 321 original reservation.  To accomplish this, the page is unmapped from the
 322 non-owning task.  In this way, the only reference is from the owning task.
 323 In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
 324 of the non-owning task.  The non-owning task may receive a SIGBUS if it later
 325 faults on a non-present page.  But, the original owner of the
 326 mapping/reservation will behave as expected.
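
The owner versus non-owner decision can be written down as a short sketch.
Everything here is hypothetical except the two flag names: the toy_vma
structure keeps the flag bits in a plain field rather than in a pointer, and
toy_cow_alloc_failed() models only the case where the COW allocation found no
free huge page.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_RESV_OWNER    (1UL << 0)
#define HPAGE_RESV_UNMAPPED (1UL << 1)

struct toy_vma {
	unsigned long resv_flags;	/* flag bits of the map pointer */
	bool page_mapped;		/* is the shared page still mapped? */
};

enum toy_cow_result { TOY_COW_SIGBUS, TOY_COW_OWNER_PROCEEDS };

/* Called when a write fault could not allocate a new huge page. */
static enum toy_cow_result toy_cow_alloc_failed(struct toy_vma *faulting,
						struct toy_vma *other)
{
	assert(faulting != other);

	/* A non-owner has no claim on the reservations: the fault fails. */
	if (!(faulting->resv_flags & HPAGE_RESV_OWNER))
		return TOY_COW_SIGBUS;

	/* The owner wins: take the page away from the other task. */
	other->page_mapped = false;
	other->resv_flags |= HPAGE_RESV_UNMAPPED;
	return TOY_COW_OWNER_PROCEEDS;
}
```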
 327
 328
 329 Reservation Map Modifications
 330 -----------------------------
 331 The following low level routines are used to make modifications to a
 332 reservation map.  Typically, these routines are not called directly.  Rather,
 333 a reservation map helper routine is called which calls one of these low level
 334 routines.  These low level routines are fairly well documented in the source
 335 code (mm/hugetlb.c).  These routines are:
 336 long region_chg(struct resv_map *resv, long f, long t);
 337 long region_add(struct resv_map *resv, long f, long t);
 338 void region_abort(struct resv_map *resv, long f, long t);
 339 long region_count(struct resv_map *resv, long f, long t);
 340
 341 Operations on the reservation map typically involve two operations:
 342 1) region_chg() is called to examine the reserve map and determine how
 343    many pages in the specified range [f, t) are NOT currently represented.
 344
 345    The calling code performs global checks and allocations to determine if
 346    there are enough huge pages for the operation to succeed.
 347
 348 2a) If the operation can succeed, region_add() is called to actually modify
 349     the reservation map for the same range [f, t) previously passed to
 350     region_chg().
 351 2b) If the operation can not succeed, region_abort is called for the same range
 352     [f, t) to abort the operation.
 353
 354 Note that this is a two step process where region_add() and region_abort()
 355 are guaranteed to succeed after a prior call to region_chg() for the same
 356 range.  region_chg() is responsible for pre-allocating any data structures
 357 necessary to ensure the subsequent operations (specifically region_add())
 358 will succeed.
 359
 360 As mentioned above, region_chg() determines the number of pages in the range
 361 which are NOT currently represented in the map.  This number is returned to
 362 the caller.  region_add() returns the number of pages in the range added to
 363 the map.  In most cases, the return value of region_add() is the same as the
 364 return value of region_chg().  However, in the case of shared mappings it is
 365 possible for changes to the reservation map to be made between the calls to
 366 region_chg() and region_add().  In this case, the return value of region_add()
 367 will not match the return value of region_chg().  It is likely that in such
 368 cases global counts and subpool accounting will be incorrect and in need of
 369 adjustment.  It is the responsibility of the caller to check for this condition
 370 and make the appropriate adjustments.
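
The two step protocol and the mismatch it can expose are sketched here with a
toy model.  This is not how the kernel stores regions: a fixed boolean array
replaces the file_region list, the toy_ names are hypothetical, and no
locking or preallocation is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_PAGES 16

/* One flag per huge page index instead of a list of file_regions. */
struct toy_resv_map {
	bool present[TOY_MAP_PAGES];
};

/* Count the pages in [f, t) that are NOT represented in the map. */
static long toy_region_chg(const struct toy_resv_map *resv, long f, long t)
{
	long i, missing = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (i = f; i < t; i++)
		if (!resv->present[i])
			missing++;
	return missing;
}

/* Represent [f, t) in the map; return the number of pages newly added. */
static long toy_region_add(struct toy_resv_map *resv, long f, long t)
{
	long i, added = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (i = f; i < t; i++) {
		if (!resv->present[i]) {
			resv->present[i] = true;
			added++;
		}
	}
	return added;
}
```

If another caller adds a page in the range between the two calls, the value
returned by toy_region_add() is smaller than the earlier toy_region_chg()
result, which is the signal that counts need adjusting.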
 371
 372 The routine region_del() is called to remove regions from a reservation map.
 373 It is typically called in the following situations:
 374 - When a file in the hugetlbfs filesystem is being removed, the inode will
 375   be released and the reservation map freed.  Before freeing the reservation
 376   map, all the individual file_region structures must be freed.  In this case
 377   region_del is passed the range [0, LONG_MAX).
 378 - When a hugetlbfs file is being truncated.  In this case, all allocated pages
 379   after the new file size must be freed.  In addition, any file_region entries
 380   in the reservation map past the new end of file must be deleted.  In this
 381   case, region_del is passed the range [new_end_of_file, LONG_MAX).
 382 - When a hole is being punched in a hugetlbfs file.  In this case, huge pages
 383   are removed from the middle of the file one at a time.  As the pages are
 384   removed, region_del() is called to remove the corresponding entry from the
 385   reservation map.  In this case, region_del is passed the range
 386   [page_idx, page_idx + 1).
 387 In every case, region_del() will return the number of pages removed from the
 388 reservation map.  In VERY rare cases, region_del() can fail.  This can only
 389 happen in the hole punch case where it has to split an existing file_region
 390 entry and can not allocate a new structure.  In this error case, region_del()
 391 will return -ENOMEM.  The problem here is that the reservation map will
 392 indicate that there is a reservation for the page.  However, the subpool and
 393 global reservation counts will not reflect the reservation.  To handle this
 394 situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
 395 counters so that they correspond with the reservation map entry that could
 396 not be deleted.
 397
 398 region_count() is called when unmapping a private huge page mapping.  In
 399 private mappings, the lack of an entry in the reservation map indicates that
 400 a reservation exists.  Therefore, by counting the number of entries in the
 401 reservation map we know how many reservations were consumed and how many are
 402 outstanding (outstanding = (end - start) - region_count(resv, start, end)).
 403 Since the mapping is going away, the subpool and global reservation counts
 404 are decremented by the number of outstanding reservations.
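
The outstanding reservation calculation for a private mapping is simple
arithmetic, shown here on a toy model.  The boolean array standing in for the
reservation map and the toy_ function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_PAGES 16

/* One flag per huge page index; true means an entry exists. */
struct toy_resv_map {
	bool present[TOY_MAP_PAGES];
};

/* Count the pages in [f, t) that are represented in the map. */
static long toy_region_count(const struct toy_resv_map *resv, long f, long t)
{
	long i, count = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (i = f; i < t; i++)
		if (resv->present[i])
			count++;
	return count;
}

/* Private mapping: entries are consumed reservations, gaps are unused. */
static long toy_outstanding(const struct toy_resv_map *resv,
			    long start, long end)
{
	return (end - start) - toy_region_count(resv, start, end);
}
```

For an 8 page private mapping in which 2 pages have been faulted in, 6
reservations are still outstanding and would be given back at unmap time.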
 405
 406
 407 Reservation Map Helper Routines
 408 -------------------------------
 409 Several helper routines exist to query and modify the reservation maps.
 410 These routines are only interested in reservations for a specific huge
 411 page, so they just pass in an address instead of a range.  In addition,
 412 they pass in the associated VMA.  From the VMA, the type of mapping (private
 413 or shared) and the location of the reservation map (inode or VMA) can be
 414 determined.  These routines simply call the underlying routines described
 415 in the section "Reservation Map Modifications".  However, they do take into
 416 account the 'opposite' meaning of reservation map entries for private and
 417 shared mappings and hide this detail from the caller.
 418
 419 long vma_needs_reservation(struct hstate *h,
 420                                 struct vm_area_struct *vma, unsigned long addr)
 421 This routine calls region_chg() for the specified page.  If no reservation
 422 exists, 1 is returned.  If a reservation exists, 0 is returned.
 423
 424 long vma_commit_reservation(struct hstate *h,
 425                                 struct vm_area_struct *vma, unsigned long addr)
 426 This calls region_add() for the specified page.  As in the case of region_chg
 427 and region_add, this routine is to be called after a previous call to
 428 vma_needs_reservation.  It will add a reservation entry for the page.  It
 429 returns 1 if the reservation was added and 0 if not.  The return value should
 430 be compared with the return value of the previous call to
 431 vma_needs_reservation.  An unexpected difference indicates the reservation
 432 map was modified between calls.
 433
 434 void vma_end_reservation(struct hstate *h,
 435                                 struct vm_area_struct *vma, unsigned long addr)
 436 This calls region_abort() for the specified page.  As in the case of region_chg
 437 and region_abort, this routine is to be called after a previous call to
 438 vma_needs_reservation.  It will abort/end the in progress reservation add
 439 operation.
 440
 441 long vma_add_reservation(struct hstate *h,
 442                                 struct vm_area_struct *vma, unsigned long addr)
 443 This is a special wrapper routine to help facilitate reservation cleanup
 444 on error paths.  It is only called from the routine restore_reserve_on_error().
 445 This routine is used in conjunction with vma_needs_reservation in an attempt
 446 to add a reservation to the reservation map.  It takes into account the
 447 different reservation map semantics for private and shared mappings.  Hence,
 448 region_add is called for shared mappings (as an entry present in the map
 449 indicates a reservation), and region_del is called for private mappings (as
 450 the absence of an entry in the map indicates a reservation).  See the section
 451 "Reservation Cleanup in Error Paths" for more information on what needs to
 452 be done on error paths.
 453
 454
 455 Reservation Cleanup in Error Paths
 456 ----------------------------------
 457 As mentioned in the section "Reservation Map Helper Routines", reservation
 458 map modifications are performed in two steps.  First vma_needs_reservation
 459 is called before a page is allocated.  If the allocation is successful,
 460 then vma_commit_reservation is called.  If not, vma_end_reservation is called.
 461 Global and subpool reservation counts are adjusted based on success or failure
 462 of the operation and all is well.
 463
 464 Additionally, after a huge page is instantiated the PagePrivate flag is
 465 cleared so that accounting when the page is ultimately freed is correct.
 466
 467 However, there are several instances where errors are encountered after a huge
 468 page is allocated but before it is instantiated.  In this case, the page
 469 allocation has consumed the reservation and made the appropriate subpool,
 470 reservation map and global count adjustments.  If the page is freed at this
 471 time (before instantiation and clearing of PagePrivate), then free_huge_page
 472 will increment the global reservation count.  However, the reservation map
 473 indicates the reservation was consumed.  This resulting inconsistent state
 474 will cause the 'leak' of a reserved huge page.  The global reserve count will
 475 be higher than it should and prevent allocation of a pre-allocated page.
 476
 477 The routine restore_reserve_on_error() attempts to handle this situation.  It
 478 is fairly well documented.  The intention of this routine is to restore
 479 the reservation map to the way it was before the page allocation.  In this
 480 way, the state of the reservation map will correspond to the global reservation
 481 count after the page is freed.
 482
 483 The routine restore_reserve_on_error itself may encounter errors while
 484 attempting to restore the reservation map entry.  In this case, it will
 485 simply clear the PagePrivate flag of the page.  In this way, the global
 486 reserve count will not be incremented when the page is freed.  However, the
 487 reservation map will continue to look as though the reservation was consumed.
 488 A page can still be allocated for the address, but it will not use a reserved
 489 page as originally intended.
 490
 491 There is some code (most notably userfaultfd) which can not call
 492 restore_reserve_on_error.  In this case, it simply modifies the PagePrivate
 493 flag so that a reservation will not be leaked when the huge page is freed.
 494
 495
 496 Reservations and Memory Policy
 497 ------------------------------
 498 Per-node huge page lists existed in struct hstate when git was first used
 499 to manage Linux code.  The concept of reservations was added some time later.
 500 When reservations were added, no attempt was made to take memory policy
 501 into account.  While cpusets are not exactly the same as memory policy, this
 502 comment in hugetlb_acct_memory sums up the interaction between reservations
 503 and cpusets/memory policy.
 504         /*
 505          * When cpuset is configured, it breaks the strict hugetlb page
 506          * reservation as the accounting is done on a global variable. Such
 507          * reservation is completely rubbish in the presence of cpuset because
 508          * the reservation is not checked against page availability for the
 509          * current cpuset. Application can still potentially OOM'ed by kernel
 510          * with lack of free htlb page in cpuset that the task is in.
 511          * Attempt to enforce strict accounting with cpuset is almost
 512          * impossible (or too ugly) because cpuset is too fluid that
 513          * task or memory node can be dynamically moved between cpusets.
 514          *
 515          * The change of semantics for shared hugetlb mapping with cpuset is
 516          * undesirable. However, in order to preserve some of the semantics,
 517          * we fall back to check against current free page availability as
 518          * a best attempt and hopefully to minimize the impact of changing
 519          * semantics that cpuset has.
 520          */
 521
 522 Huge page reservations were added to prevent unexpected page allocation
 523 failures (OOM) at page fault time.  However, if an application makes use
 524 of cpusets or memory policy there is no guarantee that huge pages will be
 525 available on the required nodes.  This is true even if there are a sufficient
 526 number of global reservations.
 527
 528
 529 Mike Kravetz, 7 April 2017