sched/numa: Limit the amount of virtual memory scanned in task_numa_work()
author: Rik van Riel <riel@redhat.com>
Fri, 11 Sep 2015 13:00:27 +0000 (09:00 -0400)
committer: Ingo Molnar <mingo@kernel.org>
Fri, 18 Sep 2015 07:23:14 +0000 (09:23 +0200)
Currently task_numa_work() scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for NUMA hint faulting.

It will skip over arbitrarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (e.g. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work() scans in each invocation. This limit is
higher than "pages" (8 times "pages" in this patch), to ensure the task
still skips over unused areas fairly quickly.
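
The two budgets can be illustrated with a small userspace model. This is
only a sketch, not the kernel code: struct chunk, scan_chunks() and the
has_updatable_pte flag are hypothetical stand-ins for the ranges handed to
change_prot_numa() and for whether it updated any PTE in them. The factor
of 8 is the one this patch uses for "virtpages".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one scanned range (not a kernel structure). */
struct chunk {
	long nr_pages;           /* size of the range, in pages */
	bool has_updatable_pte;  /* would change_prot_numa() update any PTE? */
};

/*
 * Walk the chunks until either budget is used up and return how many
 * chunks were visited. "pages" is only charged for chunks in which a
 * PTE was updated; "virtpages" is charged for every chunk.
 */
size_t scan_chunks(const struct chunk *chunks, size_t n,
		   long pages, long virt_factor)
{
	long virtpages;
	size_t visited = 0;

	assert(virt_factor >= 1);
	virtpages = pages * virt_factor;

	if (pages <= 0)
		return 0;

	for (size_t i = 0; i < n; i++) {
		visited++;
		if (chunks[i].has_updatable_pte)
			pages -= chunks[i].nr_pages;
		virtpages -= chunks[i].nr_pages;

		if (pages <= 0 || virtpages <= 0)
			break;
	}
	return visited;
}
```

In this model, a run of chunks with nothing to update stops after
8 * pages worth of address space, while fully populated chunks stop after
"pages" worth, as before.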

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with PTEs in them.
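
The effect of changing "+=" to "=" can be seen in a simplified model of
the accounting. This is a sketch under assumptions: the function names
and the updates[] array (the number of PTEs updated in each range) are
hypothetical, and the early exit from the loop is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Old accounting: nr_pte_updates accumulates, so once any range has
 * updated a PTE, every later range is charged against "pages" too,
 * even if it contained nothing to update.
 */
long pages_left_old(const unsigned long *updates, size_t n,
		    long chunk_pages, long pages)
{
	unsigned long nr_pte_updates = 0;

	assert(chunk_pages > 0);
	for (size_t i = 0; i < n; i++) {
		nr_pte_updates += updates[i];
		if (nr_pte_updates)
			pages -= chunk_pages;
	}
	return pages;
}

/*
 * New accounting: nr_pte_updates reflects only the current range, so
 * only ranges that actually had PTEs updated are charged.
 */
long pages_left_new(const unsigned long *updates, size_t n,
		    long chunk_pages, long pages)
{
	unsigned long nr_pte_updates;

	assert(chunk_pages > 0);
	for (size_t i = 0; i < n; i++) {
		nr_pte_updates = updates[i];
		if (nr_pte_updates)
			pages -= chunk_pages;
	}
	return pages;
}
```

With one populated range followed by three empty ones, the old logic
charges all four ranges while the new logic charges only the first.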

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150911090027.4a7987bd@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel/sched/fair.c

index 9176f7c588a8b36ddc65deee43260a0cf9c43d46..1bfad9f39a2f0727076c243d760bd17fbebcf89d 100644
@@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
        struct vm_area_struct *vma;
        unsigned long start, end;
        unsigned long nr_pte_updates = 0;
-       long pages;
+       long pages, virtpages;
 
        WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
        start = mm->numa_scan_offset;
        pages = sysctl_numa_balancing_scan_size;
        pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+       virtpages = pages * 8;     /* Scan up to this much virtual space */
        if (!pages)
                return;
 
+
        down_read(&mm->mmap_sem);
        vma = find_vma(mm, start);
        if (!vma) {
@@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
                        start = max(start, vma->vm_start);
                        end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
                        end = min(end, vma->vm_end);
-                       nr_pte_updates += change_prot_numa(vma, start, end);
+                       nr_pte_updates = change_prot_numa(vma, start, end);
 
                        /*
-                        * Scan sysctl_numa_balancing_scan_size but ensure that
-                        * at least one PTE is updated so that unused virtual
-                        * address space is quickly skipped.
+                        * Try to scan sysctl_numa_balancing_scan_size worth of
+                        * hpages that have at least one present PTE that
+                        * is not already pte-numa. If the VMA contains
+                        * areas that are unused or already full of prot_numa
+                        * PTEs, scan up to virtpages, to skip through those
+                        * areas faster.
                         */
                        if (nr_pte_updates)
                                pages -= (end - start) >> PAGE_SHIFT;
+                       virtpages -= (end - start) >> PAGE_SHIFT;
 
                        start = end;
-                       if (pages <= 0)
+                       if (pages <= 0 || virtpages <= 0)
                                goto out;
 
                        cond_resched();