Documentation/scheduler/sched-energy.txt

   1                            =======================
   2                            Energy Aware Scheduling
   3                            =======================
   4
   5 1. Introduction
   6 ---------------
   7
   8 Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
   9 the impact of its decisions on the energy consumed by CPUs. EAS relies on an
  10 Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
  11 with a minimal impact on throughput. This document aims at providing an
  12 introduction on how EAS works, what the main design decisions behind it are,
  13 and what is needed to get it to run.
  14
  15 Before going any further, please note that at the time of writing:
  16
  17    /!\ EAS does not support platforms with symmetric CPU topologies /!\
  18
  19 EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
  20 because this is where the potential for saving energy through scheduling is
  21 the highest.
  22
  23 The actual EM used by EAS is _not_ maintained by the scheduler, but by a
  24 dedicated framework. For details about this framework and what it provides,
  25 please refer to its documentation (see Documentation/power/energy-model.txt).
  26
  27
  28 2. Background and Terminology
  29 -----------------------------
  30
  31 To make it clear from the start:
  32  - energy = [joule] (resource like a battery on powered devices)
  33  - power = energy/time = [joule/second] = [watt]
  34
  35 The goal of EAS is to minimize energy, while still getting the job done. That
  36 is, we want to maximize:
  37
  38         performance [inst/s]
  39         --------------------
  40             power [W]
  41
  42 which is equivalent to minimizing:
  43
  44         energy [J]
  45         -----------
  46         instruction
  47
  48 while still getting 'good' performance. It is essentially an alternative
  49 optimization objective to the current performance-only objective for the
  50 scheduler. This alternative considers two objectives: energy-efficiency and
  51 performance.
  52
  53 The idea behind introducing an EM is to allow the scheduler to evaluate the
  54 implications of its decisions rather than blindly applying energy-saving
  55 techniques that may have positive effects only on some platforms. At the same
  56 time, the EM must be as simple as possible to minimize the scheduler latency
  57 impact.
  58
  59 In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
  60 for the scheduler to decide where a task should run (during wake-up), the EM
  61 is used to break the tie between several good CPU candidates and pick the one
  62 that is predicted to yield the best energy consumption without harming the
  63 system's throughput. The predictions made by EAS rely on specific elements of
  64 knowledge about the platform's topology, which include the 'capacity' of CPUs,
  65 and their respective energy costs.
  66
  67
  68 3. Topology information
  69 -----------------------
  70
  71 EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
  72 differentiate CPUs with different computing throughput. The 'capacity' of a CPU
  73 represents the amount of work it can absorb when running at its highest
  74 frequency compared to the most capable CPU of the system. Capacity values are
  75 normalized in a 1024 range, and are comparable with the utilization signals of
  76 tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
  77 to capacity and utilization values, EAS is able to estimate how big/busy a
  78 task/CPU is, and to take this into consideration when evaluating performance vs
  79 energy trade-offs. The capacity of CPUs is provided via arch-specific code
  80 through the arch_scale_cpu_capacity() callback.
  81
  82 The rest of platform knowledge used by EAS is directly read from the Energy
  83 Model (EM) framework. The EM of a platform is composed of a power cost table
  84 per 'performance domain' in the system (see Documentation/power/energy-model.txt
  85 for further details about performance domains).
  86
  87 The scheduler manages references to the EM objects in the topology code when the
  88 scheduling domains are built, or re-built. For each root domain (rd), the
  89 scheduler maintains a singly linked list of all performance domains intersecting
  90 the current rd->span. Each node in the list contains a pointer to a struct
  91 em_perf_domain as provided by the EM framework.
  92
  93 The lists are attached to the root domains in order to cope with exclusive
  94 cpuset configurations. Since the boundaries of exclusive cpusets do not
  95 necessarily match those of performance domains, the lists of different root
  96 domains can contain duplicate elements.
  97
  98 Example 1.
  99     Let us consider a platform with 12 CPUs, split in 3 performance domains
 100     (pd0, pd4 and pd8), organized as follows:
 101
 102                   CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
 103                   PDs:   |--pd0--|--pd4--|---pd8---|
 104                   RDs:   |----rd1----|-----rd2-----|
 105
 106     Now, consider that userspace decided to split the system with two
 107     exclusive cpusets, hence creating two independent root domains, each
 108     containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
 109     above figure. Since pd4 intersects with both rd1 and rd2, it will be
 110     present in the linked list '->pd' attached to each of them:
 111        * rd1->pd: pd0 -> pd4
 112        * rd2->pd: pd4 -> pd8
 113
 114     Please note that the scheduler will create two duplicate list nodes for
 115     pd4 (one for each list). However, both just hold a pointer to the same
 116     shared data structure of the EM framework.
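
As an illustration only, the two lists of Example 1 can be modelled in a few
lines of C. The structure and helper names below (em_perf_domain_sketch,
pd_node, pd_list_len, pd_list_has) are hypothetical stand-ins rather than the
kernel's definitions, and the RCU protection is left out:

```c
#include <stddef.h>

/* Stand-in for the EM framework's struct em_perf_domain (contents omitted). */
struct em_perf_domain_sketch {
	int first_cpu;	/* hypothetical field, only used to tell domains apart */
};

/* One list node per (root domain, performance domain) pair. */
struct pd_node {
	struct em_perf_domain_sketch *em_pd;	/* shared EM object */
	struct pd_node *next;			/* next node of this root domain */
};

/* Count the nodes of a root domain's list. */
static size_t pd_list_len(const struct pd_node *pd)
{
	size_t n = 0;

	for (; pd; pd = pd->next)
		n++;
	return n;
}

/* Return 1 if the list references the given EM object, 0 otherwise. */
static int pd_list_has(const struct pd_node *pd,
		       const struct em_perf_domain_sketch *em)
{
	for (; pd; pd = pd->next)
		if (pd->em_pd == em)
			return 1;
	return 0;
}
```

Building rd1 as pd0 -> pd4 and rd2 as pd4 -> pd8 with these types yields two
distinct pd4 nodes whose em_pd pointers are equal, which is the duplication
described above.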
 117
 118 Since the access to these lists can happen concurrently with hotplug and other
 119 things, they are protected by RCU, like the rest of topology structures
 120 manipulated by the scheduler.
 121
 122 EAS also maintains a static key (sched_energy_present) which is enabled when at
 123 least one root domain meets all conditions for EAS to start. Those conditions
 124 are summarized in Section 6.
 125
 126
 127 4. Energy-Aware task placement
 128 ------------------------------
 129
 130 EAS overrides the CFS task wake-up balancing code. It uses the EM of the
 131 platform and the PELT signals to choose an energy-efficient target CPU during
 132 wake-up balance. When EAS is enabled, select_task_rq_fair() calls
 133 find_energy_efficient_cpu() to do the placement decision. This function looks
 134 for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
 135 each performance domain since it is the one which will allow us to keep the
 136 frequency the lowest. Then, the function checks if placing the task there could
 137 save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
 138 in its previous activation.
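
The first step, finding the CPU with the most spare capacity in a performance
domain, can be sketched as follows. This is a simplified illustration which
assumes that the capacities and utilizations of one domain are given as plain
arrays; max_spare_cap_cpu() is a hypothetical name, not the kernel function:

```c
#include <stddef.h>

/*
 * Return the index of the CPU with the most spare capacity
 * (capacity - utilization) among the n CPUs of one performance domain,
 * or -1 if n is 0. Ties go to the lowest index.
 */
static int max_spare_cap_cpu(const unsigned long *cap,
			     const unsigned long *util, size_t n)
{
	long best_spare = -1;
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		/* A CPU can be over-committed; clamp spare capacity at 0. */
		long spare = cap[i] > util[i] ? (long)(cap[i] - util[i]) : 0;

		if (spare > best_spare) {
			best_spare = spare;
			best = (int)i;
		}
	}
	return best;
}
```

The utilization passed in is the one the CPUs would have without the waking
task, since the goal is to find where the task would fit best.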
 139
 140 find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
 141 energy consumed by the system if the waking task was migrated. compute_energy()
 142 looks at the current utilization landscape of the CPUs and adjusts it to
 143 'simulate' the task migration. The EM framework provides the em_pd_energy() API
 144 which computes the expected energy consumption of each performance domain for
 145 the given utilization landscape.
 146
 147 An example of energy-optimized task placement decision is detailed below.
 148
 149 Example 2.
 150     Let us consider a (fake) platform with 2 independent performance domains
 151     composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
 152     are big.
 153
 154     The scheduler must decide where to place a task P whose util_avg = 200
 155     and prev_cpu = 0.
 156
 157     The current utilization landscape of the CPUs is depicted on the graph
 158     below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
 159     Each performance domain has three Operating Performance Points (OPPs).
 160     The CPU capacity and power cost associated with each OPP is listed in
 161     the Energy Model table. The util_avg of P is shown on the figures
 162     below as 'PP'.
 163
 164     CPU util.
 165       1024                 - - - - - - -              Energy Model
 166                                                +-----------+-------------+
 167                                                |  Little   |     Big     |
 168        768                 =============       +-----+-----+------+------+
 169                                                | Cap | Pwr | Cap  | Pwr  |
 170                                                +-----+-----+------+------+
 171        512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
 172                              ##     ##         | 341 | 150 | 768  | 800  |
 173        341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
 174              PP              ##     ##         +-----+-----+------+------+
 175        170  -## - - - -      ##     ##
 176              ##     ##       ##     ##
 177            ------------    -------------
 178             CPU0   CPU1     CPU2   CPU3
 179
 180       Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##
 181
 182
 183     find_energy_efficient_cpu() will first look for the CPUs with the
 184     maximum spare capacity in the two performance domains. In this example,
 185     CPU1 and CPU3. Then it will estimate the energy of the system if P was
 186     placed on either of them, and check if that would save some energy
 187     compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
 188     (which is coherent with the behaviour of the schedutil CPUFreq
 189     governor, see Section 6. for more details on this topic).
 190
 191     Case 1. P is migrated to CPU1
 192     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 193
 194       1024                 - - - - - - -
 195
 196                                             Energy calculation:
 197        768                 =============     * CPU0: 200 / 341 * 150 = 88
 198                                              * CPU1: 300 / 341 * 150 = 131
 199                                              * CPU2: 600 / 768 * 800 = 625
 200        512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
 201                              ##     ##          => total_energy = 1364
 202        341  ===========      ##     ##
 203                     PP       ##     ##
 204        170  -## - - PP-      ##     ##
 205              ##     ##       ##     ##
 206            ------------    -------------
 207             CPU0   CPU1     CPU2   CPU3
 208
 209
 210     Case 2. P is migrated to CPU3
 211     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 212
 213       1024                 - - - - - - -
 214
 215                                             Energy calculation:
 216        768                 =============     * CPU0: 200 / 341 * 150 = 88
 217                                              * CPU1: 100 / 341 * 150 = 43
 218                                     PP       * CPU2: 600 / 768 * 800 = 625
 219        512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
 220                              ##     ##          => total_energy = 1485
 221        341  ===========      ##     ##
 222                              ##     ##
 223        170  -## - - - -      ##     ##
 224              ##     ##       ##     ##
 225            ------------    -------------
 226             CPU0   CPU1     CPU2   CPU3
 227
 228
 229     Case 3. P stays on prev_cpu / CPU 0
 230     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 231
 232       1024                 - - - - - - -
 233
 234                                             Energy calculation:
 235        768                 =============     * CPU0: 400 / 512 * 300 = 234
 236                                              * CPU1: 100 / 512 * 300 = 58
 237                                              * CPU2: 600 / 768 * 800 = 625
 238        512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
 239                              ##     ##          => total_energy = 1437
 240        341  -PP - - - -      ##     ##
 241              PP              ##     ##
 242        170  -## - - - -      ##     ##
 243              ##     ##       ##     ##
 244            ------------    -------------
 245             CPU0   CPU1     CPU2   CPU3
 246
 247
 248     From these calculations, Case 1 has the lowest total energy. So CPU 1
 249     is the best candidate from an energy-efficiency standpoint.
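
The arithmetic of the three cases can be reproduced with a small C sketch. It
assumes, as the example does, that each domain runs at the lowest OPP whose
capacity covers its busiest CPU. struct opp, pd_energy() and total_energy()
are names invented for this illustration and do not mirror em_pd_energy().
Because the sketch uses floating point instead of the rounded per-CPU figures
above, its totals differ from the text by one or two units, but the ranking of
the cases is the same:

```c
#include <stddef.h>

struct opp {
	unsigned long cap;	/* capacity reached at this OPP */
	unsigned long pwr;	/* power cost at this OPP */
};

/* Energy Model of Example 2, OPPs sorted by increasing capacity. */
static const struct opp little_opps[] = { {170, 50}, {341, 150}, {512, 300} };
static const struct opp big_opps[] = { {512, 400}, {768, 800}, {1024, 1700} };

/*
 * Estimate the energy of one performance domain: pick the lowest OPP able
 * to serve the busiest CPU, then charge every CPU util / cap * pwr.
 * nr_opps must be at least 1.
 */
static double pd_energy(const struct opp *opps, size_t nr_opps,
			const unsigned long *util, size_t nr_cpus)
{
	unsigned long max_util = 0;
	double energy = 0.0;
	size_t i, o;

	for (i = 0; i < nr_cpus; i++)
		if (util[i] > max_util)
			max_util = util[i];

	/* Stop at the first OPP that fits; otherwise end on the highest one. */
	for (o = 0; o < nr_opps - 1; o++)
		if (opps[o].cap >= max_util)
			break;

	for (i = 0; i < nr_cpus; i++)
		energy += (double)util[i] / opps[o].cap * opps[o].pwr;

	return energy;
}

/* Total energy for CPU0..CPU3, with P already added to its target CPU. */
static double total_energy(const unsigned long util[4])
{
	return pd_energy(little_opps, 3, &util[0], 2) +
	       pd_energy(big_opps, 3, &util[2], 2);
}
```

Calling total_energy() with the utilization vectors {200, 300, 600, 500},
{200, 100, 600, 700} and {400, 100, 600, 500} corresponds to Cases 1, 2 and 3
respectively.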
 250
 251 Big CPUs are generally more power hungry than the little ones and are thus used
 252 mainly when a task doesn't fit the littles. However, little CPUs aren't always
 253 necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
 254 of the little CPUs can be less energy-efficient than the lowest OPPs of the
 255 bigs, for example. So, if the little CPUs happen to have enough utilization at
 256 a specific point in time, a small task waking up at that moment could be better
 257 off executing on the big side in order to save energy, even though it would fit
 258 on the little side.
 259
 260 And even in the case where all OPPs of the big CPUs are less energy-efficient
 261 than those of the little, using the big CPUs for a small task might still, under
 262 specific conditions, save energy. Indeed, placing a task on a little CPU can
 263 result in raising the OPP of the entire performance domain, and that will
 264 increase the cost of the tasks already running there. If the waking task is
 265 placed on a big CPU, its own execution cost might be higher than if it was
 266 running on a little, but it won't impact the other tasks of the little CPUs
 267 which will keep running at a lower OPP. So, when considering the total energy
 268 consumed by CPUs, the extra cost of running that one task on a big core can be
 269 smaller than the cost of raising the OPP on the little CPUs for all the other
 270 tasks.
 271
 272 The examples above would be nearly impossible to get right in a generic way, and
 273 for all platforms, without knowing the cost of running at different OPPs on all
 274 CPUs of the system. Thanks to its EM-based design, EAS should cope with them
 275 correctly without too many troubles. However, in order to ensure a minimal
 276 impact on throughput for high-utilization scenarios, EAS also implements another
 277 mechanism called 'over-utilization'.
 278
 279
 280 5. Over-utilization
 281 -------------------
 282
 283 From a general standpoint, the use-cases where EAS can help the most are those
 284 involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
 285 being run, they will require all of the available CPU capacity, and there isn't
 286 much that can be done by the scheduler to save energy without severely harming
 287 throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
 288 'over-utilized' as soon as they are used at more than 80% of their compute
 289 capacity. As long as no CPUs are over-utilized in a root domain, load balancing
 290 is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
 291 the most energy efficient CPUs of the system more than the others if that can be
 292 done without harming throughput. So, the load-balancer is disabled to prevent
 293 it from breaking the energy-efficient task placement found by EAS. It is safe to
 294 do so when the system isn't overutilized since being below the 80% tipping point
 295 implies that:
 296
 297     a. there is some idle time on all CPUs, so the utilization signals used by
 298        EAS are likely to accurately represent the 'size' of the various tasks
 299        in the system;
 300     b. all tasks should already be provided with enough CPU capacity,
 301        regardless of their nice values;
 302     c. since there is spare capacity, all tasks must be blocking/sleeping
 303        regularly and balancing at wake-up is sufficient.
 304
 305 As soon as one CPU goes above the 80% tipping point, at least one of the three
 306 assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
 307 is raised for the entire root domain, EAS is disabled, and the load-balancer is
 308 re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
 309 wake-up and load balance under CPU-bound conditions. This better respects
 310 the nice values of tasks.
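
The 80% tipping point and its root-domain-wide effect can be pictured with the
following sketch. It is an illustration under simplifying assumptions, not the
scheduler's code: cpu_overutilized() and rd_overutilized() are hypothetical
names, and the utilization values are assumed to already include the capacity
used by other scheduling classes and IRQ:

```c
#include <stddef.h>

/* A CPU is over-utilized above 80% of its capacity: util / cap > 4 / 5. */
static int cpu_overutilized(unsigned long util, unsigned long cap)
{
	return util * 5 > cap * 4;
}

/* The flag covers the whole root domain: one busy CPU is enough to raise it. */
static int rd_overutilized(const unsigned long *util,
			   const unsigned long *cap, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cpu_overutilized(util[i], cap[i]))
			return 1;
	return 0;
}
```

Integer cross-multiplication is used so that the comparison needs neither a
division nor floating point.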
 311
 312 Since the notion of overutilization largely relies on detecting whether or not
 313 there is some idle time in the system, the CPU capacity 'stolen' by higher
 314 (than CFS) scheduling classes (as well as IRQ) must be taken into account. As
 315 such, the detection of overutilization accounts for the capacity used not only
 316 by CFS tasks, but also by the other scheduling classes and IRQ.
 317
 318
 319 6. Dependencies and requirements for EAS
 320 ----------------------------------------
 321
 322 Energy Aware Scheduling depends on the CPUs of the system having specific
 323 hardware properties and on other features of the kernel being enabled. This
 324 section lists these dependencies and provides hints as to how they can be met.
 325
 326
 327   6.1 - Asymmetric CPU topology
 328
 329 As mentioned in the introduction, EAS is only supported on platforms with
 330 asymmetric CPU topologies for now. This requirement is checked at run-time by
 331 looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
 332 domains are built.
 333
 334 The flag is set/cleared automatically by the scheduler topology code whenever
 335 there are CPUs with different capacities in a root domain. The capacities of
 336 CPUs are provided by arch-specific code through the arch_scale_cpu_capacity()
 337 callback. As an example, arm and arm64 share an implementation of this callback
 338 which uses a combination of CPUFreq data and device-tree bindings to compute the
 339 capacity of CPUs (see drivers/base/arch_topology.c for more details).
 340
 341 So, in order to use EAS on your platform your architecture must implement the
 342 arch_scale_cpu_capacity() callback, and some of the CPUs must have a lower
 343 capacity than others.
 344
 345 Please note that EAS is not fundamentally incompatible with SMP, but no
 346 significant savings on SMP platforms have been observed yet. This restriction
 347 could be amended in the future if proven otherwise.
 348
 349
 350   6.2 - Energy Model presence
 351
 352 EAS uses the EM of a platform to estimate the impact of scheduling decisions on
 353 energy. So, your platform must provide power cost tables to the EM framework in
 354 order to make EAS start. To do so, please refer to documentation of the
 355 independent EM framework in Documentation/power/energy-model.txt.
 356
 357 Please also note that the scheduling domains need to be re-built after the
 358 EM has been registered in order to start EAS.
 359
 360
 361   6.3 - Energy Model complexity
 362
 363 The task wake-up path is very latency-sensitive. When the EM of a platform is
 364 too complex (too many CPUs, too many performance domains, too many performance
 365 states, ...), the cost of using it in the wake-up path can become prohibitive.
 366 The energy-aware wake-up algorithm has a complexity of:
 367
 368         C = Nd * (Nc + Ns)
 369
 370 with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
 371 total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
 372
 373 A complexity check is performed at the root domain level, when scheduling
 374 domains are built. EAS will not start on a root domain if its C happens to be
 375 higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
 376 time of writing).
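
The check amounts to the following arithmetic. The sketch below is only meant
to make the formula concrete; em_complexity(), em_too_complex() and the
EM_MAX_COMPLEXITY_SKETCH macro are hypothetical names, and the threshold is
the value quoted above:

```c
/* Threshold quoted in the text (2048 at the time of writing). */
#define EM_MAX_COMPLEXITY_SKETCH 2048UL

/* C = Nd * (Nc + Ns) */
static unsigned long em_complexity(unsigned long nd, unsigned long nc,
				   unsigned long ns)
{
	return nd * (nc + ns);
}

/* EAS does not start when C is strictly higher than the threshold. */
static int em_too_complex(unsigned long nd, unsigned long nc,
			  unsigned long ns)
{
	return em_complexity(nd, nc, ns) > EM_MAX_COMPLEXITY_SKETCH;
}
```

For instance, a root domain with 2 performance domains, 8 CPUs and 4 OPPs per
domain (Ns = 8) gives C = 2 * (8 + 8) = 32, far below the threshold.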
 377
 378 If you really want to use EAS but the complexity of your platform's Energy
 379 Model is too high to be used with a single root domain, you're left with only
 380 two possible options:
 381
 382     1. split your system into separate, smaller, root domains using exclusive
 383        cpusets and enable EAS locally on each of them. This option has the
 384        benefit of working out of the box but the drawback of preventing load
 385        balance between root domains, which can result in an unbalanced system
 386        overall;
 387     2. submit patches to reduce the complexity of the EAS wake-up algorithm,
 388        hence enabling it to cope with larger EMs in reasonable time.
 389
 390
 391   6.4 - Schedutil governor
 392
 393 EAS tries to predict at which OPP the CPUs will be running in the near future
 394 in order to estimate their energy consumption. To do so, it is assumed that OPPs
 395 of CPUs follow their utilization.
 396
 397 Although it is very difficult to provide hard guarantees regarding the accuracy
 398 of this assumption in practice (because the hardware might not do what it is
 399 told to do, for example), schedutil as opposed to other CPUFreq governors at
 400 least _requests_ frequencies calculated using the utilization signals.
 401 Consequently, the only sane governor to use together with EAS is schedutil,
 402 because it is the only one providing some degree of consistency between
 403 frequency requests and energy predictions.
 404
 405 Using EAS with any other governor than schedutil is not supported.
 406
 407
 408   6.5 - Scale-invariant utilization signals
 409
 410 In order to make accurate prediction across CPUs and for all performance
 411 states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
 412 be obtained using the architecture-defined arch_scale_{cpu,freq}_capacity()
 413 callbacks.
 414
 415 Using EAS on a platform that doesn't implement these two callbacks is not
 416 supported.
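
To give an intuition of what invariance means, the sketch below scales an
amount of running time by two factors expressed in the 1024 range used for
capacities. It is a conceptual illustration only and not the PELT
implementation; scale_invariant_delta() and SCALE_SHIFT_SKETCH are
hypothetical names, and freq_scale and cpu_scale stand for the values the two
callbacks are assumed to return:

```c
/* Scale factors are expressed in the [0..1024] range (1024 == 1.0). */
#define SCALE_SHIFT_SKETCH 10

/*
 * Scale a running-time delta by the current frequency relative to the
 * maximum frequency (freq_scale) and by the capacity of the CPU relative to
 * the biggest CPU of the system (cpu_scale).
 */
static unsigned long scale_invariant_delta(unsigned long delta,
					   unsigned long freq_scale,
					   unsigned long cpu_scale)
{
	delta = (delta * freq_scale) >> SCALE_SHIFT_SKETCH;
	delta = (delta * cpu_scale) >> SCALE_SHIFT_SKETCH;
	return delta;
}
```

With such scaling, the same amount of work contributes the same utilization
whether it ran on a little CPU at a low frequency or on a big CPU at its
highest frequency, which is what makes the signals comparable across CPUs and
OPPs.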
 417
 418
 419   6.6 - Multithreading (SMT)
 420
 421 EAS in its current form is SMT unaware and is not able to leverage
 422 multithreaded hardware to save energy. EAS considers threads as independent
 423 CPUs, which can actually be counter-productive for both performance and energy.
 424
 425 EAS on SMT is not supported.