Documentation/admin-guide/pm/cpufreq.rst

   1 .. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
   2 .. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
   3
   4 =======================
   5 CPU Performance Scaling
   6 =======================
   7
   8 ::
   9
  10  Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  11
  12 The Concept of CPU Performance Scaling
  13 ======================================
  14
  15 The majority of modern processors are capable of operating in a number of
  16 different clock frequency and voltage configurations, often referred to as
  17 Operating Performance Points or P-states (in ACPI terminology).  As a rule,
  18 the higher the clock frequency and the higher the voltage, the more instructions
  19 can be retired by the CPU over a unit of time, but also the higher the clock
  20 frequency and the higher the voltage, the more energy is consumed over a unit of
  21 time (or the more power is drawn) by the CPU in the given P-state.  Therefore
  22 there is a natural tradeoff between the CPU capacity (the number of instructions
  23 that can be executed over a unit of time) and the power drawn by the CPU.
  24
  25 In some situations it is desirable or even necessary to run a program as fast
  26 as possible and then there is no reason to use any P-states different from the
  27 highest one (i.e. the highest-performance frequency/voltage configuration
  28 available).  In some other cases, however, it may not be necessary to execute
  29 instructions so quickly and maintaining the highest available CPU capacity for a
  30 relatively long time without utilizing it entirely may be regarded as wasteful.
  31 It also may not be physically possible to maintain maximum CPU capacity for too
  32 long for thermal or power supply capacity reasons or similar.  To cover those
  33 cases, there are hardware interfaces allowing CPUs to be switched between
  34 different frequency/voltage configurations or (in the ACPI terminology) to be
  35 put into different P-states.
  36
  37 Typically, those interfaces are used along with algorithms to estimate the required CPU
  38 capacity, so as to decide which P-states to put the CPUs into.  Of course, since
  39 the utilization of the system generally changes over time, that has to be done
  40 repeatedly on a regular basis.  The activity by which this happens is referred
  41 to as CPU performance scaling or CPU frequency scaling (because it involves
  42 adjusting the CPU clock frequency).
  43
  44
  45 CPU Performance Scaling in Linux
  46 ================================
  47
  48 The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
  49 (CPU Frequency scaling) subsystem that consists of three layers of code: the
  50 core, scaling governors and scaling drivers.
  51
  52 The ``CPUFreq`` core provides the common code infrastructure and user space
  53 interfaces for all platforms that support CPU performance scaling.  It defines
  54 the basic framework in which the other components operate.
  55
  56 Scaling governors implement algorithms to estimate the required CPU capacity.
  57 As a rule, each governor implements one, possibly parametrized, scaling
  58 algorithm.
  59
  60 Scaling drivers talk to the hardware.  They provide scaling governors with
  61 information on the available P-states (or P-state ranges in some cases) and
  62 access platform-specific hardware interfaces to change CPU P-states as requested
  63 by scaling governors.
  64
  65 In principle, all available scaling governors can be used with every scaling
  66 driver.  That design is based on the observation that the information used by
  67 performance scaling algorithms for P-state selection can be represented in a
  68 platform-independent form in the majority of cases, so it should be possible
  69 to use the same performance scaling algorithm implemented in exactly the same
  70 way regardless of which scaling driver is used.  Consequently, the same set of
  71 scaling governors should be suitable for every supported platform.
  72
  73 However, that observation may not hold for performance scaling algorithms
  74 based on information provided by the hardware itself, for example through
  75 feedback registers, as that information is typically specific to the hardware
  76 interface it comes from and may not be easily represented in an abstract,
  77 platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
  78 to bypass the governor layer and implement their own performance scaling
  79 algorithms.  That is done by the |intel_pstate| scaling driver.
  80
  81
  82 ``CPUFreq`` Policy Objects
  83 ==========================
  84
  85 In some cases the hardware interface for P-state control is shared by multiple
  86 CPUs.  That is, for example, the same register (or set of registers) is used to
  87 control the P-state of multiple CPUs at the same time and writing to it affects
  88 all of those CPUs simultaneously.
  89
  90 Sets of CPUs sharing hardware P-state control interfaces are represented by
  91 ``CPUFreq`` as |struct cpufreq_policy| objects.  For consistency,
  92 |struct cpufreq_policy| is also used when there is only one CPU in the given
  93 set.
  94
  95 The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
  96 every CPU in the system, including CPUs that are currently offline.  If multiple
  97 CPUs share the same hardware P-state control interface, all of the pointers
  98 corresponding to them point to the same |struct cpufreq_policy| object.
  99
 100 ``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
 101 of its user space interface is based on the policy concept.
 102
 103
 104 CPU Initialization
 105 ==================
 106
 107 First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
 108 It is only possible to register one scaling driver at a time, so the scaling
 109 driver is expected to be able to handle all CPUs in the system.
 110
 111 The scaling driver may be registered before or after CPU registration.  If
 112 CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
 113 take note of all of the already registered CPUs during the registration of the
 114 scaling driver.  In turn, if any CPUs are registered after the registration of
 115 the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
 116 at their registration time.
 117
 118 In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
 119 has not seen so far as soon as it is ready to handle that CPU.  [Note that the
 120 logical CPU may be a physical single-core processor, or a single core in a
 121 multicore processor, or a hardware thread in a physical processor or processor
 122 core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
 123 otherwise and the word "processor" is used to refer to the physical part
 124 possibly including multiple logical CPUs.]
 125
 126 Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
 127 for the given CPU and if so, it skips the policy object creation.  Otherwise,
 128 a new policy object is created and initialized, which involves the creation of
 129 a new policy directory in ``sysfs``, and the policy pointer corresponding to
 130 the given CPU is set to the new policy object's address in memory.
 131
 132 Next, the scaling driver's ``->init()`` callback is invoked with the policy
 133 pointer of the new CPU passed to it as the argument.  That callback is expected
 134 to initialize the performance scaling hardware interface for the given CPU (or,
 135 more precisely, for the set of CPUs sharing the hardware interface it belongs
 136 to, represented by its policy object) and, if the policy object it has been
 137 called for is new, to set parameters of the policy, like the minimum and maximum
 138 frequencies supported by the hardware, the table of available frequencies (if
 139 the set of supported P-states is not a continuous range), and the mask of CPUs
 140 that belong to the same policy (including both online and offline CPUs).  That
 141 mask is then used by the core to populate the policy pointers for all of the
 142 CPUs in it.
 143
 144 The next major initialization step for a new policy object is to attach a
 145 scaling governor to it (to begin with, that is the default scaling governor
 146 determined by the kernel configuration, but it may be changed later
 147 via ``sysfs``).  First, a pointer to the new policy object is passed to the
 148 governor's ``->init()`` callback which is expected to initialize all of the
 149 data structures necessary to handle the given policy and, possibly, to add
 150 a governor ``sysfs`` interface to it.  Next, the governor is started by
 151 invoking its ``->start()`` callback.
 152
 153 That callback is expected to register per-CPU utilization update callbacks for
 154 all of the online CPUs belonging to the given policy with the CPU scheduler.
 155 The utilization update callbacks will be invoked by the CPU scheduler on
 156 important events, like task enqueue and dequeue, on every iteration of the
 157 scheduler tick or generally whenever the CPU utilization may change (from the
 158 scheduler's perspective).  They are expected to carry out computations needed
 159 to determine the P-state to use for the given policy going forward and to
 160 invoke the scaling driver to make changes to the hardware in accordance with
 161 the P-state selection.  The scaling driver may be invoked directly from
 162 scheduler context or asynchronously, via a kernel thread or workqueue, depending
 163 on the configuration and capabilities of the scaling driver and the governor.
 164
 165 Similar steps are taken for policy objects that are not new, but were "inactive"
 166 previously, meaning that all of the CPUs belonging to them were offline.  The
 167 only practical difference in that case is that the ``CPUFreq`` core will attempt
 168 to use the scaling governor previously used with the policy that became
 169 "inactive" (and is re-initialized now) instead of the default governor.
 170
 171 In turn, if a previously offline CPU is being brought back online, but some
 172 other CPUs sharing the policy object with it are online already, there is no
 173 need to re-initialize the policy object at all.  In that case, it only is
 174 necessary to restart the scaling governor so that it can take the new online CPU
 175 into account.  That is achieved by invoking the governor's ``->stop()`` and
 176 ``->start()`` callbacks, in this order, for the entire policy.
 177
 178 As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
 179 governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
 180 Consequently, if |intel_pstate| is used, scaling governors are not attached to
 181 new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
 182 to register per-CPU utilization update callbacks for each policy.  These
 183 callbacks are invoked by the CPU scheduler in the same way as for scaling
 184 governors, but in the |intel_pstate| case they both determine the P-state to
 185 use and change the hardware configuration accordingly in one go from scheduler
 186 context.
 187
 188 The policy objects created during CPU initialization and other data structures
 189 associated with them are torn down when the scaling driver is unregistered
 190 (which happens when the kernel module containing it is unloaded, for example) or
 191 when the last CPU belonging to the given policy is unregistered.
 192
 193
 194 Policy Interface in ``sysfs``
 195 =============================
 196
 197 During the initialization of the kernel, the ``CPUFreq`` core creates a
 198 ``sysfs`` directory (kobject) called ``cpufreq`` under
 199 :file:`/sys/devices/system/cpu/`.
 200
 201 That directory contains a ``policyX`` subdirectory (where ``X`` represents an
 202 integer number) for every policy object maintained by the ``CPUFreq`` core.
 203 Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
 204 under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
 205 that may be different from the one represented by ``X``) for all of the CPUs
 206 associated with (or belonging to) the given policy.  The ``policyX`` directories
 207 in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
 208 attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
 209 objects (that is, for all of the CPUs associated with them).
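
As a rough illustration of that layout, the shell sketch below resolves the
``cpufreq`` symbolic link of a CPU to the name of its policy directory.  The
``policy_of_cpu`` function name and its optional second argument (an
alternative root directory, convenient for experimenting outside of a live
system) are assumptions made for this example only.

```shell
# Print the name of the policy directory (e.g. "policy0") that the
# cpufreq symbolic link of CPU number $1 points to.
# $2 is a hypothetical override of the CPU sysfs root, used for testing.
policy_of_cpu() (
    cpu_root=${2:-/sys/devices/system/cpu}
    link="$cpu_root/cpu$1/cpufreq"
    # Fail if the CPU has no cpufreq link (e.g. no scaling driver is in use).
    [ -L "$link" ] || exit 1
    # Read the link target and keep only its last path component.
    target=$(readlink "$link") || exit 1
    basename "$target"
)
```

On a running system, ``policy_of_cpu 3`` would print the policy that CPU 3
belongs to, and CPUs sharing a policy would all report the same name.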
 210
 211 Some of those attributes are generic.  They are created by the ``CPUFreq`` core
 212 and their behavior generally does not depend on what scaling driver is in use
 213 and what scaling governor is attached to the given policy.  Some scaling drivers
 214 also add driver-specific attributes to the policy directories in ``sysfs`` to
 215 control policy-specific aspects of driver behavior.
 216
 217 The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
 218 are the following:
 219
 220 ``affected_cpus``
 221         List of online CPUs belonging to this policy (i.e. sharing the hardware
 222         performance scaling interface represented by the ``policyX`` policy
 223         object).
 224
 225 ``bios_limit``
 226         If the platform firmware (BIOS) tells the OS to apply an upper limit to
 227         CPU frequencies, that limit will be reported through this attribute (if
 228         present).
 229
 230         The existence of the limit may be a result of some (often unintentional)
 231         BIOS settings, restrictions coming from a service processor or other
 232         BIOS/HW-based mechanisms.
 233
 234         This does not cover ACPI thermal limitations which can be discovered
 235         through a generic thermal driver.
 236
 237         This attribute is not present if the scaling driver in use does not
 238         support it.
 239
 240 ``cpuinfo_cur_freq``
 241         Current frequency of the CPUs belonging to this policy as obtained from
 242         the hardware (in kHz).
 243
 244         This is expected to be the frequency the hardware actually runs at.
 245         If that frequency cannot be determined, this attribute should not
 246         be present.
 247
 248 ``cpuinfo_max_freq``
 249         Maximum possible operating frequency the CPUs belonging to this policy
 250         can run at (in kHz).
 251
 252 ``cpuinfo_min_freq``
 253         Minimum possible operating frequency the CPUs belonging to this policy
 254         can run at (in kHz).
 255
 256 ``cpuinfo_transition_latency``
 257         The time it takes to switch the CPUs belonging to this policy from one
 258         P-state to another, in nanoseconds.
 259
 260         If unknown or if known to be so high that the scaling driver does not
 261         work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
 262         will be returned by reads from this attribute.
 263
 264 ``related_cpus``
 265         List of all (online and offline) CPUs belonging to this policy.
 266
 267 ``scaling_available_governors``
 268         List of ``CPUFreq`` scaling governors present in the kernel that can
 269         be attached to this policy or (if the |intel_pstate| scaling driver is
 270         in use) list of scaling algorithms provided by the driver that can be
 271         applied to this policy.
 272
 273         [Note that some governors are modular and it may be necessary to load a
 274         kernel module for the governor held by it to become available and be
 275         listed by this attribute.]
 276
 277 ``scaling_cur_freq``
 278         Current frequency of all of the CPUs belonging to this policy (in kHz).
 279
 280         In the majority of cases, this is the frequency of the last P-state
 281         requested by the scaling driver from the hardware using the scaling
 282         interface provided by it, which may or may not reflect the frequency
 283         the CPU is actually running at (due to hardware design and other
 284         limitations).
 285
 286         Some architectures (e.g. ``x86``) may attempt to provide information
 287         more precisely reflecting the current CPU frequency through this
 288         attribute, but that still may not be the exact current CPU frequency as
 289         seen by the hardware at the moment.
 290
 291 ``scaling_driver``
 292         The scaling driver currently in use.
 293
 294 ``scaling_governor``
 295         The scaling governor currently attached to this policy or (if the
 296         |intel_pstate| scaling driver is in use) the scaling algorithm
 297         provided by the driver that is currently applied to this policy.
 298
 299         This attribute is read-write and writing to it will cause a new scaling
 300         governor to be attached to this policy or a new scaling algorithm
 301         provided by the scaling driver to be applied to it (in the
 302         |intel_pstate| case), as indicated by the string written to this
 303         attribute (which must be one of the names listed by the
 304         ``scaling_available_governors`` attribute described above).
 305
 306 ``scaling_max_freq``
 307         Maximum frequency the CPUs belonging to this policy are allowed to be
 308         running at (in kHz).
 309
 310         This attribute is read-write and writing a string representing an
 311         integer to it will cause a new limit to be set (it must not be lower
 312         than the value of the ``scaling_min_freq`` attribute).
 313
 314 ``scaling_min_freq``
 315         Minimum frequency the CPUs belonging to this policy are allowed to be
 316         running at (in kHz).
 317
 318         This attribute is read-write and writing a string representing a
 319         non-negative integer to it will cause a new limit to be set (it must not
 320         be higher than the value of the ``scaling_max_freq`` attribute).
 321
 322 ``scaling_setspeed``
 323         This attribute is functional only if the `userspace`_ scaling governor
 324         is attached to the given policy.
 325
 326         It returns the last frequency requested by the governor (in kHz) or can
 327         be written to in order to set a new frequency for the policy.
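
The attributes listed above are plain text files, so they can be inspected and
changed with ordinary shell tools.  The following sketch prints the scaling
limits of a policy and sets a new upper limit while enforcing the
``scaling_min_freq`` constraint described above.  The function names are
hypothetical and the policy directory is passed in explicitly, so nothing here
is specific to a particular scaling driver.

```shell
# Print the scaling limits of the policy directory $1 in MHz.
policy_limits_mhz() (
    dir=$1
    min=$(cat "$dir/scaling_min_freq") || exit 1
    max=$(cat "$dir/scaling_max_freq") || exit 1
    # The attributes are in kHz; integer division truncates to whole MHz.
    echo "$((min / 1000))-$((max / 1000)) MHz"
)

# Write a new upper limit ($2, in kHz) to the policy directory $1,
# refusing values lower than the current scaling_min_freq.
set_scaling_max_freq() (
    dir=$1
    new=$2
    min=$(cat "$dir/scaling_min_freq") || exit 1
    [ "$new" -ge "$min" ] || exit 1
    echo "$new" > "$dir/scaling_max_freq"
)
```

For example, ``policy_limits_mhz /sys/devices/system/cpu/cpufreq/policy0``
reports the limits of ``policy0``.  Writing to the attributes of a live system
generally requires root privileges.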
 328
 329
 330 Generic Scaling Governors
 331 =========================
 332
 333 ``CPUFreq`` provides generic scaling governors that can be used with all
 334 scaling drivers.  As stated before, each of them implements a single, possibly
 335 parametrized, performance scaling algorithm.
 336
 337 Scaling governors are attached to policy objects and different policy objects
 338 can be handled by different scaling governors at the same time (although that
 339 may lead to suboptimal results in some cases).
 340
 341 The scaling governor for a given policy object can be changed at any time with
 342 the help of the ``scaling_governor`` policy attribute in ``sysfs``.
 343
 344 Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
 345 algorithms implemented by them.  Those attributes, referred to as governor
 346 tunables, can be either global (system-wide) or per-policy, depending on the
 347 scaling driver in use.  If the driver requires governor tunables to be
 348 per-policy, they are located in a subdirectory of each policy directory.
 349 Otherwise, they are located in a subdirectory under
 350 :file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
 351 subdirectory containing the governor tunables is the name of the governor
 352 providing them.
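
The lookup rule for governor tunables can be sketched in shell as follows.
The ``governor_tunables_dir`` name and the optional third argument (an
alternative ``cpufreq`` root directory for testing) are assumptions of this
example, which simply checks the per-policy location first and the global one
second.

```shell
# Print the directory holding the tunables of governor $2 for the policy
# directory name $1 (e.g. "policy0").
# $3 is a hypothetical override of the cpufreq sysfs root, used for testing.
governor_tunables_dir() (
    root=${3:-/sys/devices/system/cpu/cpufreq}
    if [ -d "$root/$1/$2" ]; then
        # Per-policy tunables: a subdirectory of the policy directory.
        echo "$root/$1/$2"
    elif [ -d "$root/$2" ]; then
        # Global tunables: a subdirectory of the cpufreq directory itself.
        echo "$root/$2"
    else
        # The governor exposes no tunables here (or is not in use).
        exit 1
    fi
)
```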
 353
 354 ``performance``
 355 ---------------
 356
 357 When attached to a policy object, this governor causes the highest frequency,
 358 within the ``scaling_max_freq`` policy limit, to be requested for that policy.
 359
 360 The request is made once at the time the governor for the policy is set to
 361 ``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
 362 policy limits change after that.
 363
 364 ``powersave``
 365 -------------
 366
 367 When attached to a policy object, this governor causes the lowest frequency,
 368 within the ``scaling_min_freq`` policy limit, to be requested for that policy.
 369
 370 The request is made once at the time the governor for the policy is set to
 371 ``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
 372 policy limits change after that.
 373
 374 ``userspace``
 375 -------------
 376
 377 This governor does not do anything by itself.  Instead, it allows user space
 378 to set the CPU frequency for the policy it is attached to by writing to the
 379 ``scaling_setspeed`` attribute of that policy.
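
A minimal sketch of how user space might drive this governor is shown below.
It only uses the ``scaling_available_governors``, ``scaling_governor`` and
``scaling_setspeed`` attributes described earlier.  The ``set_userspace_freq``
function name is hypothetical, and whether the requested frequency is accepted
as given depends on the scaling driver and the policy limits.

```shell
# Attach the userspace governor to the policy directory $1 and request
# the frequency $2 (in kHz) through scaling_setspeed.
set_userspace_freq() (
    dir=$1
    khz=$2
    # The governor must be listed by scaling_available_governors.
    grep -qw userspace "$dir/scaling_available_governors" || exit 1
    echo userspace > "$dir/scaling_governor" || exit 1
    echo "$khz" > "$dir/scaling_setspeed"
)
```

For example,
``set_userspace_freq /sys/devices/system/cpu/cpufreq/policy0 1500000``
asks for 1.5 GHz on ``policy0`` (root privileges are generally needed).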
 380
 381 ``schedutil``
 382 -------------
 383
 384 This governor uses CPU utilization data available from the CPU scheduler.  It
 385 generally is regarded as a part of the CPU scheduler, so it can access the
 386 scheduler's internal data structures directly.
 387
 388 It runs entirely in scheduler context, although in some cases it may need to
 389 invoke the scaling driver asynchronously when it decides that the CPU frequency
 390 should be changed for a given policy (that depends on whether or not the driver
 391 is capable of changing the CPU frequency from scheduler context).
 392
 393 The actions of this governor for a particular CPU depend on the scheduling class
 394 invoking its utilization update callback for that CPU.  If it is invoked by the
 395 RT or deadline scheduling classes, the governor will increase the frequency to
 396 the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
 397 if it is invoked by the CFS scheduling class, the governor will use the
 398 Per-Entity Load Tracking (PELT) metric for the root control group of the
 399 given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
 400 LWN.net article for a description of the PELT mechanism).  Then, the new
 401 CPU frequency to apply is computed in accordance with the formula
 402
 403         f = 1.25 * ``f_0`` * ``util`` / ``max``
 404
 405 where ``util`` is the PELT number, ``max`` is the theoretical maximum of
 406 ``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
 407 policy (if the PELT number is frequency-invariant), or the current CPU frequency
 408 (otherwise).
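
The formula can be tried out with shell integer arithmetic.  The sketch below
is only an illustration of the arithmetic, not of the governor's code: the
``schedutil_next_freq`` name is hypothetical and the result is not clamped to
the policy limits.

```shell
# Compute f = 1.25 * f_0 * util / max using integer arithmetic.
# $1 = f_0 (in kHz), $2 = util, $3 = max (theoretical maximum of util).
schedutil_next_freq() {
    # f_0 + f_0 / 4 equals 1.25 * f_0 (up to integer truncation).
    echo $(( ($1 + $1 / 4) * $2 / $3 ))
}
```

With ``f_0`` equal to 2000000 kHz, ``schedutil_next_freq 2000000 512 1024``
yields 1250000 kHz.  Because of the 1.25 factor, the computed frequency
reaches ``f_0`` already when ``util`` is 80% of ``max``.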
 409
 410 This governor also employs a mechanism allowing it to temporarily bump up the
 411 CPU frequency for tasks that have been waiting on I/O most recently, called
 412 "IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
 413 is passed by the scheduler to the governor callback which causes the frequency
 414 to go up to the allowed maximum immediately and then draw back to the value
 415 returned by the above formula over time.
 416
 417 This governor exposes only one tunable:
 418
 419 ``rate_limit_us``
 420         Minimum time (in microseconds) that has to pass between two consecutive
 421         runs of governor computations (default: 1000 times the scaling driver's
 422         transition latency).
 423
 424         The purpose of this tunable is to reduce the scheduler context overhead
 425         of the governor which might be excessive without it.
 426
 427 This governor generally is regarded as a replacement for the older `ondemand`_
 428 and `conservative`_ governors (described below), as it is simpler and more
 429 tightly integrated with the CPU scheduler, its overhead in terms of CPU context
 430 switches and similar is less significant, and it uses the scheduler's own CPU
 431 utilization metric, so in principle its decisions should not contradict the
 432 decisions made by the other parts of the scheduler.
 433
 434 ``ondemand``
 435 ------------
 436
 437 This governor uses CPU load as a CPU frequency selection metric.
 438
 439 In order to estimate the current CPU load, it measures the time elapsed between
 440 consecutive invocations of its worker routine and computes the fraction of that
 441 time in which the given CPU was not idle.  The ratio of the non-idle (active)
 442 time to the total CPU time is taken as an estimate of the load.
 443
 444 If this governor is attached to a policy shared by multiple CPUs, the load is
 445 estimated for all of them and the greatest result is taken as the load estimate
 446 for the entire policy.
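
The load estimate described above amounts to simple arithmetic, sketched here
in shell.  Both function names are hypothetical, the time values are assumed
to be in the same (arbitrary) unit, and the elapsed time is assumed to be
nonzero.

```shell
# Estimate the load (in percent) of one CPU from the total time ($1) and
# the idle time ($2) elapsed since the previous sample.
cpu_load_percent() {
    echo $(( 100 * ($1 - $2) / $1 ))
}

# Load estimate for a whole policy: the greatest of the per-CPU loads
# passed as arguments.
policy_load_percent() (
    max=0
    for load in "$@"; do
        [ "$load" -gt "$max" ] && max=$load
    done
    echo "$max"
)
```

For instance, a CPU that was idle for 2500 out of 10000 time units has a load
of 75, and a policy whose CPUs report loads of 10, 75 and 40 is treated as
having a load of 75.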
 447
 448 The worker routine of this governor has to run in process context, so it is
 449 invoked asynchronously (via a workqueue) and CPU P-states are updated from
 450 there if necessary.  As a result, the scheduler context overhead from this
 451 governor is minimal, but it causes additional CPU context switches to happen
 452 relatively often and the CPU P-state updates triggered by it can be relatively
 453 irregular.  Also, it affects its own CPU load metric by running code that
 454 reduces the CPU idle time (even though the CPU idle time is only reduced very
 455 slightly by it).
 456
 457 It generally selects CPU frequencies proportional to the estimated load, so that
 458 the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
 459 1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
 460 corresponds to the load of 0, unless the load exceeds a (configurable)
 461 speedup threshold, in which case it will go straight for the highest frequency
 462 it is allowed to use (the ``scaling_max_freq`` policy limit).
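
One way to read that rule is as a linear mapping from the load to the
frequency range, with a shortcut above the speedup threshold.  The shell
sketch below follows that reading.  The ``ondemand_target_freq`` name and the
argument order are assumptions, and the result of the linear branch is not
clamped to the ``scaling_min_freq`` and ``scaling_max_freq`` limits here.

```shell
# $1 = estimated load (percent), $2 = speedup threshold (percent),
# $3 = cpuinfo_min_freq, $4 = cpuinfo_max_freq, $5 = scaling_max_freq (kHz).
ondemand_target_freq() {
    if [ "$1" -gt "$2" ]; then
        # Above the threshold: go straight for the highest allowed frequency.
        echo "$5"
    else
        # Linear map: load 0 -> cpuinfo_min_freq, load 100 -> cpuinfo_max_freq.
        echo $(( $3 + $1 * ($4 - $3) / 100 ))
    fi
}
```

For a policy spanning 800000 to 3000000 kHz, a load of 50 maps to 1900000 kHz,
while any load above the threshold selects the ``scaling_max_freq`` value
directly.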
 463
 464 This governor exposes the following tunables:
 465
 466 ``sampling_rate``
 467         This is how often the governor's worker routine should run, in
 468         microseconds.
 469
 470         Typically, it is set to values of the order of 10000 (10 ms).  Its
 471         default value is equal to the value of ``cpuinfo_transition_latency``
 472         for each policy this governor is attached to (but since the unit here
 473         is greater by 1000, this means that the time represented by
 474         ``sampling_rate`` is 1000 times greater than the transition latency by
 475         default).
 476
 477         If this tunable is per-policy, the following shell command sets the time
 478         represented by it to be 750 times as high as the transition latency::
 479
 480                 # echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
 481
 482 ``up_threshold``
 483         If the estimated CPU load is above this value (in percent), the governor
 484         will set the frequency to the maximum value allowed for the policy.
 485         Otherwise, the selected frequency will be proportional to the estimated
 486         CPU load.
 487
 488 ``ignore_nice_load``
 489         If set to 1 (default 0), it will cause the CPU load estimation code to
 490         treat the CPU time spent on executing tasks with "nice" levels greater
 491         than 0 as CPU idle time.
 492
 493         This may be useful if there are tasks in the system that should not be
 494         taken into account when deciding what frequency to run the CPUs at.
 495         Then, to make that happen it is sufficient to increase the "nice" level
 496         of those tasks above 0 and set this attribute to 1.
 497
 498 ``sampling_down_factor``
 499         Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
 500         the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
 501
 502         This causes the next execution of the governor's worker routine (after
 503         setting the frequency to the allowed maximum) to be delayed, so the
 504         frequency stays at the maximum level for a longer time.
 505
 506         Frequency fluctuations in some bursty workloads may be avoided this way
 507         at the cost of additional energy spent on maintaining the maximum CPU
 508         capacity.
 509
 510 ``powersave_bias``
 511         Reduction factor to apply to the original frequency target of the
 512         governor (including the maximum value used when the ``up_threshold``
 513         value is exceeded by the estimated CPU load) or sensitivity threshold
 514         for the AMD frequency sensitivity powersave bias driver
 515         (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
 516         inclusive.
 517
 518         If the AMD frequency sensitivity powersave bias driver is not loaded,
 519         the effective frequency to apply is given by
 520
 521                 f * (1 - ``powersave_bias`` / 1000)
 522
 523         where f is the governor's original frequency target.  The default value
 524         of this attribute is 0 in that case.
 525
 526         If the AMD frequency sensitivity powersave bias driver is loaded, the
 527         value of this attribute is 400 by default and it is used in a different
 528         way.
 529
 530         On Family 16h (and later) AMD processors there is a mechanism to get a
 531         measured workload sensitivity, between 0 and 100% inclusive, from the
 532         hardware.  That value can be used to estimate how the performance of the
 533         workload running on a CPU will change in response to frequency changes.
 534
 535         The performance of a workload with the sensitivity of 0 (memory-bound or
 536         IO-bound) is not expected to increase at all as a result of increasing
 537         the CPU frequency, whereas workloads with the sensitivity of 100%
 538         (CPU-bound) are expected to perform much better if the CPU frequency is
 539         increased.
 540
 541         If the workload sensitivity is less than the threshold represented by
 542         the ``powersave_bias`` value, the sensitivity powersave bias driver
 543         will cause the governor to select a frequency lower than its original
 544         target, so as to avoid over-provisioning workloads that will not benefit
 545         from running at higher CPU frequencies.
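
For the case in which the AMD frequency sensitivity powersave bias driver is
not loaded, the effect of ``powersave_bias`` can be worked out with integer
arithmetic.  The ``powersave_bias_freq`` function below is a hypothetical
helper that only evaluates the formula given above.

```shell
# $1 = original frequency target f (kHz), $2 = powersave_bias (0 to 1000).
powersave_bias_freq() {
    # Reject values outside of the documented range.
    [ "$2" -ge 0 ] && [ "$2" -le 1000 ] || return 1
    # f * (1 - powersave_bias / 1000), rearranged to stay in integers.
    echo $(( $1 - $1 * $2 / 1000 ))
}
```

A ``powersave_bias`` of 100 therefore lowers a 2000000 kHz target by 10%, to
1800000 kHz, and a value of 0 leaves the target unchanged.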
 546
 547 ``conservative``
 548 ----------------
 549
 550 This governor uses CPU load as a CPU frequency selection metric.
 551
 552 It estimates the CPU load in the same way as the `ondemand`_ governor described
 553 above, but the CPU frequency selection algorithm implemented by it is different.
 554
 555 Namely, it avoids changing the frequency significantly over short time intervals,
 556 as such changes may not be suitable for systems with limited power supply capacity
 557 (e.g. battery-powered ones).  To achieve that, it changes the frequency in relatively
 558 small steps, one step at a time, up or down - depending on whether or not a
 559 (configurable) threshold has been exceeded by the estimated CPU load.
 560
 561 This governor exposes the following tunables:
 562
 563 ``freq_step``
 564         Frequency step in percent of the maximum frequency the governor is
 565         allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
 566         100 (5 by default).
 567
 568         This is how much the frequency is allowed to change in one go.  Setting
 569         it to 0 will cause the default frequency step (5 percent) to be used
 570         and setting it to 100 effectively causes the governor to periodically
 571         switch the frequency between the ``scaling_min_freq`` and
 572         ``scaling_max_freq`` policy limits.
 573
 574 ``down_threshold``
 575         Threshold value (in percent, 20 by default) used to determine the
 576         frequency change direction.
 577
 578         If the estimated CPU load is greater than this value, the frequency will
 579         go up (by ``freq_step``).  If the load is less than this value (and the
 580         ``sampling_down_factor`` mechanism is not in effect), the frequency will
 581         go down.  Otherwise, the frequency will not be changed.
 582
 583 ``sampling_down_factor``
 584         Frequency decrease deferral factor, between 1 (default) and 10
 585         inclusive.
 586
 587         It effectively causes the frequency to go down ``sampling_down_factor``
 588         times slower than it ramps up.
 589
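The way the tunables above interact can be illustrated with a short simulation.
This is a minimal sketch that follows the tunable descriptions literally; it is
not a transcription of the in-kernel governor.  The ``next_frequency`` helper
and its parameter names are hypothetical, frequencies are assumed to be in kHz,
and the ``sampling_down_factor`` deferral is deliberately left out.

```python
def next_frequency(cur_khz, load, min_khz, max_khz,
                   freq_step=5, down_threshold=20):
    """Return the frequency (kHz) to use after one sampling period.

    Hypothetical helper: it mirrors the documented tunables only and
    ignores the sampling_down_factor deferral.
    """
    # A freq_step of 0 falls back to the default 5 percent step.
    step_pct = freq_step if freq_step else 5
    # The step is a percentage of the maximum allowed frequency.
    step_khz = max_khz * step_pct // 100

    if load > down_threshold:
        # Go up by one step, never above the scaling_max_freq limit.
        return min(cur_khz + step_khz, max_khz)
    if load < down_threshold:
        # Go down by one step, never below the scaling_min_freq limit.
        return max(cur_khz - step_khz, min_khz)
    # Load equals the threshold: leave the frequency unchanged.
    return cur_khz


if __name__ == "__main__":
    freq = 800_000
    for load in (90, 90, 90, 5, 5):
        freq = next_frequency(freq, load, min_khz=800_000, max_khz=2_000_000)
        print(f"load={load:3d}%  next frequency={freq} kHz")
```

With ``freq_step=100`` the computed step spans the whole range, so each call
lands on one of the two policy limits, which matches the behavior described for
that setting.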
 590
 591 Frequency Boost Support
 592 =======================
 593
 594 Background
 595 ----------
 596
 597 Some processors support a mechanism to raise the operating frequency of some
 598 cores in a multicore package temporarily (and above the sustainable frequency
 599 threshold for the whole package) under certain conditions, for example if the
 600 whole chip is not fully utilized and below its intended thermal or power budget.
 601
 602 Different names are used by different vendors to refer to this functionality.
 603 Intel refers to it as "Turbo Boost", whereas AMD calls it "Turbo-Core" or (in
 604 technical documentation) "Core Performance Boost", and so on.  As a rule, it
 605 is also implemented differently by different vendors.  The simple term
 606 "frequency boost" is used here for brevity to refer to all of those
 607 implementations.
 608
 609 The frequency boost mechanism may be either hardware-based or software-based.
 610 If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
 611 made by the hardware (although in general it requires the hardware to be put
 612 into a special state in which it can control the CPU frequency within certain
 613 limits).  If it is software-based (e.g. on ARM), the scaling driver decides
 614 whether or not to trigger boosting and when to do that.
 615
 616 The ``boost`` File in ``sysfs``
 617 -------------------------------
 618
 619 This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
 620 the "boost" setting for the whole system.  It is not present if the underlying
 621 scaling driver does not support the frequency boost mechanism (or supports it,
 622 but provides a driver-specific interface for controlling it, like
 623 |intel_pstate|).
 624
 625 If the value in this file is 1, the frequency boost mechanism is enabled.  This
 626 means that either the hardware can be put into states in which it is able to
 627 trigger boosting (in the hardware-based case), or the software is allowed to
 628 trigger boosting (in the software-based case).  It does not mean that boosting
 629 is actually in use at the moment on any CPUs in the system.  It only grants
 630 permission to use the frequency boost mechanism (which still may never be used
 631 for other reasons).
 632
 633 If the value in this file is 0, the frequency boost mechanism is disabled and
 634 cannot be used at all.
 635
 636 The only values that can be written to this file are 0 and 1.
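
The semantics above can be wrapped in a pair of small helpers.  This is a
minimal sketch, assuming the file location given in this section; the
``read_boost`` and ``write_boost`` names are hypothetical, and writing to the
real file normally requires root privileges.

```python
from pathlib import Path

# Location given in the text above.  It is a parameter of both helpers so
# that they can also be exercised against an ordinary file.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


def read_boost(path=BOOST_PATH):
    """Return True for 1, False for 0, or None if the file is not present."""
    path = Path(path)
    if not path.exists():
        # No global knob: boost is unsupported by the scaling driver or is
        # controlled through a driver-specific interface.
        return None
    return path.read_text().strip() == "1"


def write_boost(enabled, path=BOOST_PATH):
    """Write 1 (True) or 0 (False) to the boost file."""
    if not isinstance(enabled, bool):
        raise ValueError("only 0 (False) and 1 (True) can be written")
    Path(path).write_text("1" if enabled else "0")


if __name__ == "__main__":
    state = read_boost()
    if state is None:
        print("no global boost file present")
    else:
        print("boosting permitted" if state else "boosting disabled")
```

A value of ``True`` returned by ``read_boost`` only reports that boosting is
permitted, not that any CPU is boosted at that moment.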
 637
 638 Rationale for Boost Control Knob
 639 --------------------------------
 640
 641 The frequency boost mechanism is generally intended to help to achieve optimum
 642 CPU performance on time scales below software resolution (e.g. below the
 643 scheduler tick interval) and it is demonstrably suitable for many workloads, but
 644 it may lead to problems in certain situations.
 645
 646 For this reason, many systems make it possible to disable the frequency boost
 647 mechanism in the platform firmware (BIOS) setup, but that requires the system to
 648 be restarted for the setting to be adjusted as desired, which may not be
 649 practical at least in some cases.  For example:
 650
 651   1. Boosting means overclocking the processor, although under controlled
 652      conditions.  Generally, the processor's energy consumption increases
 653      as a result of increasing its frequency and voltage, even temporarily.
 654      That may not be desirable on systems that switch to power sources of
 655      limited capacity, such as batteries, so the ability to disable the boost
 656      mechanism while the system is running may help there (but that depends on
 657      the workload too).
 658
 659   2. In some situations deterministic behavior is more important than
 660      performance or energy consumption (or both) and the ability to disable
 661      boosting while the system is running may be useful then.
 662
 663   3. To examine the impact of the frequency boost mechanism itself, it is useful
 664      to be able to run tests with and without boosting, preferably without
 665      restarting the system in the meantime.
 666
 667   4. Reproducible results are important when running benchmarks.  Since
 668      the boosting functionality depends on the load of the whole package,
 669      single-thread performance may vary because of it, which may sometimes
 670      lead to unreproducible results.  That can be avoided by disabling the
 671      frequency boost mechanism before running benchmarks sensitive to that
 672      issue.
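
Cases 3 and 4 above amount to turning boosting off around a measurement and
restoring the previous setting afterwards.  A minimal sketch of that pattern
follows, assuming the global ``boost`` file described earlier is present and
writable; ``boost_disabled`` and ``compare`` are hypothetical names, and
``run_benchmark`` stands for any caller-supplied callable.

```python
from contextlib import contextmanager
from pathlib import Path

# Global knob location as given earlier in this document.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


@contextmanager
def boost_disabled(path=BOOST_PATH):
    """Disable boosting inside the ``with`` block, then restore the old value."""
    path = Path(path)
    previous = path.read_text().strip()  # "0" or "1"
    path.write_text("0")
    try:
        yield
    finally:
        # Restore the original setting even if the benchmark raised.
        path.write_text(previous)


def compare(run_benchmark, path=BOOST_PATH):
    """Run a benchmark with the current boost setting and with boost off."""
    as_configured = run_benchmark()
    with boost_disabled(path):
        without_boost = run_benchmark()
    return as_configured, without_boost
```

Restoring the saved value in a ``finally`` clause keeps a failed benchmark run
from leaving the system with boosting unexpectedly disabled.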
 673
 674 Legacy AMD ``cpb`` Knob
 675 -----------------------
 676
 677 The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
 678 the global ``boost`` one.  It is used for disabling/enabling the "Core
 679 Performance Boost" feature of some AMD processors.
 680
 681 If present, that knob is located in every ``CPUFreq`` policy directory in
 682 ``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
 683 ``cpb``, which suggests a more fine-grained control interface.  The actual
 684 implementation, however, works on a system-wide basis and setting that knob
 685 for one policy causes the same value of it to be set for all of the other
 686 policies at the same time.
 687
 688 That knob is still supported on AMD processors that support its underlying
 689 hardware feature, but it may be configured out of the kernel (via the
 690 :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
 691 ``boost`` knob is present regardless.  Thus it is always possible to use the
 692 ``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that
 693 is more consistent with what all of the other systems do (and the ``cpb`` knob
 694 may not be supported any more in the future).
 695
 696 The ``cpb`` knob is never present for any processors without the underlying
 697 hardware feature (e.g. all Intel ones), even if the
 698 :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
 699
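Whether the legacy knob is exposed on a given system can be checked by looking
for ``cpb`` in each policy directory.  The sketch below is read-only and
assumes the directory layout described in this section; the
``legacy_cpb_values`` helper is a hypothetical name.

```python
from pathlib import Path

# Parent of the policyX directories, as given in the text above.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpufreq")


def legacy_cpb_values(cpufreq_dir=CPUFREQ_DIR):
    """Map each policy directory name to its ``cpb`` value, where present."""
    values = {}
    for knob in sorted(Path(cpufreq_dir).glob("policy*/cpb")):
        values[knob.parent.name] = knob.read_text().strip()
    return values


if __name__ == "__main__":
    values = legacy_cpb_values()
    if not values:
        print("no cpb knob present")
    else:
        for policy, value in values.items():
            print(f"{policy}: cpb={value}")
        # The knob acts system-wide, so a single distinct value is expected.
        print("distinct values:", len(set(values.values())))
```

Because the knob acts on a system-wide basis, all of the reported values are
expected to be the same.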
 700
 701 .. _Per-entity load tracking: https://lwn.net/Articles/531853/