Documentation/block/bfq-iosched.txt

   1 BFQ (Budget Fair Queueing)
   2 ==========================
   3
   4 BFQ is a proportional-share I/O scheduler, with some extra
   5 low-latency capabilities. In addition to cgroups support (blkio or io
   6 controllers), BFQ's main features are:
   7 - BFQ guarantees a high system and application responsiveness, and a
   8   low latency for time-sensitive applications, such as audio or video
   9   players;
  10 - BFQ distributes bandwidth, and not just time, among processes or
  11   groups (switching back to time distribution when needed to keep
  12   throughput high).
  13
  14 In its default configuration, BFQ privileges latency over
  15 throughput. So, when needed for achieving a lower latency, BFQ builds
  16 schedules that may lead to a lower throughput. If your main or only
  17 goal, for a given device, is to achieve the maximum-possible
  18 throughput at all times, then do switch off all low-latency heuristics
  19 for that device, by setting low_latency to 0. See Section 3 for
  20 details on how to configure BFQ for the desired tradeoff between
  21 latency and throughput, or on how to maximize throughput.
  22
  23 BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
  24 can process for a device scheduled with BFQ. To give an idea of the
  25 limits on slow or average CPUs, here are, first, the limits of BFQ for
  26 three different CPUs, on, respectively, an average laptop, an old
  27 desktop, and a cheap embedded system, in case full hierarchical
  28 support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
  29 CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
  30 - Intel i7-4850HQ: 400 KIOPS
  31 - AMD A8-3850: 250 KIOPS
  32 - ARM CortexTM-A53 Octa-core: 80 KIOPS
  33
  34 If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
  35 support is enabled), then the sustainable throughput with BFQ
  36 decreases, because all blkio.bfq* statistics are created and updated
  37 (Section 4-2). For BFQ, this leads to the following maximum
  38 sustainable throughputs, on the same systems as above:
  39 - Intel i7-4850HQ: 310 KIOPS
  40 - AMD A8-3850: 200 KIOPS
  41 - ARM CortexTM-A53 Octa-core: 56 KIOPS
  42
  43 BFQ works for multi-queue devices too.
  44
  45 The table of contents follow. Impatients can just jump to Section 3.
  46
  47 CONTENTS
  48
  49 1. When may BFQ be useful?
  50  1-1 Personal systems
  51  1-2 Server systems
  52 2. How does BFQ work?
  53 3. What are BFQ's tunables and how to properly configure BFQ?
  54 4. BFQ group scheduling
  55  4-1 Service guarantees provided
  56  4-2 Interface
  57
  58 1. When may BFQ be useful?
  59 ==========================
  60
  61 BFQ provides the following benefits on personal and server systems.
  62
  63 1-1 Personal systems
  64 --------------------
  65
  66 Low latency for interactive applications
  67
  68 Regardless of the actual background workload, BFQ guarantees that, for
  69 interactive tasks, the storage device is virtually as responsive as if
  70 it was idle. For example, even if one or more of the following
  71 background workloads are being executed:
  72 - one or more large files are being read, written or copied,
  73 - a tree of source files is being compiled,
  74 - one or more virtual machines are performing I/O,
  75 - a software update is in progress,
  76 - indexing daemons are scanning filesystems and updating their
  77   databases,
  78 starting an application or loading a file from within an application
  79 takes about the same time as if the storage device was idle. As a
  80 comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
  81 applications experience high latencies, or even become unresponsive
  82 until the background workload terminates (also on SSDs).
  83
  84 Low latency for soft real-time applications
  85
  86 Also soft real-time applications, such as audio and video
  87 players/streamers, enjoy a low latency and a low drop rate, regardless
  88 of the background I/O workload. As a consequence, these applications
  89 do not suffer from almost any glitch due to the background workload.
  90
  91 Higher speed for code-development tasks
  92
  93 If some additional workload happens to be executed in parallel, then
  94 BFQ executes the I/O-related components of typical code-development
  95 tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
  96 NOOP or DEADLINE.
  97
  98 High throughput
  99
 100 On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
 101 up to 150% higher throughput than DEADLINE and NOOP, with all the
 102 sequential workloads considered in our tests. With random workloads,
 103 and with all the workloads on flash-based devices, BFQ achieves,
 104 instead, about the same throughput as the other schedulers.
 105
 106 Strong fairness, bandwidth and delay guarantees
 107
 108 BFQ distributes the device throughput, and not just the device time,
 109 among I/O-bound applications in proportion their weights, with any
 110 workload and regardless of the device parameters. From these bandwidth
 111 guarantees, it is possible to compute tight per-I/O-request delay
 112 guarantees by a simple formula. If not configured for strict service
 113 guarantees, BFQ switches to time-based resource sharing (only) for
 114 applications that would otherwise cause a throughput loss.
 115
 116 1-2 Server systems
 117 ------------------
 118
 119 Most benefits for server systems follow from the same service
 120 properties as above. In particular, regardless of whether additional,
 121 possibly heavy workloads are being served, BFQ guarantees:
 122
 123 . audio and video-streaming with zero or very low jitter and drop
 124   rate;
 125
 126 . fast retrieval of WEB pages and embedded objects;
 127
 128 . real-time recording of data in live-dumping applications (e.g.,
 129   packet logging);
 130
 131 . responsiveness in local and remote access to a server.
 132
 133
 134 2. How does BFQ work?
 135 =====================
 136
 137 BFQ is a proportional-share I/O scheduler, whose general structure,
 138 plus a lot of code, are borrowed from CFQ.
 139
 140 - Each process doing I/O on a device is associated with a weight and a
 141   (bfq_)queue.
 142
 143 - BFQ grants exclusive access to the device, for a while, to one queue
 144   (process) at a time, and implements this service model by
 145   associating every queue with a budget, measured in number of
 146   sectors.
 147
 148   - After a queue is granted access to the device, the budget of the
 149     queue is decremented, on each request dispatch, by the size of the
 150     request.
 151
 152   - The in-service queue is expired, i.e., its service is suspended,
 153     only if one of the following events occurs: 1) the queue finishes
 154     its budget, 2) the queue empties, 3) a "budget timeout" fires.
 155
 156     - The budget timeout prevents processes doing random I/O from
 157       holding the device for too long and dramatically reducing
 158       throughput.
 159
 160     - Actually, as in CFQ, a queue associated with a process issuing
 161       sync requests may not be expired immediately when it empties. In
 162       contrast, BFQ may idle the device for a short time interval,
 163       giving the process the chance to go on being served if it issues
 164       a new request in time. Device idling typically boosts the
 165       throughput on rotational devices and on non-queueing flash-based
 166       devices, if processes do synchronous and sequential I/O. In
 167       addition, under BFQ, device idling is also instrumental in
 168       guaranteeing the desired throughput fraction to processes
 169       issuing sync requests (see the description of the slice_idle
 170       tunable in this document, or [1, 2], for more details).
 171
 172       - With respect to idling for service guarantees, if several
 173         processes are competing for the device at the same time, but
 174         all processes and groups have the same weight, then BFQ
 175         guarantees the expected throughput distribution without ever
 176         idling the device. Throughput is thus as high as possible in
 177         this common scenario.
 178
 179      - On flash-based storage with internal queueing of commands
 180        (typically NCQ), device idling happens to be always detrimental
 181        for throughput. So, with these devices, BFQ performs idling
 182        only when strictly needed for service guarantees, i.e., for
 183        guaranteeing low latency or fairness. In these cases, overall
 184        throughput may be sub-optimal. No solution currently exists to
 185        provide both strong service guarantees and optimal throughput
 186        on devices with internal queueing.
 187
 188   - If low-latency mode is enabled (default configuration), BFQ
 189     executes some special heuristics to detect interactive and soft
 190     real-time applications (e.g., video or audio players/streamers),
 191     and to reduce their latency. The most important action taken to
 192     achieve this goal is to give to the queues associated with these
 193     applications more than their fair share of the device
 194     throughput. For brevity, we call just "weight-raising" the whole
 195     sets of actions taken by BFQ to privilege these queues. In
 196     particular, BFQ provides a milder form of weight-raising for
 197     interactive applications, and a stronger form for soft real-time
 198     applications.
 199
 200   - BFQ automatically deactivates idling for queues born in a burst of
 201     queue creations. In fact, these queues are usually associated with
 202     the processes of applications and services that benefit mostly
 203     from a high throughput. Examples are systemd during boot, or git
 204     grep.
 205
 206   - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
 207     performing random I/O that becomes mostly sequential if
 208     merged. Differently from CFQ, BFQ achieves this goal with a more
 209     reactive mechanism, called Early Queue Merge (EQM). EQM is so
 210     responsive in detecting interleaved I/O (cooperating processes),
 211     that it enables BFQ to achieve a high throughput, by queue
 212     merging, even for queues for which CFQ needs a different
 213     mechanism, preemption, to get a high throughput. As such EQM is a
 214     unified mechanism to achieve a high throughput with interleaved
 215     I/O.
 216
 217   - Queues are scheduled according to a variant of WF2Q+, named
 218     B-WF2Q+, and implemented using an augmented rb-tree to preserve an
 219     O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
 220     also ready for hierarchical scheduling, details in Section 4.
 221
 222   - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
 223     perfectly fair, and smooth service. In particular, B-WF2Q+
 224     guarantees that each queue receives a fraction of the device
 225     throughput proportional to its weight, even if the throughput
 226     fluctuates, and regardless of: the device parameters, the current
 227     workload and the budgets assigned to the queue.
 228
 229   - The last, budget-independence, property (although probably
 230     counterintuitive in the first place) is definitely beneficial, for
 231     the following reasons:
 232
 233     - First, with any proportional-share scheduler, the maximum
 234       deviation with respect to an ideal service is proportional to
 235       the maximum budget (slice) assigned to queues. As a consequence,
 236       BFQ can keep this deviation tight not only because of the
 237       accurate service of B-WF2Q+, but also because BFQ *does not*
 238       need to assign a larger budget to a queue to let the queue
 239       receive a higher fraction of the device throughput.
 240
 241     - Second, BFQ is free to choose, for every process (queue), the
 242       budget that best fits the needs of the process, or best
 243       leverages the I/O pattern of the process. In particular, BFQ
 244       updates queue budgets with a simple feedback-loop algorithm that
 245       allows a high throughput to be achieved, while still providing
 246       tight latency guarantees to time-sensitive applications. When
 247       the in-service queue expires, this algorithm computes the next
 248       budget of the queue so as to:
 249
 250       - Let large budgets be eventually assigned to the queues
 251         associated with I/O-bound applications performing sequential
 252         I/O: in fact, the longer these applications are served once
 253         got access to the device, the higher the throughput is.
 254
 255       - Let small budgets be eventually assigned to the queues
 256         associated with time-sensitive applications (which typically
 257         perform sporadic and short I/O), because, the smaller the
 258         budget assigned to a queue waiting for service is, the sooner
 259         B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
 260
 261 - If several processes are competing for the device at the same time,
 262   but all processes and groups have the same weight, then BFQ
 263   guarantees the expected throughput distribution without ever idling
 264   the device. It uses preemption instead. Throughput is then much
 265   higher in this common scenario.
 266
 267 - ioprio classes are served in strict priority order, i.e.,
 268   lower-priority queues are not served as long as there are
 269   higher-priority queues.  Among queues in the same class, the
 270   bandwidth is distributed in proportion to the weight of each
 271   queue. A very thin extra bandwidth is however guaranteed to
 272   the Idle class, to prevent it from starving.
 273
 274
 275 3. What are BFQ's tunables and how to properly configure BFQ?
 276 =============================================================
 277
 278 Most BFQ tunables affect service guarantees (basically latency and
 279 fairness) and throughput. For full details on how to choose the
 280 desired tradeoff between service guarantees and throughput, see the
 281 parameters slice_idle, strict_guarantees and low_latency. For details
 282 on how to maximise throughput, see slice_idle, timeout_sync and
 283 max_budget. The other performance-related parameters have been
 284 inherited from, and have been preserved mostly for compatibility with
 285 CFQ. So far, no performance improvement has been reported after
 286 changing the latter parameters in BFQ.
 287
 288 In particular, the tunables back_seek-max, back_seek_penalty,
 289 fifo_expire_async and fifo_expire_sync below are the same as in
 290 CFQ. Their description is just copied from that for CFQ. Some
 291 considerations in the description of slice_idle are copied from CFQ
 292 too.
 293
 294 per-process ioprio and weight
 295 -----------------------------
 296
 297 Unless the cgroups interface is used (see "4. BFQ group scheduling"),
 298 weights can be assigned to processes only indirectly, through I/O
 299 priorities, and according to the relation:
 300 weight = (IOPRIO_BE_NR - ioprio) * 10.
 301
 302 Beware that, if low-latency is set, then BFQ automatically raises the
 303 weight of the queues associated with interactive and soft real-time
 304 applications. Unset this tunable if you need/want to control weights.
 305
 306 slice_idle
 307 ----------
 308
 309 This parameter specifies how long BFQ should idle for next I/O
 310 request, when certain sync BFQ queues become empty. By default
 311 slice_idle is a non-zero value. Idling has a double purpose: boosting
 312 throughput and making sure that the desired throughput distribution is
 313 respected (see the description of how BFQ works, and, if needed, the
 314 papers referred there).
 315
 316 As for throughput, idling can be very helpful on highly seeky media
 317 like single spindle SATA/SAS disks where we can cut down on overall
 318 number of seeks and see improved throughput.
 319
 320 Setting slice_idle to 0 will remove all the idling on queues and one
 321 should see an overall improved throughput on faster storage devices
 322 like multiple SATA/SAS disks in hardware RAID configuration, as well
 323 as flash-based storage with internal command queueing (and
 324 parallelism).
 325
 326 So depending on storage and workload, it might be useful to set
 327 slice_idle=0.  In general for SATA/SAS disks and software RAID of
 328 SATA/SAS disks keeping slice_idle enabled should be useful. For any
 329 configurations where there are multiple spindles behind single LUN
 330 (Host based hardware RAID controller or for storage arrays), or with
 331 flash-based fast storage, setting slice_idle=0 might end up in better
 332 throughput and acceptable latencies.
 333
 334 Idling is however necessary to have service guarantees enforced in
 335 case of differentiated weights or differentiated I/O-request lengths.
 336 To see why, suppose that a given BFQ queue A must get several I/O
 337 requests served for each request served for another queue B. Idling
 338 ensures that, if A makes a new I/O request slightly after becoming
 339 empty, then no request of B is dispatched in the middle, and thus A
 340 does not lose the possibility to get more than one request dispatched
 341 before the next request of B is dispatched. Note that idling
 342 guarantees the desired differentiated treatment of queues only in
 343 terms of I/O-request dispatches. To guarantee that the actual service
 344 order then corresponds to the dispatch order, the strict_guarantees
 345 tunable must be set too.
 346
 347 There is an important flipside for idling: apart from the above cases
 348 where it is beneficial also for throughput, idling can severely impact
 349 throughput. One important case is random workload. Because of this
 350 issue, BFQ tends to avoid idling as much as possible, when it is not
 351 beneficial also for throughput (as detailed in Section 2). As a
 352 consequence of this behavior, and of further issues described for the
 353 strict_guarantees tunable, short-term service guarantees may be
 354 occasionally violated. And, in some cases, these guarantees may be
 355 more important than guaranteeing maximum throughput. For example, in
 356 video playing/streaming, a very low drop rate may be more important
 357 than maximum throughput. In these cases, consider setting the
 358 strict_guarantees parameter.
 359
 360 strict_guarantees
 361 -----------------
 362
 363 If this parameter is set (default: unset), then BFQ
 364
 365 - always performs idling when the in-service queue becomes empty;
 366
 367 - forces the device to serve one I/O request at a time, by dispatching a
 368   new request only if there is no outstanding request.
 369
 370 In the presence of differentiated weights or I/O-request sizes, both
 371 the above conditions are needed to guarantee that every BFQ queue
 372 receives its allotted share of the bandwidth. The first condition is
 373 needed for the reasons explained in the description of the slice_idle
 374 tunable.  The second condition is needed because all modern storage
 375 devices reorder internally-queued requests, which may trivially break
 376 the service guarantees enforced by the I/O scheduler.
 377
 378 Setting strict_guarantees may evidently affect throughput.
 379
 380 back_seek_max
 381 -------------
 382
 383 This specifies, given in Kbytes, the maximum "distance" for backward seeking.
 384 The distance is the amount of space from the current head location to the
 385 sectors that are backward in terms of distance.
 386
 387 This parameter allows the scheduler to anticipate requests in the "backward"
 388 direction and consider them as being the "next" if they are within this
 389 distance from the current head location.
 390
 391 back_seek_penalty
 392 -----------------
 393
 394 This parameter is used to compute the cost of backward seeking. If the
 395 backward distance of request is just 1/back_seek_penalty from a "front"
 396 request, then the seeking cost of two requests is considered equivalent.
 397
 398 So scheduler will not bias toward one or the other request (otherwise scheduler
 399 will bias toward front request). Default value of back_seek_penalty is 2.
 400
 401 fifo_expire_async
 402 -----------------
 403
 404 This parameter is used to set the timeout of asynchronous requests. Default
 405 value of this is 248ms.
 406
 407 fifo_expire_sync
 408 ----------------
 409
 410 This parameter is used to set the timeout of synchronous requests. Default
 411 value of this is 124ms. In case to favor synchronous requests over asynchronous
 412 one, this value should be decreased relative to fifo_expire_async.
 413
 414 low_latency
 415 -----------
 416
 417 This parameter is used to enable/disable BFQ's low latency mode. By
 418 default, low latency mode is enabled. If enabled, interactive and soft
 419 real-time applications are privileged and experience a lower latency,
 420 as explained in more detail in the description of how BFQ works.
 421
 422 DISABLE this mode if you need full control on bandwidth
 423 distribution. In fact, if it is enabled, then BFQ automatically
 424 increases the bandwidth share of privileged applications, as the main
 425 means to guarantee a lower latency to them.
 426
 427 In addition, as already highlighted at the beginning of this document,
 428 DISABLE this mode if your only goal is to achieve a high throughput.
 429 In fact, privileging the I/O of some application over the rest may
 430 entail a lower throughput. To achieve the highest-possible throughput
 431 on a non-rotational device, setting slice_idle to 0 may be needed too
 432 (at the cost of giving up any strong guarantee on fairness and low
 433 latency).
 434
 435 timeout_sync
 436 ------------
 437
 438 Maximum amount of device time that can be given to a task (queue) once
 439 it has been selected for service. On devices with costly seeks,
 440 increasing this time usually increases maximum throughput. On the
 441 opposite end, increasing this time coarsens the granularity of the
 442 short-term bandwidth and latency guarantees, especially if the
 443 following parameter is set to zero.
 444
 445 max_budget
 446 ----------
 447
 448 Maximum amount of service, measured in sectors, that can be provided
 449 to a BFQ queue once it is set in service (of course within the limits
 450 of the above timeout). According to what said in the description of
 451 the algorithm, larger values increase the throughput in proportion to
 452 the percentage of sequential I/O requests issued. The price of larger
 453 values is that they coarsen the granularity of short-term bandwidth
 454 and latency guarantees.
 455
 456 The default value is 0, which enables auto-tuning: BFQ sets max_budget
 457 to the maximum number of sectors that can be served during
 458 timeout_sync, according to the estimated peak rate.
 459
 460 For specific devices, some users have occasionally reported to have
 461 reached a higher throughput by setting max_budget explicitly, i.e., by
 462 setting max_budget to a higher value than 0. In particular, they have
 463 set max_budget to higher values than those to which BFQ would have set
 464 it with auto-tuning. An alternative way to achieve this goal is to
 465 just increase the value of timeout_sync, leaving max_budget equal to 0.
 466
 467 weights
 468 -------
 469
 470 Read-only parameter, used to show the weights of the currently active
 471 BFQ queues.
 472
 473
 474 4. Group scheduling with BFQ
 475 ============================
 476
 477 BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
 478 blkio and io. In particular, BFQ supports weight-based proportional
 479 share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
 480
 481 4-1 Service guarantees provided
 482 -------------------------------
 483
 484 With BFQ, proportional share means true proportional share of the
 485 device bandwidth, according to group weights. For example, a group
 486 with weight 200 gets twice the bandwidth, and not just twice the time,
 487 of a group with weight 100.
 488
 489 BFQ supports hierarchies (group trees) of any depth. Bandwidth is
 490 distributed among groups and processes in the expected way: for each
 491 group, the children of the group share the whole bandwidth of the
 492 group in proportion to their weights. In particular, this implies
 493 that, for each leaf group, every process of the group receives the
 494 same share of the whole group bandwidth, unless the ioprio of the
 495 process is modified.
 496
 497 The resource-sharing guarantee for a group may partially or totally
 498 switch from bandwidth to time, if providing bandwidth guarantees to
 499 the group lowers the throughput too much. This switch occurs on a
 500 per-process basis: if a process of a leaf group causes throughput loss
 501 if served in such a way to receive its share of the bandwidth, then
 502 BFQ switches back to just time-based proportional share for that
 503 process.
 504
 505 4-2 Interface
 506 -------------
 507
 508 To get proportional sharing of bandwidth with BFQ for a given device,
 509 BFQ must of course be the active scheduler for that device.
 510
 511 Within each group directory, the names of the files associated with
 512 BFQ-specific cgroup parameters and stats begin with the "bfq."
 513 prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
 514 BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
 515 parameter to set the weight of a group with BFQ is blkio.bfq.weight
 516 or io.bfq.weight.
 517
 518 As for cgroups-v1 (blkio controller), the exact set of stat files
 519 created, and kept up-to-date by bfq, depends on whether
 520 CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
 521 the stat files documented in
 522 Documentation/cgroup-v1/blkio-controller.txt. If, instead,
 523 CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
 524 blkio.bfq.io_service_bytes
 525 blkio.bfq.io_service_bytes_recursive
 526 blkio.bfq.io_serviced
 527 blkio.bfq.io_serviced_recursive
 528
 529 The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
 530 throughput sustainable with bfq, because updating the blkio.bfq.*
 531 stats is rather costly, especially for some of the stats enabled by
 532 CONFIG_DEBUG_BLK_CGROUP.
 533
 534 Parameters to set
 535 -----------------
 536
 537 For each group, there is only the following parameter to set.
 538
 539 weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
 540 group inside its parent. Available values: 1..10000 (default 100). The
 541 linear mapping between ioprio and weights, described at the beginning
 542 of the tunable section, is still valid, but all weights higher than
 543 IOPRIO_BE_NR*10 are mapped to ioprio 0.
 544
 545 Recall that, if low-latency is set, then BFQ automatically raises the
 546 weight of the queues associated with interactive and soft real-time
 547 applications. Unset this tunable if you need/want to control weights.
 548
 549
 550 [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 551     Scheduler", Proceedings of the First Workshop on Mobile System
 552     Technologies (MST-2015), May 2015.
 553     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
 554
 555 [2] P. Valente and M. Andreolini, "Improving Application
 556     Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
 557     the 5th Annual International Systems and Storage Conference
 558     (SYSTOR '12), June 2012.
 559     Slightly extended version:
 560     http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
 561                                                         results.pdf