Documentation/block/bfq-iosched.txt

   1 BFQ (Budget Fair Queueing)
   2 ==========================
   3
   4 BFQ is a proportional-share I/O scheduler, with some extra
   5 low-latency capabilities. In addition to cgroups support (blkio or io
   6 controllers), BFQ's main features are:
   7 - BFQ guarantees a high system and application responsiveness, and a
   8   low latency for time-sensitive applications, such as audio or video
   9   players;
  10 - BFQ distributes bandwidth, and not just time, among processes or
  11   groups (switching back to time distribution when needed to keep
  12   throughput high).
  13
  14 In its default configuration, BFQ privileges latency over
  15 throughput. So, when needed for achieving a lower latency, BFQ builds
  16 schedules that may lead to a lower throughput. If your main or only
  17 goal, for a given device, is to achieve the maximum-possible
  18 throughput at all times, then do switch off all low-latency heuristics
  19 for that device, by setting low_latency to 0. See Section 3 for
  20 details on how to configure BFQ for the desired tradeoff between
  21 latency and throughput, or on how to maximize throughput.
  22
  23 On average CPUs, the current version of BFQ can handle devices
  24 performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
  25 reference, 30-50 KIOPS correspond to very high bandwidths with
  26 sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
  27 to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
  28 multi-queue devices too.
  29
  30 The table of contents follow. Impatients can just jump to Section 3.
  31
  32 CONTENTS
  33
  34 1. When may BFQ be useful?
  35  1-1 Personal systems
  36  1-2 Server systems
  37 2. How does BFQ work?
  38 3. What are BFQ's tunables and how to properly configure BFQ?
  39 4. BFQ group scheduling
  40  4-1 Service guarantees provided
  41  4-2 Interface
  42
  43 1. When may BFQ be useful?
  44 ==========================
  45
  46 BFQ provides the following benefits on personal and server systems.
  47
  48 1-1 Personal systems
  49 --------------------
  50
  51 Low latency for interactive applications
  52
  53 Regardless of the actual background workload, BFQ guarantees that, for
  54 interactive tasks, the storage device is virtually as responsive as if
  55 it was idle. For example, even if one or more of the following
  56 background workloads are being executed:
  57 - one or more large files are being read, written or copied,
  58 - a tree of source files is being compiled,
  59 - one or more virtual machines are performing I/O,
  60 - a software update is in progress,
  61 - indexing daemons are scanning filesystems and updating their
  62   databases,
  63 starting an application or loading a file from within an application
  64 takes about the same time as if the storage device was idle. As a
  65 comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
  66 applications experience high latencies, or even become unresponsive
  67 until the background workload terminates (also on SSDs).
  68
  69 Low latency for soft real-time applications
  70
  71 Also soft real-time applications, such as audio and video
  72 players/streamers, enjoy a low latency and a low drop rate, regardless
  73 of the background I/O workload. As a consequence, these applications
  74 do not suffer from almost any glitch due to the background workload.
  75
  76 Higher speed for code-development tasks
  77
  78 If some additional workload happens to be executed in parallel, then
  79 BFQ executes the I/O-related components of typical code-development
  80 tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
  81 NOOP or DEADLINE.
  82
  83 High throughput
  84
  85 On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
  86 up to 150% higher throughput than DEADLINE and NOOP, with all the
  87 sequential workloads considered in our tests. With random workloads,
  88 and with all the workloads on flash-based devices, BFQ achieves,
  89 instead, about the same throughput as the other schedulers.
  90
  91 Strong fairness, bandwidth and delay guarantees
  92
  93 BFQ distributes the device throughput, and not just the device time,
  94 among I/O-bound applications in proportion their weights, with any
  95 workload and regardless of the device parameters. From these bandwidth
  96 guarantees, it is possible to compute tight per-I/O-request delay
  97 guarantees by a simple formula. If not configured for strict service
  98 guarantees, BFQ switches to time-based resource sharing (only) for
  99 applications that would otherwise cause a throughput loss.
 100
 101 1-2 Server systems
 102 ------------------
 103
 104 Most benefits for server systems follow from the same service
 105 properties as above. In particular, regardless of whether additional,
 106 possibly heavy workloads are being served, BFQ guarantees:
 107
 108 . audio and video-streaming with zero or very low jitter and drop
 109   rate;
 110
 111 . fast retrieval of WEB pages and embedded objects;
 112
 113 . real-time recording of data in live-dumping applications (e.g.,
 114   packet logging);
 115
 116 . responsiveness in local and remote access to a server.
 117
 118
 119 2. How does BFQ work?
 120 =====================
 121
 122 BFQ is a proportional-share I/O scheduler, whose general structure,
 123 plus a lot of code, are borrowed from CFQ.
 124
 125 - Each process doing I/O on a device is associated with a weight and a
 126   (bfq_)queue.
 127
 128 - BFQ grants exclusive access to the device, for a while, to one queue
 129   (process) at a time, and implements this service model by
 130   associating every queue with a budget, measured in number of
 131   sectors.
 132
 133   - After a queue is granted access to the device, the budget of the
 134     queue is decremented, on each request dispatch, by the size of the
 135     request.
 136
 137   - The in-service queue is expired, i.e., its service is suspended,
 138     only if one of the following events occurs: 1) the queue finishes
 139     its budget, 2) the queue empties, 3) a "budget timeout" fires.
 140
 141     - The budget timeout prevents processes doing random I/O from
 142       holding the device for too long and dramatically reducing
 143       throughput.
 144
 145     - Actually, as in CFQ, a queue associated with a process issuing
 146       sync requests may not be expired immediately when it empties. In
 147       contrast, BFQ may idle the device for a short time interval,
 148       giving the process the chance to go on being served if it issues
 149       a new request in time. Device idling typically boosts the
 150       throughput on rotational devices and on non-queueing flash-based
 151       devices, if processes do synchronous and sequential I/O. In
 152       addition, under BFQ, device idling is also instrumental in
 153       guaranteeing the desired throughput fraction to processes
 154       issuing sync requests (see the description of the slice_idle
 155       tunable in this document, or [1, 2], for more details).
 156
 157       - With respect to idling for service guarantees, if several
 158         processes are competing for the device at the same time, but
 159         all processes and groups have the same weight, then BFQ
 160         guarantees the expected throughput distribution without ever
 161         idling the device. Throughput is thus as high as possible in
 162         this common scenario.
 163
 164      - On flash-based storage with internal queueing of commands
 165        (typically NCQ), device idling happens to be always detrimental
 166        for throughput. So, with these devices, BFQ performs idling
 167        only when strictly needed for service guarantees, i.e., for
 168        guaranteeing low latency or fairness. In these cases, overall
 169        throughput may be sub-optimal. No solution currently exists to
 170        provide both strong service guarantees and optimal throughput
 171        on devices with internal queueing.
 172
 173   - If low-latency mode is enabled (default configuration), BFQ
 174     executes some special heuristics to detect interactive and soft
 175     real-time applications (e.g., video or audio players/streamers),
 176     and to reduce their latency. The most important action taken to
 177     achieve this goal is to give to the queues associated with these
 178     applications more than their fair share of the device
 179     throughput. For brevity, we call just "weight-raising" the whole
 180     sets of actions taken by BFQ to privilege these queues. In
 181     particular, BFQ provides a milder form of weight-raising for
 182     interactive applications, and a stronger form for soft real-time
 183     applications.
 184
 185   - BFQ automatically deactivates idling for queues born in a burst of
 186     queue creations. In fact, these queues are usually associated with
 187     the processes of applications and services that benefit mostly
 188     from a high throughput. Examples are systemd during boot, or git
 189     grep.
 190
 191   - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
 192     performing random I/O that becomes mostly sequential if
 193     merged. Differently from CFQ, BFQ achieves this goal with a more
 194     reactive mechanism, called Early Queue Merge (EQM). EQM is so
 195     responsive in detecting interleaved I/O (cooperating processes),
 196     that it enables BFQ to achieve a high throughput, by queue
 197     merging, even for queues for which CFQ needs a different
 198     mechanism, preemption, to get a high throughput. As such EQM is a
 199     unified mechanism to achieve a high throughput with interleaved
 200     I/O.
 201
 202   - Queues are scheduled according to a variant of WF2Q+, named
 203     B-WF2Q+, and implemented using an augmented rb-tree to preserve an
 204     O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
 205     also ready for hierarchical scheduling, details in Section 4.
 206
 207   - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
 208     perfectly fair, and smooth service. In particular, B-WF2Q+
 209     guarantees that each queue receives a fraction of the device
 210     throughput proportional to its weight, even if the throughput
 211     fluctuates, and regardless of: the device parameters, the current
 212     workload and the budgets assigned to the queue.
 213
 214   - The last, budget-independence, property (although probably
 215     counterintuitive in the first place) is definitely beneficial, for
 216     the following reasons:
 217
 218     - First, with any proportional-share scheduler, the maximum
 219       deviation with respect to an ideal service is proportional to
 220       the maximum budget (slice) assigned to queues. As a consequence,
 221       BFQ can keep this deviation tight not only because of the
 222       accurate service of B-WF2Q+, but also because BFQ *does not*
 223       need to assign a larger budget to a queue to let the queue
 224       receive a higher fraction of the device throughput.
 225
 226     - Second, BFQ is free to choose, for every process (queue), the
 227       budget that best fits the needs of the process, or best
 228       leverages the I/O pattern of the process. In particular, BFQ
 229       updates queue budgets with a simple feedback-loop algorithm that
 230       allows a high throughput to be achieved, while still providing
 231       tight latency guarantees to time-sensitive applications. When
 232       the in-service queue expires, this algorithm computes the next
 233       budget of the queue so as to:
 234
 235       - Let large budgets be eventually assigned to the queues
 236         associated with I/O-bound applications performing sequential
 237         I/O: in fact, the longer these applications are served once
 238         got access to the device, the higher the throughput is.
 239
 240       - Let small budgets be eventually assigned to the queues
 241         associated with time-sensitive applications (which typically
 242         perform sporadic and short I/O), because, the smaller the
 243         budget assigned to a queue waiting for service is, the sooner
 244         B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
 245
 246 - If several processes are competing for the device at the same time,
 247   but all processes and groups have the same weight, then BFQ
 248   guarantees the expected throughput distribution without ever idling
 249   the device. It uses preemption instead. Throughput is then much
 250   higher in this common scenario.
 251
 252 - ioprio classes are served in strict priority order, i.e.,
 253   lower-priority queues are not served as long as there are
 254   higher-priority queues.  Among queues in the same class, the
 255   bandwidth is distributed in proportion to the weight of each
 256   queue. A very thin extra bandwidth is however guaranteed to
 257   the Idle class, to prevent it from starving.
 258
 259
 260 3. What are BFQ's tunables and how to properly configure BFQ?
 261 =============================================================
 262
 263 Most BFQ tunables affect service guarantees (basically latency and
 264 fairness) and throughput. For full details on how to choose the
 265 desired tradeoff between service guarantees and throughput, see the
 266 parameters slice_idle, strict_guarantees and low_latency. For details
 267 on how to maximise throughput, see slice_idle, timeout_sync and
 268 max_budget. The other performance-related parameters have been
 269 inherited from, and have been preserved mostly for compatibility with
 270 CFQ. So far, no performance improvement has been reported after
 271 changing the latter parameters in BFQ.
 272
 273 In particular, the tunables back_seek-max, back_seek_penalty,
 274 fifo_expire_async and fifo_expire_sync below are the same as in
 275 CFQ. Their description is just copied from that for CFQ. Some
 276 considerations in the description of slice_idle are copied from CFQ
 277 too.
 278
 279 per-process ioprio and weight
 280 -----------------------------
 281
 282 Unless the cgroups interface is used (see "4. BFQ group scheduling"),
 283 weights can be assigned to processes only indirectly, through I/O
 284 priorities, and according to the relation:
 285 weight = (IOPRIO_BE_NR - ioprio) * 10.
 286
 287 Beware that, if low-latency is set, then BFQ automatically raises the
 288 weight of the queues associated with interactive and soft real-time
 289 applications. Unset this tunable if you need/want to control weights.
 290
 291 slice_idle
 292 ----------
 293
 294 This parameter specifies how long BFQ should idle for next I/O
 295 request, when certain sync BFQ queues become empty. By default
 296 slice_idle is a non-zero value. Idling has a double purpose: boosting
 297 throughput and making sure that the desired throughput distribution is
 298 respected (see the description of how BFQ works, and, if needed, the
 299 papers referred there).
 300
 301 As for throughput, idling can be very helpful on highly seeky media
 302 like single spindle SATA/SAS disks where we can cut down on overall
 303 number of seeks and see improved throughput.
 304
 305 Setting slice_idle to 0 will remove all the idling on queues and one
 306 should see an overall improved throughput on faster storage devices
 307 like multiple SATA/SAS disks in hardware RAID configuration, as well
 308 as flash-based storage with internal command queueing (and
 309 parallelism).
 310
 311 So depending on storage and workload, it might be useful to set
 312 slice_idle=0.  In general for SATA/SAS disks and software RAID of
 313 SATA/SAS disks keeping slice_idle enabled should be useful. For any
 314 configurations where there are multiple spindles behind single LUN
 315 (Host based hardware RAID controller or for storage arrays), or with
 316 flash-based fast storage, setting slice_idle=0 might end up in better
 317 throughput and acceptable latencies.
 318
 319 Idling is however necessary to have service guarantees enforced in
 320 case of differentiated weights or differentiated I/O-request lengths.
 321 To see why, suppose that a given BFQ queue A must get several I/O
 322 requests served for each request served for another queue B. Idling
 323 ensures that, if A makes a new I/O request slightly after becoming
 324 empty, then no request of B is dispatched in the middle, and thus A
 325 does not lose the possibility to get more than one request dispatched
 326 before the next request of B is dispatched. Note that idling
 327 guarantees the desired differentiated treatment of queues only in
 328 terms of I/O-request dispatches. To guarantee that the actual service
 329 order then corresponds to the dispatch order, the strict_guarantees
 330 tunable must be set too.
 331
 332 There is an important flipside for idling: apart from the above cases
 333 where it is beneficial also for throughput, idling can severely impact
 334 throughput. One important case is random workload. Because of this
 335 issue, BFQ tends to avoid idling as much as possible, when it is not
 336 beneficial also for throughput (as detailed in Section 2). As a
 337 consequence of this behavior, and of further issues described for the
 338 strict_guarantees tunable, short-term service guarantees may be
 339 occasionally violated. And, in some cases, these guarantees may be
 340 more important than guaranteeing maximum throughput. For example, in
 341 video playing/streaming, a very low drop rate may be more important
 342 than maximum throughput. In these cases, consider setting the
 343 strict_guarantees parameter.
 344
 345 strict_guarantees
 346 -----------------
 347
 348 If this parameter is set (default: unset), then BFQ
 349
 350 - always performs idling when the in-service queue becomes empty;
 351
 352 - forces the device to serve one I/O request at a time, by dispatching a
 353   new request only if there is no outstanding request.
 354
 355 In the presence of differentiated weights or I/O-request sizes, both
 356 the above conditions are needed to guarantee that every BFQ queue
 357 receives its allotted share of the bandwidth. The first condition is
 358 needed for the reasons explained in the description of the slice_idle
 359 tunable.  The second condition is needed because all modern storage
 360 devices reorder internally-queued requests, which may trivially break
 361 the service guarantees enforced by the I/O scheduler.
 362
 363 Setting strict_guarantees may evidently affect throughput.
 364
 365 back_seek_max
 366 -------------
 367
 368 This specifies, given in Kbytes, the maximum "distance" for backward seeking.
 369 The distance is the amount of space from the current head location to the
 370 sectors that are backward in terms of distance.
 371
 372 This parameter allows the scheduler to anticipate requests in the "backward"
 373 direction and consider them as being the "next" if they are within this
 374 distance from the current head location.
 375
 376 back_seek_penalty
 377 -----------------
 378
 379 This parameter is used to compute the cost of backward seeking. If the
 380 backward distance of request is just 1/back_seek_penalty from a "front"
 381 request, then the seeking cost of two requests is considered equivalent.
 382
 383 So scheduler will not bias toward one or the other request (otherwise scheduler
 384 will bias toward front request). Default value of back_seek_penalty is 2.
 385
 386 fifo_expire_async
 387 -----------------
 388
 389 This parameter is used to set the timeout of asynchronous requests. Default
 390 value of this is 248ms.
 391
 392 fifo_expire_sync
 393 ----------------
 394
 395 This parameter is used to set the timeout of synchronous requests. Default
 396 value of this is 124ms. In case to favor synchronous requests over asynchronous
 397 one, this value should be decreased relative to fifo_expire_async.
 398
 399 low_latency
 400 -----------
 401
 402 This parameter is used to enable/disable BFQ's low latency mode. By
 403 default, low latency mode is enabled. If enabled, interactive and soft
 404 real-time applications are privileged and experience a lower latency,
 405 as explained in more detail in the description of how BFQ works.
 406
 407 DISABLE this mode if you need full control on bandwidth
 408 distribution. In fact, if it is enabled, then BFQ automatically
 409 increases the bandwidth share of privileged applications, as the main
 410 means to guarantee a lower latency to them.
 411
 412 In addition, as already highlighted at the beginning of this document,
 413 DISABLE this mode if your only goal is to achieve a high throughput.
 414 In fact, privileging the I/O of some application over the rest may
 415 entail a lower throughput. To achieve the highest-possible throughput
 416 on a non-rotational device, setting slice_idle to 0 may be needed too
 417 (at the cost of giving up any strong guarantee on fairness and low
 418 latency).
 419
 420 timeout_sync
 421 ------------
 422
 423 Maximum amount of device time that can be given to a task (queue) once
 424 it has been selected for service. On devices with costly seeks,
 425 increasing this time usually increases maximum throughput. On the
 426 opposite end, increasing this time coarsens the granularity of the
 427 short-term bandwidth and latency guarantees, especially if the
 428 following parameter is set to zero.
 429
 430 max_budget
 431 ----------
 432
 433 Maximum amount of service, measured in sectors, that can be provided
 434 to a BFQ queue once it is set in service (of course within the limits
 435 of the above timeout). According to what said in the description of
 436 the algorithm, larger values increase the throughput in proportion to
 437 the percentage of sequential I/O requests issued. The price of larger
 438 values is that they coarsen the granularity of short-term bandwidth
 439 and latency guarantees.
 440
 441 The default value is 0, which enables auto-tuning: BFQ sets max_budget
 442 to the maximum number of sectors that can be served during
 443 timeout_sync, according to the estimated peak rate.
 444
 445 For specific devices, some users have occasionally reported to have
 446 reached a higher throughput by setting max_budget explicitly, i.e., by
 447 setting max_budget to a higher value than 0. In particular, they have
 448 set max_budget to higher values than those to which BFQ would have set
 449 it with auto-tuning. An alternative way to achieve this goal is to
 450 just increase the value of timeout_sync, leaving max_budget equal to 0.
 451
 452 weights
 453 -------
 454
 455 Read-only parameter, used to show the weights of the currently active
 456 BFQ queues.
 457
 458
 459 4. Group scheduling with BFQ
 460 ============================
 461
 462 BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
 463 blkio and io. In particular, BFQ supports weight-based proportional
 464 share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
 465
 466 4-1 Service guarantees provided
 467 -------------------------------
 468
 469 With BFQ, proportional share means true proportional share of the
 470 device bandwidth, according to group weights. For example, a group
 471 with weight 200 gets twice the bandwidth, and not just twice the time,
 472 of a group with weight 100.
 473
 474 BFQ supports hierarchies (group trees) of any depth. Bandwidth is
 475 distributed among groups and processes in the expected way: for each
 476 group, the children of the group share the whole bandwidth of the
 477 group in proportion to their weights. In particular, this implies
 478 that, for each leaf group, every process of the group receives the
 479 same share of the whole group bandwidth, unless the ioprio of the
 480 process is modified.
 481
 482 The resource-sharing guarantee for a group may partially or totally
 483 switch from bandwidth to time, if providing bandwidth guarantees to
 484 the group lowers the throughput too much. This switch occurs on a
 485 per-process basis: if a process of a leaf group causes throughput loss
 486 if served in such a way to receive its share of the bandwidth, then
 487 BFQ switches back to just time-based proportional share for that
 488 process.
 489
 490 4-2 Interface
 491 -------------
 492
 493 To get proportional sharing of bandwidth with BFQ for a given device,
 494 BFQ must of course be the active scheduler for that device.
 495
 496 Within each group directory, the names of the files associated with
 497 BFQ-specific cgroup parameters and stats begin with the "bfq."
 498 prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
 499 BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
 500 parameter to set the weight of a group with BFQ is blkio.bfq.weight
 501 or io.bfq.weight.
 502
 503 Parameters to set
 504 -----------------
 505
 506 For each group, there is only the following parameter to set.
 507
 508 weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
 509 group inside its parent. Available values: 1..10000 (default 100). The
 510 linear mapping between ioprio and weights, described at the beginning
 511 of the tunable section, is still valid, but all weights higher than
 512 IOPRIO_BE_NR*10 are mapped to ioprio 0.
 513
 514 Recall that, if low-latency is set, then BFQ automatically raises the
 515 weight of the queues associated with interactive and soft real-time
 516 applications. Unset this tunable if you need/want to control weights.
 517
 518
 519 [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 520     Scheduler", Proceedings of the First Workshop on Mobile System
 521     Technologies (MST-2015), May 2015.
 522     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
 523
 524 [2] P. Valente and M. Andreolini, "Improving Application
 525     Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
 526     the 5th Annual International Systems and Storage Conference
 527     (SYSTOR '12), June 2012.
 528     Slightly extended version:
 529     http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
 530                                                         results.pdf