Documentation/block/bfq-iosched.txt

   1 BFQ (Budget Fair Queueing)
   2 ==========================
   3
   4 BFQ is a proportional-share I/O scheduler, with some extra
   5 low-latency capabilities. In addition to cgroups support (blkio or io
   6 controllers), BFQ's main features are:
   7 - BFQ guarantees a high system and application responsiveness, and a
   8   low latency for time-sensitive applications, such as audio or video
   9   players;
  10 - BFQ distributes bandwidth, and not just time, among processes or
  11   groups (switching back to time distribution when needed to keep
  12   throughput high).
  13
  14 In its default configuration, BFQ privileges latency over
  15 throughput. So, when needed for achieving a lower latency, BFQ builds
  16 schedules that may lead to a lower throughput. If your main or only
  17 goal, for a given device, is to achieve the maximum-possible
  18 throughput at all times, then do switch off all low-latency heuristics
  19 for that device, by setting low_latency to 0. Full details in Section 3.
  20
  21 On average CPUs, the current version of BFQ can handle devices
  22 performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
  23 reference, 30-50 KIOPS correspond to very high bandwidths with
  24 sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
  25 to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
  26 multi-queue devices.
  27
  28 The table of contents follow. Impatients can just jump to Section 3.
  29
  30 CONTENTS
  31
  32 1. When may BFQ be useful?
  33  1-1 Personal systems
  34  1-2 Server systems
  35 2. How does BFQ work?
  36 3. What are BFQ's tunable?
  37 4. BFQ group scheduling
  38  4-1 Service guarantees provided
  39  4-2 Interface
  40
  41 1. When may BFQ be useful?
  42 ==========================
  43
  44 BFQ provides the following benefits on personal and server systems.
  45
  46 1-1 Personal systems
  47 --------------------
  48
  49 Low latency for interactive applications
  50
  51 Regardless of the actual background workload, BFQ guarantees that, for
  52 interactive tasks, the storage device is virtually as responsive as if
  53 it was idle. For example, even if one or more of the following
  54 background workloads are being executed:
  55 - one or more large files are being read, written or copied,
  56 - a tree of source files is being compiled,
  57 - one or more virtual machines are performing I/O,
  58 - a software update is in progress,
  59 - indexing daemons are scanning filesystems and updating their
  60   databases,
  61 starting an application or loading a file from within an application
  62 takes about the same time as if the storage device was idle. As a
  63 comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
  64 applications experience high latencies, or even become unresponsive
  65 until the background workload terminates (also on SSDs).
  66
  67 Low latency for soft real-time applications
  68
  69 Also soft real-time applications, such as audio and video
  70 players/streamers, enjoy a low latency and a low drop rate, regardless
  71 of the background I/O workload. As a consequence, these applications
  72 do not suffer from almost any glitch due to the background workload.
  73
  74 Higher speed for code-development tasks
  75
  76 If some additional workload happens to be executed in parallel, then
  77 BFQ executes the I/O-related components of typical code-development
  78 tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
  79 NOOP or DEADLINE.
  80
  81 High throughput
  82
  83 On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
  84 up to 150% higher throughput than DEADLINE and NOOP, with all the
  85 sequential workloads considered in our tests. With random workloads,
  86 and with all the workloads on flash-based devices, BFQ achieves,
  87 instead, about the same throughput as the other schedulers.
  88
  89 Strong fairness, bandwidth and delay guarantees
  90
  91 BFQ distributes the device throughput, and not just the device time,
  92 among I/O-bound applications in proportion their weights, with any
  93 workload and regardless of the device parameters. From these bandwidth
  94 guarantees, it is possible to compute tight per-I/O-request delay
  95 guarantees by a simple formula. If not configured for strict service
  96 guarantees, BFQ switches to time-based resource sharing (only) for
  97 applications that would otherwise cause a throughput loss.
  98
  99 1-2 Server systems
 100 ------------------
 101
 102 Most benefits for server systems follow from the same service
 103 properties as above. In particular, regardless of whether additional,
 104 possibly heavy workloads are being served, BFQ guarantees:
 105
 106 . audio and video-streaming with zero or very low jitter and drop
 107   rate;
 108
 109 . fast retrieval of WEB pages and embedded objects;
 110
 111 . real-time recording of data in live-dumping applications (e.g.,
 112   packet logging);
 113
 114 . responsiveness in local and remote access to a server.
 115
 116
 117 2. How does BFQ work?
 118 =====================
 119
 120 BFQ is a proportional-share I/O scheduler, whose general structure,
 121 plus a lot of code, are borrowed from CFQ.
 122
 123 - Each process doing I/O on a device is associated with a weight and a
 124   (bfq_)queue.
 125
 126 - BFQ grants exclusive access to the device, for a while, to one queue
 127   (process) at a time, and implements this service model by
 128   associating every queue with a budget, measured in number of
 129   sectors.
 130
 131   - After a queue is granted access to the device, the budget of the
 132     queue is decremented, on each request dispatch, by the size of the
 133     request.
 134
 135   - The in-service queue is expired, i.e., its service is suspended,
 136     only if one of the following events occurs: 1) the queue finishes
 137     its budget, 2) the queue empties, 3) a "budget timeout" fires.
 138
 139     - The budget timeout prevents processes doing random I/O from
 140       holding the device for too long and dramatically reducing
 141       throughput.
 142
 143     - Actually, as in CFQ, a queue associated with a process issuing
 144       sync requests may not be expired immediately when it empties. In
 145       contrast, BFQ may idle the device for a short time interval,
 146       giving the process the chance to go on being served if it issues
 147       a new request in time. Device idling typically boosts the
 148       throughput on rotational devices, if processes do synchronous
 149       and sequential I/O. In addition, under BFQ, device idling is
 150       also instrumental in guaranteeing the desired throughput
 151       fraction to processes issuing sync requests (see the description
 152       of the slice_idle tunable in this document, or [1, 2], for more
 153       details).
 154
 155       - With respect to idling for service guarantees, if several
 156         processes are competing for the device at the same time, but
 157         all processes (and groups, after the following commit) have
 158         the same weight, then BFQ guarantees the expected throughput
 159         distribution without ever idling the device. Throughput is
 160         thus as high as possible in this common scenario.
 161
 162   - If low-latency mode is enabled (default configuration), BFQ
 163     executes some special heuristics to detect interactive and soft
 164     real-time applications (e.g., video or audio players/streamers),
 165     and to reduce their latency. The most important action taken to
 166     achieve this goal is to give to the queues associated with these
 167     applications more than their fair share of the device
 168     throughput. For brevity, we call just "weight-raising" the whole
 169     sets of actions taken by BFQ to privilege these queues. In
 170     particular, BFQ provides a milder form of weight-raising for
 171     interactive applications, and a stronger form for soft real-time
 172     applications.
 173
 174   - BFQ automatically deactivates idling for queues born in a burst of
 175     queue creations. In fact, these queues are usually associated with
 176     the processes of applications and services that benefit mostly
 177     from a high throughput. Examples are systemd during boot, or git
 178     grep.
 179
 180   - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
 181     performing random I/O that becomes mostly sequential if
 182     merged. Differently from CFQ, BFQ achieves this goal with a more
 183     reactive mechanism, called Early Queue Merge (EQM). EQM is so
 184     responsive in detecting interleaved I/O (cooperating processes),
 185     that it enables BFQ to achieve a high throughput, by queue
 186     merging, even for queues for which CFQ needs a different
 187     mechanism, preemption, to get a high throughput. As such EQM is a
 188     unified mechanism to achieve a high throughput with interleaved
 189     I/O.
 190
 191   - Queues are scheduled according to a variant of WF2Q+, named
 192     B-WF2Q+, and implemented using an augmented rb-tree to preserve an
 193     O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
 194     also ready for hierarchical scheduling. However, for a cleaner
 195     logical breakdown, the code that enables and completes
 196     hierarchical support is provided in the next commit, which focuses
 197     exactly on this feature.
 198
 199   - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
 200     perfectly fair, and smooth service. In particular, B-WF2Q+
 201     guarantees that each queue receives a fraction of the device
 202     throughput proportional to its weight, even if the throughput
 203     fluctuates, and regardless of: the device parameters, the current
 204     workload and the budgets assigned to the queue.
 205
 206   - The last, budget-independence, property (although probably
 207     counterintuitive in the first place) is definitely beneficial, for
 208     the following reasons:
 209
 210     - First, with any proportional-share scheduler, the maximum
 211       deviation with respect to an ideal service is proportional to
 212       the maximum budget (slice) assigned to queues. As a consequence,
 213       BFQ can keep this deviation tight not only because of the
 214       accurate service of B-WF2Q+, but also because BFQ *does not*
 215       need to assign a larger budget to a queue to let the queue
 216       receive a higher fraction of the device throughput.
 217
 218     - Second, BFQ is free to choose, for every process (queue), the
 219       budget that best fits the needs of the process, or best
 220       leverages the I/O pattern of the process. In particular, BFQ
 221       updates queue budgets with a simple feedback-loop algorithm that
 222       allows a high throughput to be achieved, while still providing
 223       tight latency guarantees to time-sensitive applications. When
 224       the in-service queue expires, this algorithm computes the next
 225       budget of the queue so as to:
 226
 227       - Let large budgets be eventually assigned to the queues
 228         associated with I/O-bound applications performing sequential
 229         I/O: in fact, the longer these applications are served once
 230         got access to the device, the higher the throughput is.
 231
 232       - Let small budgets be eventually assigned to the queues
 233         associated with time-sensitive applications (which typically
 234         perform sporadic and short I/O), because, the smaller the
 235         budget assigned to a queue waiting for service is, the sooner
 236         B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
 237
 238 - If several processes are competing for the device at the same time,
 239   but all processes and groups have the same weight, then BFQ
 240   guarantees the expected throughput distribution without ever idling
 241   the device. It uses preemption instead. Throughput is then much
 242   higher in this common scenario.
 243
 244 - ioprio classes are served in strict priority order, i.e.,
 245   lower-priority queues are not served as long as there are
 246   higher-priority queues.  Among queues in the same class, the
 247   bandwidth is distributed in proportion to the weight of each
 248   queue. A very thin extra bandwidth is however guaranteed to
 249   the Idle class, to prevent it from starving.
 250
 251
 252 3. What are BFQ's tunable?
 253 ==========================
 254
 255 The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
 256 fifo_expire_sync below are the same as in CFQ. Their description is
 257 just copied from that for CFQ. Some considerations in the description
 258 of slice_idle are copied from CFQ too.
 259
 260 per-process ioprio and weight
 261 -----------------------------
 262
 263 Unless the cgroups interface is used (see "4. BFQ group scheduling"),
 264 weights can be assigned to processes only indirectly, through I/O
 265 priorities, and according to the relation:
 266 weight = (IOPRIO_BE_NR - ioprio) * 10.
 267
 268 Beware that, if low-latency is set, then BFQ automatically raises the
 269 weight of the queues associated with interactive and soft real-time
 270 applications. Unset this tunable if you need/want to control weights.
 271
 272 slice_idle
 273 ----------
 274
 275 This parameter specifies how long BFQ should idle for next I/O
 276 request, when certain sync BFQ queues become empty. By default
 277 slice_idle is a non-zero value. Idling has a double purpose: boosting
 278 throughput and making sure that the desired throughput distribution is
 279 respected (see the description of how BFQ works, and, if needed, the
 280 papers referred there).
 281
 282 As for throughput, idling can be very helpful on highly seeky media
 283 like single spindle SATA/SAS disks where we can cut down on overall
 284 number of seeks and see improved throughput.
 285
 286 Setting slice_idle to 0 will remove all the idling on queues and one
 287 should see an overall improved throughput on faster storage devices
 288 like multiple SATA/SAS disks in hardware RAID configuration.
 289
 290 So depending on storage and workload, it might be useful to set
 291 slice_idle=0.  In general for SATA/SAS disks and software RAID of
 292 SATA/SAS disks keeping slice_idle enabled should be useful. For any
 293 configurations where there are multiple spindles behind single LUN
 294 (Host based hardware RAID controller or for storage arrays), setting
 295 slice_idle=0 might end up in better throughput and acceptable
 296 latencies.
 297
 298 Idling is however necessary to have service guarantees enforced in
 299 case of differentiated weights or differentiated I/O-request lengths.
 300 To see why, suppose that a given BFQ queue A must get several I/O
 301 requests served for each request served for another queue B. Idling
 302 ensures that, if A makes a new I/O request slightly after becoming
 303 empty, then no request of B is dispatched in the middle, and thus A
 304 does not lose the possibility to get more than one request dispatched
 305 before the next request of B is dispatched. Note that idling
 306 guarantees the desired differentiated treatment of queues only in
 307 terms of I/O-request dispatches. To guarantee that the actual service
 308 order then corresponds to the dispatch order, the strict_guarantees
 309 tunable must be set too.
 310
 311 There is an important flipside for idling: apart from the above cases
 312 where it is beneficial also for throughput, idling can severely impact
 313 throughput. One important case is random workload. Because of this
 314 issue, BFQ tends to avoid idling as much as possible, when it is not
 315 beneficial also for throughput. As a consequence of this behavior, and
 316 of further issues described for the strict_guarantees tunable,
 317 short-term service guarantees may be occasionally violated. And, in
 318 some cases, these guarantees may be more important than guaranteeing
 319 maximum throughput. For example, in video playing/streaming, a very
 320 low drop rate may be more important than maximum throughput. In these
 321 cases, consider setting the strict_guarantees parameter.
 322
 323 strict_guarantees
 324 -----------------
 325
 326 If this parameter is set (default: unset), then BFQ
 327
 328 - always performs idling when the in-service queue becomes empty;
 329
 330 - forces the device to serve one I/O request at a time, by dispatching a
 331   new request only if there is no outstanding request.
 332
 333 In the presence of differentiated weights or I/O-request sizes, both
 334 the above conditions are needed to guarantee that every BFQ queue
 335 receives its allotted share of the bandwidth. The first condition is
 336 needed for the reasons explained in the description of the slice_idle
 337 tunable.  The second condition is needed because all modern storage
 338 devices reorder internally-queued requests, which may trivially break
 339 the service guarantees enforced by the I/O scheduler.
 340
 341 Setting strict_guarantees may evidently affect throughput.
 342
 343 back_seek_max
 344 -------------
 345
 346 This specifies, given in Kbytes, the maximum "distance" for backward seeking.
 347 The distance is the amount of space from the current head location to the
 348 sectors that are backward in terms of distance.
 349
 350 This parameter allows the scheduler to anticipate requests in the "backward"
 351 direction and consider them as being the "next" if they are within this
 352 distance from the current head location.
 353
 354 back_seek_penalty
 355 -----------------
 356
 357 This parameter is used to compute the cost of backward seeking. If the
 358 backward distance of request is just 1/back_seek_penalty from a "front"
 359 request, then the seeking cost of two requests is considered equivalent.
 360
 361 So scheduler will not bias toward one or the other request (otherwise scheduler
 362 will bias toward front request). Default value of back_seek_penalty is 2.
 363
 364 fifo_expire_async
 365 -----------------
 366
 367 This parameter is used to set the timeout of asynchronous requests. Default
 368 value of this is 248ms.
 369
 370 fifo_expire_sync
 371 ----------------
 372
 373 This parameter is used to set the timeout of synchronous requests. Default
 374 value of this is 124ms. In case to favor synchronous requests over asynchronous
 375 one, this value should be decreased relative to fifo_expire_async.
 376
 377 low_latency
 378 -----------
 379
 380 This parameter is used to enable/disable BFQ's low latency mode. By
 381 default, low latency mode is enabled. If enabled, interactive and soft
 382 real-time applications are privileged and experience a lower latency,
 383 as explained in more detail in the description of how BFQ works.
 384
 385 DISABLE this mode if you need full control on bandwidth
 386 distribution. In fact, if it is enabled, then BFQ automatically
 387 increases the bandwidth share of privileged applications, as the main
 388 means to guarantee a lower latency to them.
 389
 390 In addition, as already highlighted at the beginning of this document,
 391 DISABLE this mode if your only goal is to achieve a high throughput.
 392 In fact, privileging the I/O of some application over the rest may
 393 entail a lower throughput. To achieve the highest-possible throughput
 394 on a non-rotational device, setting slice_idle to 0 may be needed too
 395 (at the cost of giving up any strong guarantee on fairness and low
 396 latency).
 397
 398 timeout_sync
 399 ------------
 400
 401 Maximum amount of device time that can be given to a task (queue) once
 402 it has been selected for service. On devices with costly seeks,
 403 increasing this time usually increases maximum throughput. On the
 404 opposite end, increasing this time coarsens the granularity of the
 405 short-term bandwidth and latency guarantees, especially if the
 406 following parameter is set to zero.
 407
 408 max_budget
 409 ----------
 410
 411 Maximum amount of service, measured in sectors, that can be provided
 412 to a BFQ queue once it is set in service (of course within the limits
 413 of the above timeout). According to what said in the description of
 414 the algorithm, larger values increase the throughput in proportion to
 415 the percentage of sequential I/O requests issued. The price of larger
 416 values is that they coarsen the granularity of short-term bandwidth
 417 and latency guarantees.
 418
 419 The default value is 0, which enables auto-tuning: BFQ sets max_budget
 420 to the maximum number of sectors that can be served during
 421 timeout_sync, according to the estimated peak rate.
 422
 423 weights
 424 -------
 425
 426 Read-only parameter, used to show the weights of the currently active
 427 BFQ queues.
 428
 429
 430 wr_ tunables
 431 ------------
 432
 433 BFQ exports a few parameters to control/tune the behavior of
 434 low-latency heuristics.
 435
 436 wr_coeff
 437
 438 Factor by which the weight of a weight-raised queue is multiplied. If
 439 the queue is deemed soft real-time, then the weight is further
 440 multiplied by an additional, constant factor.
 441
 442 wr_max_time
 443
 444 Maximum duration of a weight-raising period for an interactive task
 445 (ms). If set to zero (default value), then this value is computed
 446 automatically, as a function of the peak rate of the device. In any
 447 case, when the value of this parameter is read, it always reports the
 448 current duration, regardless of whether it has been set manually or
 449 computed automatically.
 450
 451 wr_max_softrt_rate
 452
 453 Maximum service rate below which a queue is deemed to be associated
 454 with a soft real-time application, and is then weight-raised
 455 accordingly (sectors/sec).
 456
 457 wr_min_idle_time
 458
 459 Minimum idle period after which interactive weight-raising may be
 460 reactivated for a queue (in ms).
 461
 462 wr_rt_max_time
 463
 464 Maximum weight-raising duration for soft real-time queues (in ms). The
 465 start time from which this duration is considered is automatically
 466 moved forward if the queue is detected to be still soft real-time
 467 before the current soft real-time weight-raising period finishes.
 468
 469 wr_min_inter_arr_async
 470
 471 Minimum period between I/O request arrivals after which weight-raising
 472 may be reactivated for an already busy async queue (in ms).
 473
 474
 475 4. Group scheduling with BFQ
 476 ============================
 477
 478 BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
 479 blkio and io. In particular, BFQ supports weight-based proportional
 480 share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
 481
 482 4-1 Service guarantees provided
 483 -------------------------------
 484
 485 With BFQ, proportional share means true proportional share of the
 486 device bandwidth, according to group weights. For example, a group
 487 with weight 200 gets twice the bandwidth, and not just twice the time,
 488 of a group with weight 100.
 489
 490 BFQ supports hierarchies (group trees) of any depth. Bandwidth is
 491 distributed among groups and processes in the expected way: for each
 492 group, the children of the group share the whole bandwidth of the
 493 group in proportion to their weights. In particular, this implies
 494 that, for each leaf group, every process of the group receives the
 495 same share of the whole group bandwidth, unless the ioprio of the
 496 process is modified.
 497
 498 The resource-sharing guarantee for a group may partially or totally
 499 switch from bandwidth to time, if providing bandwidth guarantees to
 500 the group lowers the throughput too much. This switch occurs on a
 501 per-process basis: if a process of a leaf group causes throughput loss
 502 if served in such a way to receive its share of the bandwidth, then
 503 BFQ switches back to just time-based proportional share for that
 504 process.
 505
 506 4-2 Interface
 507 -------------
 508
 509 To get proportional sharing of bandwidth with BFQ for a given device,
 510 BFQ must of course be the active scheduler for that device.
 511
 512 Within each group directory, the names of the files associated with
 513 BFQ-specific cgroup parameters and stats begin with the "bfq."
 514 prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
 515 BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
 516 parameter to set the weight of a group with BFQ is blkio.bfq.weight
 517 or io.bfq.weight.
 518
 519 Parameters to set
 520 -----------------
 521
 522 For each group, there is only the following parameter to set.
 523
 524 weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
 525 group inside its parent. Available values: 1..10000 (default 100). The
 526 linear mapping between ioprio and weights, described at the beginning
 527 of the tunable section, is still valid, but all weights higher than
 528 IOPRIO_BE_NR*10 are mapped to ioprio 0.
 529
 530 Recall that, if low-latency is set, then BFQ automatically raises the
 531 weight of the queues associated with interactive and soft real-time
 532 applications. Unset this tunable if you need/want to control weights.
 533
 534
 535 [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 536     Scheduler", Proceedings of the First Workshop on Mobile System
 537     Technologies (MST-2015), May 2015.
 538     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
 539
 540 [2] P. Valente and M. Andreolini, "Improving Application
 541     Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
 542     the 5th Annual International Systems and Storage Conference
 543     (SYSTOR '12), June 2012.
 544     Slightly extended version:
 545     http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
 546                                                         results.pdf