Documentation/x86/intel_rdt_ui.txt

User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits:
RDT (Resource Director Technology) Allocation - "rdt_a"
CAT (Cache Allocation Technology) - "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) - "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) - "mba"
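
A quick way to see which of the flags listed above a machine reports is
to filter the "flags" line of /proc/cpuinfo. The following is a minimal
sketch; the function name rdt_flags and its optional file argument are
illustrative and not part of the interface.

```shell
# Print the RDT-related flags found in a cpuinfo-style file, one per line.
# Returns non-zero if none of the flags is present.
rdt_flags() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Take the first "flags" line, put one word per line, keep RDT flags only.
    sed -n '/^flags/{p;q;}' "$cpuinfo" | tr ' \t' '\n\n' |
        grep -E -x 'rdt_a|cat_l[23]|cdp_l[23]|cqm_llc|cqm_occup_llc|cqm_mbm_total|cqm_mbm_local|mba'
}
```

Running "rdt_flags" without arguments inspects the running system. An
empty result suggests that the resctrl mount described below will fail.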

To use the feature, mount the file system:

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

The mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
"cdpl2": Enable code/data prioritization in L2 cache allocations.
"mba_MBps": Enable the MBA Software Controller (mba_sc) to specify MBA
 bandwidth in MBps.

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.

The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectories contain the following files
related to allocation:

"num_closids":          The number of CLOSIDs which are valid for this
                        resource. The kernel uses the smallest number of
                        CLOSIDs of all enabled resources as limit.

"cbm_mask":             The bitmask which is valid for this resource.
                        This mask is equivalent to 100%.

"min_cbm_bits":         The minimum number of consecutive bits which
                        must be set when writing a mask.

"shareable_bits":       Bitmask of the resource shareable with other
                        executing entities (e.g. I/O). The user can use
                        this when setting up exclusive cache partitions.
                        Note that some platforms support devices that
                        have their own settings for cache use which can
                        override these bits.

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":        The minimum memory bandwidth percentage which
                        the user can request.

"bandwidth_gran":       The granularity in which the memory bandwidth
                        percentage is allocated. The allocated
                        b/w percentage is rounded off to the next
                        control step available on the hardware. The
                        available bandwidth control steps are:
                        min_bandwidth + N * bandwidth_gran.

"delay_linear":         Indicates if the delay scale is linear or
                        non-linear. This field is purely
                        informational.

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":            The number of RMIDs available. This is the
                        upper bound for how many "CTRL_MON" + "MON"
                        groups can be created.

"mon_features":         Lists the monitoring events if
                        monitoring is enabled for the resource.

"max_threshold_occupancy":
                        Read/write file that provides the largest value
                        (in bytes) at which a previously used
                        LLC_occupancy counter can be considered for
                        re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.

        # echo L3:0=f7 > schemata
        bash: echo: write error: Invalid argument
        # cat info/last_cmd_status
        mask f7 has non-consecutive 1-bits
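
Scripts can wrap this pattern so that every failed write reports the
detailed reason. A minimal sketch follows; the helper name
resctrl_write and its argument order are made up for illustration.

```shell
# Write a value to a resctrl file; on failure, report last_cmd_status.
# $1: resctrl mount point, $2: file relative to it, $3: text to write.
resctrl_write() {
    root=$1; file=$2; value=$3
    if ! { printf '%s\n' "$value" > "$root/$file"; } 2>/dev/null; then
        # The errno alone says little, so show the kernel's explanation.
        echo "$file: $(cat "$root/info/last_cmd_status")" >&2
        return 1
    fi
}
```

For example, "resctrl_write /sys/fs/resctrl schemata L3:0=f7" would be
expected to fail with the message shown above on a system where that
mask is rejected.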

Resource alloc and monitor groups
---------------------------------

Resource groups are represented as directories in the resctrl file
system.  The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.

"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.

"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.
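
To show how the two files relate, the sketch below converts an
inclusive CPU range, as used by "cpus_list", into the hexadecimal
bitmask used by "cpus". The function name is illustrative, and the
sketch only handles CPU numbers below 63 because it relies on shell
integer arithmetic.

```shell
# Convert an inclusive CPU range to the hex bitmask form used by "cpus".
# $1: first CPU, $2: last CPU (both below 63).
cpu_range_to_mask() {
    first=$1; last=$2
    mask=0
    cpu=$first
    while [ "$cpu" -le "$last" ]; do
        mask=$((mask | (1 << cpu)))   # set the bit for this CPU
        cpu=$((cpu + 1))
    done
    printf '%x\n' "$mask"
}
```

"cpu_range_to_mask 4 7" prints "f0", so writing "f0" to "cpus" selects
the same CPUs as writing "4-7" to "cpus_list".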

When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01".  Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see the example section for more details on
        usage.

Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.
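
The three rules above amount to a simple precedence: task membership
first, CPU assignment second, default group last. The sketch below
spells that out. The function name and the use of the word "default"
for the root group are illustrative only.

```shell
# Print the group whose schemata applies, following rules 1-3 above.
# $1: group the task belongs to, $2: group that owns the CPU it runs on.
# "default" stands for the root group.
effective_group() {
    task_group=$1; cpu_group=$2
    if [ "$task_group" != default ]; then
        echo "$task_group"      # rule 1: task is in a non-default group
    elif [ "$cpu_group" != default ]; then
        echo "$cpu_group"       # rule 2: default-group task on an assigned CPU
    else
        echo default            # rule 3: fall back to the default group
    fi
}
```

For instance, "effective_group default p1" prints "p1": a task left in
the default group inherits the schemata of the group owning its CPU.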

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group,
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old
and new groups you will likely see that the old group is still showing
3 MB and the new group zero. When the task accesses locations still in
cache from before the move, the h/w does not update any counters. On a
busy system you will likely see the occupancy in the old group go down
as cache lines are evicted and re-used while the occupancy in the new
group rises as the task accesses memory and loads into the cache are
counted based on membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based
on the kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware, hence the creation of a "CTRL_MON" directory may fail if we
run out of either CLOSIDs or RMIDs and creation of a "MON" group may
fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of its previous user.
Hence such RMIDs are placed on a limbo list and checked again once the
cache occupancy has gone down. If at some point the system has a lot of
limbo RMIDs that are not yet ready to be used, the user may see an
-EBUSY during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
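
The lookup can be scripted by walking those sysfs files. The following
is a minimal sketch; the function name and the optional argument that
overrides the sysfs root are illustrative.

```shell
# Print "<cpu> <cache index> <cache id>" for every cache that exposes an id.
# $1 (optional): directory to use instead of /sys/devices/system/cpu.
list_cache_ids() {
    sysroot=${1:-/sys/devices/system/cpu}
    for f in "$sysroot"/cpu[0-9]*/cache/index[0-9]*/id; do
        [ -r "$f" ] || continue          # skip an unmatched glob pattern
        index_dir=${f%/id}
        cpu_dir=${index_dir%/cache/*}
        echo "${cpu_dir##*/} ${index_dir##*/} $(cat "$f")"
    done
}
```

Each indexN directory also has a "level" file, which tells whether the
reported ID belongs to an L2 or an L3 cache.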

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
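
Both properties are easy to check with integer arithmetic. The sketch
below tests a mask for contiguity and generates the masks for an equal
split. The function names are illustrative, and the sketch ignores
"min_cbm_bits".

```shell
# Succeed if the mask is non-zero and its 1 bits form one contiguous block.
cbm_is_contiguous() {
    m=$(($1))
    [ "$m" -ne 0 ] || return 1
    # Drop trailing zero bits; what remains must look like 0...0111.
    while [ $((m & 1)) -eq 0 ]; do m=$((m >> 1)); done
    [ $((m & (m + 1))) -eq 0 ]
}

# Print masks that split a cache with $1 mask bits into $2 equal parts.
cbm_partition() {
    bits=$1; parts=$2
    width=$((bits / parts))
    i=0
    while [ "$i" -lt "$parts" ]; do
        printf '0x%x\n' $(( ((1 << width) - 1) << (i * width) ))
        i=$((i + 1))
    done
}
```

"cbm_partition 20 4" prints 0x1f, 0x3e0, 0x7c00 and 0xf8000, one per
line, and "cbm_is_contiguous 0x5" fails as expected.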

Memory bandwidth Allocation and monitoring
------------------------------------------

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth. The fact that Memory bandwidth allocation (MBA) is a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external
bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a
workload with '20 threads, having 50% bandwidth, each consuming 5GBps'
consumes the max L3 bandwidth of 100GBps although the percentage value
specified is only 50% << 100%. Hence increasing the bandwidth percentage
will not yield any more bandwidth. This is because although the L2
external bandwidth still has capacity, the L3 external bandwidth is
fully used. Also note that this would be dependent on the number of
cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well.  The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure

        "actual bandwidth < user specified bandwidth".
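
To give a feel for such a feedback mechanism, the sketch below shows
one deliberately simplified adjustment step. It is not the kernel's
algorithm: it assumes the percentage is moved by one granularity step
per iteration and leaves out any hysteresis. The function name and
argument order are illustrative.

```shell
# One illustrative feedback step: print the next throttle percentage.
# $1: current percentage, $2: measured MBps, $3: user specified MBps,
# $4: min_bandwidth, $5: bandwidth_gran.
mba_sc_step() {
    pct=$1; measured=$2; target=$3; min_bw=$4; gran=$5
    if [ "$measured" -ge "$target" ]; then
        pct=$((pct - gran))     # at or over the limit: throttle harder
    else
        pct=$((pct + gran))     # below the limit: relax the throttle
    fi
    # Keep the result inside the range the hardware accepts.
    [ "$pct" -ge "$min_bw" ] || pct=$min_bw
    [ "$pct" -le 100 ] || pct=100
    echo "$pct"
}
```

For example, "mba_sc_step 50 1200 1000 10 10" prints "40", because the
measured 1200 MBps exceeds the requested 1000 MBps.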

By default, the schemata takes the bandwidth percentage values,
whereas the user can switch to the "MBA software controller" mode using
the mount option 'mba_MBps'. The schemata format is specified in the
sections below.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

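L2 schemata file details
------------------------
With L2 CDP disabled (no "cdpl2" mount option) the L2 schemata format
is:

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

When L2 CDP is enabled with the "cdpl2" mount option, L2 control is
split into separate data and code resources in the same way as L3:

        L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
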
Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.

        MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change.  E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
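
Because every line has the same "name:id=value;id=value" layout, a
single value can be picked out with standard text tools. The sketch
below reads schemata text from standard input; the function name is
illustrative. It tolerates leading whitespace before the resource
name, in case the kernel pads the names for alignment.

```shell
# Print the value set for one domain of one resource.
# $1: resource name (e.g. L3DATA), $2: domain (cache) id.
schemata_get() {
    resource=$1; domain=$2
    # Select the resource's line, split its value list on ';', then pick
    # the entry for the requested domain.
    sed -n "s/^[[:space:]]*$resource://p" | tr ';' '\n' |
        sed -n "s/^$domain=//p"
}
```

With the state shown above, "schemata_get L3DATA 2 < schemata" prints
"3c0".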

Examples for RDT allocation usage:

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If the MBA is specified in MBps (resctrl mounted with the "mba_MBps"
option) then the user can enter the max b/w in MBps rather than the
percentage values.

# echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a
max b/w of 1024MBps whereas on socket 1 they would use 500MBps.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 0 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 1 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assume min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.

# echo F0 > p0/cpus

4) Locking between applications

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the CBMs from the schemata of each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory CBMs
  3. Create a new directory
  4. Set the bits found in step 2 in the new directory's "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
# function-of is a placeholder for the logic that picks a free mask
mask=$(function-of output.txt)
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

void resctrl_take_shared_lock(int fd)
{
        int ret;

        /* take shared lock on resctrl filesystem */
        ret = flock(fd, LOCK_SH);
        if (ret) {
                perror("flock");
                exit(-1);
        }
}

void resctrl_take_exclusive_lock(int fd)
{
        int ret;

        /* take exclusive lock on resctrl filesystem */
        ret = flock(fd, LOCK_EX);
        if (ret) {
                perror("flock");
                exit(-1);
        }
}

void resctrl_release_lock(int fd)
{
        int ret;

        /* release lock on resctrl filesystem */
        ret = flock(fd, LOCK_UN);
        if (ret) {
                perror("flock");
                exit(-1);
        }
}

int main(void)
{
        int fd;

        fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
        if (fd == -1) {
                perror("open");
                exit(-1);
        }
        resctrl_take_shared_lock(fd);
        /* code to read directory contents */
        resctrl_release_lock(fd);

        resctrl_take_exclusive_lock(fd);
        /* code to read and write directory contents */
        resctrl_release_lock(fd);

        close(fd);
        return 0;
}

Examples for RDT Monitoring along with allocation usage:

Reading monitored data
----------------------
Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.

# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks

Fetch data (data shown in bytes)

# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000

The parent CTRL_MON group shows the aggregated data.

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000

Example 2 (Monitor a task from its creation)
---------
On a two socket machine (one L3 cache per socket)

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.

# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>

Fetch the data

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------

Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but cannot create CTRL_MON directories.
But the user can create different MON groups within the root group and
thereby monitor all tasks including kernel threads.

This can also be used to profile a job's cache footprint before
allocating it to different allocation groups.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02

# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
below it's apparent that the tasks are mostly doing work on
domain (socket) 0.

# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1

Move the cpus 4-7 over to p1

# echo f0 > p1/cpus

View the llc occupancy snapshot

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000