Documentation/trace/ftrace.rst

   1 ========================
   2 ftrace - Function Tracer
   3 ========================
   4
   5 Copyright 2008 Red Hat Inc.
   6
   7 :Author:   Steven Rostedt <srostedt@redhat.com>
   8 :License:  The GNU Free Documentation License, Version 1.2
   9           (dual licensed under the GPL v2)
  10 :Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
  11                       John Kacur, and David Teigland.
  12
  13 - Written for: 2.6.28-rc2
  14 - Updated for: 3.10
  15 - Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
  16 - Converted to rst format - Changbin Du <changbin.du@intel.com>
  17
  18 Introduction
  19 ------------
  20
  21 Ftrace is an internal tracer designed to help out developers and
  22 designers of systems to find what is going on inside the kernel.
  23 It can be used for debugging or analyzing latencies and
  24 performance issues that take place outside of user-space.
  25
  26 Although ftrace is typically considered the function tracer, it
  27 is really a frame work of several assorted tracing utilities.
  28 There's latency tracing to examine what occurs between interrupts
  29 disabled and enabled, as well as for preemption and from a time
  30 a task is woken to the task is actually scheduled in.
  31
  32 One of the most common uses of ftrace is the event tracing.
  33 Through out the kernel is hundreds of static event points that
  34 can be enabled via the tracefs file system to see what is
  35 going on in certain parts of the kernel.
  36
  37 See events.txt for more information.
  38
  39
  40 Implementation Details
  41 ----------------------
  42
  43 See :doc:`ftrace-design` for details for arch porters and such.
  44
  45
  46 The File System
  47 ---------------
  48
  49 Ftrace uses the tracefs file system to hold the control files as
  50 well as the files to display output.
  51
  52 When tracefs is configured into the kernel (which selecting any ftrace
  53 option will do) the directory /sys/kernel/tracing will be created. To mount
  54 this directory, you can add to your /etc/fstab file::
  55
  56  tracefs       /sys/kernel/tracing       tracefs defaults        0       0
  57
  58 Or you can mount it at run time with::
  59
  60  mount -t tracefs nodev /sys/kernel/tracing
  61
  62 For quicker access to that directory you may want to make a soft link to
  63 it::
  64
  65  ln -s /sys/kernel/tracing /tracing
  66
  67 .. attention::
  68
  69   Before 4.1, all ftrace tracing control files were within the debugfs
  70   file system, which is typically located at /sys/kernel/debug/tracing.
  71   For backward compatibility, when mounting the debugfs file system,
  72   the tracefs file system will be automatically mounted at:
  73
  74   /sys/kernel/debug/tracing
  75
  76   All files located in the tracefs file system will be located in that
  77   debugfs file system directory as well.
  78
  79 .. attention::
  80
  81   Any selected ftrace option will also create the tracefs file system.
  82   The rest of the document will assume that you are in the ftrace directory
  83   (cd /sys/kernel/tracing) and will only concentrate on the files within that
  84   directory and not distract from the content with the extended
  85   "/sys/kernel/tracing" path name.
  86
  87 That's it! (assuming that you have ftrace configured into your kernel)
  88
  89 After mounting tracefs you will have access to the control and output files
  90 of ftrace. Here is a list of some of the key files:
  91
  92
  93  Note: all time values are in microseconds.
  94
  95   current_tracer:
  96
  97         This is used to set or display the current tracer
  98         that is configured.
  99
 100   available_tracers:
 101
 102         This holds the different types of tracers that
 103         have been compiled into the kernel. The
 104         tracers listed here can be configured by
 105         echoing their name into current_tracer.
 106
 107   tracing_on:
 108
 109         This sets or displays whether writing to the trace
 110         ring buffer is enabled. Echo 0 into this file to disable
 111         the tracer or 1 to enable it. Note, this only disables
 112         writing to the ring buffer, the tracing overhead may
 113         still be occurring.
 114
 115         The kernel function tracing_off() can be used within the
 116         kernel to disable writing to the ring buffer, which will
 117         set this file to "0". User space can re-enable tracing by
 118         echoing "1" into the file.
 119
 120         Note, the function and event trigger "traceoff" will also
 121         set this file to zero and stop tracing. Which can also
 122         be re-enabled by user space using this file.
 123
 124   trace:
 125
 126         This file holds the output of the trace in a human
 127         readable format (described below). Note, tracing is temporarily
 128         disabled while this file is being read (opened).
 129
 130   trace_pipe:
 131
 132         The output is the same as the "trace" file but this
 133         file is meant to be streamed with live tracing.
 134         Reads from this file will block until new data is
 135         retrieved.  Unlike the "trace" file, this file is a
 136         consumer. This means reading from this file causes
 137         sequential reads to display more current data. Once
 138         data is read from this file, it is consumed, and
 139         will not be read again with a sequential read. The
 140         "trace" file is static, and if the tracer is not
 141         adding more data, it will display the same
 142         information every time it is read. This file will not
 143         disable tracing while being read.
 144
 145   trace_options:
 146
 147         This file lets the user control the amount of data
 148         that is displayed in one of the above output
 149         files. Options also exist to modify how a tracer
 150         or events work (stack traces, timestamps, etc).
 151
 152   options:
 153
 154         This is a directory that has a file for every available
 155         trace option (also in trace_options). Options may also be set
 156         or cleared by writing a "1" or "0" respectively into the
 157         corresponding file with the option name.
 158
 159   tracing_max_latency:
 160
 161         Some of the tracers record the max latency.
 162         For example, the maximum time that interrupts are disabled.
 163         The maximum time is saved in this file. The max trace will also be
 164         stored, and displayed by "trace". A new max trace will only be
 165         recorded if the latency is greater than the value in this file
 166         (in microseconds).
 167
 168         By echoing in a time into this file, no latency will be recorded
 169         unless it is greater than the time in this file.
 170
 171   tracing_thresh:
 172
 173         Some latency tracers will record a trace whenever the
 174         latency is greater than the number in this file.
 175         Only active when the file contains a number greater than 0.
 176         (in microseconds)
 177
 178   buffer_size_kb:
 179
 180         This sets or displays the number of kilobytes each CPU
 181         buffer holds. By default, the trace buffers are the same size
 182         for each CPU. The displayed number is the size of the
 183         CPU buffer and not total size of all buffers. The
 184         trace buffers are allocated in pages (blocks of memory
 185         that the kernel uses for allocation, usually 4 KB in size).
 186         If the last page allocated has room for more bytes
 187         than requested, the rest of the page will be used,
 188         making the actual allocation bigger than requested or shown.
 189         ( Note, the size may not be a multiple of the page size
 190         due to buffer management meta-data. )
 191
 192         Buffer sizes for individual CPUs may vary
 193         (see "per_cpu/cpu0/buffer_size_kb" below), and if they do
 194         this file will show "X".
 195
 196   buffer_total_size_kb:
 197
 198         This displays the total combined size of all the trace buffers.
 199
 200   free_buffer:
 201
 202         If a process is performing tracing, and the ring buffer should be
 203         shrunk "freed" when the process is finished, even if it were to be
 204         killed by a signal, this file can be used for that purpose. On close
 205         of this file, the ring buffer will be resized to its minimum size.
 206         Having a process that is tracing also open this file, when the process
 207         exits its file descriptor for this file will be closed, and in doing so,
 208         the ring buffer will be "freed".
 209
 210         It may also stop tracing if disable_on_free option is set.
 211
 212   tracing_cpumask:
 213
 214         This is a mask that lets the user only trace on specified CPUs.
 215         The format is a hex string representing the CPUs.
 216
 217   set_ftrace_filter:
 218
 219         When dynamic ftrace is configured in (see the
 220         section below "dynamic ftrace"), the code is dynamically
 221         modified (code text rewrite) to disable calling of the
 222         function profiler (mcount). This lets tracing be configured
 223         in with practically no overhead in performance.  This also
 224         has a side effect of enabling or disabling specific functions
 225         to be traced. Echoing names of functions into this file
 226         will limit the trace to only those functions.
 227
 228         The functions listed in "available_filter_functions" are what
 229         can be written into this file.
 230
 231         This interface also allows for commands to be used. See the
 232         "Filter commands" section for more details.
 233
 234   set_ftrace_notrace:
 235
 236         This has an effect opposite to that of
 237         set_ftrace_filter. Any function that is added here will not
 238         be traced. If a function exists in both set_ftrace_filter
 239         and set_ftrace_notrace, the function will _not_ be traced.
 240
 241   set_ftrace_pid:
 242
 243         Have the function tracer only trace the threads whose PID are
 244         listed in this file.
 245
 246         If the "function-fork" option is set, then when a task whose
 247         PID is listed in this file forks, the child's PID will
 248         automatically be added to this file, and the child will be
 249         traced by the function tracer as well. This option will also
 250         cause PIDs of tasks that exit to be removed from the file.
 251
 252   set_event_pid:
 253
 254         Have the events only trace a task with a PID listed in this file.
 255         Note, sched_switch and sched_wake_up will also trace events
 256         listed in this file.
 257
 258         To have the PIDs of children of tasks with their PID in this file
 259         added on fork, enable the "event-fork" option. That option will also
 260         cause the PIDs of tasks to be removed from this file when the task
 261         exits.
 262
 263   set_graph_function:
 264
 265         Functions listed in this file will cause the function graph
 266         tracer to only trace these functions and the functions that
 267         they call. (See the section "dynamic ftrace" for more details).
 268
 269   set_graph_notrace:
 270
 271         Similar to set_graph_function, but will disable function graph
 272         tracing when the function is hit until it exits the function.
 273         This makes it possible to ignore tracing functions that are called
 274         by a specific function.
 275
 276   available_filter_functions:
 277
 278         This lists the functions that ftrace has processed and can trace.
 279         These are the function names that you can pass to
 280         "set_ftrace_filter" or "set_ftrace_notrace".
 281         (See the section "dynamic ftrace" below for more details.)
 282
 283   dyn_ftrace_total_info:
 284
 285         This file is for debugging purposes. The number of functions that
 286         have been converted to nops and are available to be traced.
 287
 288   enabled_functions:
 289
 290         This file is more for debugging ftrace, but can also be useful
 291         in seeing if any function has a callback attached to it.
 292         Not only does the trace infrastructure use ftrace function
 293         trace utility, but other subsystems might too. This file
 294         displays all functions that have a callback attached to them
 295         as well as the number of callbacks that have been attached.
 296         Note, a callback may also call multiple functions which will
 297         not be listed in this count.
 298
 299         If the callback registered to be traced by a function with
 300         the "save regs" attribute (thus even more overhead), a 'R'
 301         will be displayed on the same line as the function that
 302         is returning registers.
 303
 304         If the callback registered to be traced by a function with
 305         the "ip modify" attribute (thus the regs->ip can be changed),
 306         an 'I' will be displayed on the same line as the function that
 307         can be overridden.
 308
 309         If the architecture supports it, it will also show what callback
 310         is being directly called by the function. If the count is greater
 311         than 1 it most likely will be ftrace_ops_list_func().
 312
 313         If the callback of the function jumps to a trampoline that is
 314         specific to a the callback and not the standard trampoline,
 315         its address will be printed as well as the function that the
 316         trampoline calls.
 317
 318   function_profile_enabled:
 319
 320         When set it will enable all functions with either the function
 321         tracer, or if configured, the function graph tracer. It will
 322         keep a histogram of the number of functions that were called
 323         and if the function graph tracer was configured, it will also keep
 324         track of the time spent in those functions. The histogram
 325         content can be displayed in the files:
 326
 327         trace_stats/function<cpu> ( function0, function1, etc).
 328
 329   trace_stats:
 330
 331         A directory that holds different tracing stats.
 332
 333   kprobe_events:
 334
 335         Enable dynamic trace points. See kprobetrace.txt.
 336
 337   kprobe_profile:
 338
 339         Dynamic trace points stats. See kprobetrace.txt.
 340
 341   max_graph_depth:
 342
 343         Used with the function graph tracer. This is the max depth
 344         it will trace into a function. Setting this to a value of
 345         one will show only the first kernel function that is called
 346         from user space.
 347
 348   printk_formats:
 349
 350         This is for tools that read the raw format files. If an event in
 351         the ring buffer references a string, only a pointer to the string
 352         is recorded into the buffer and not the string itself. This prevents
 353         tools from knowing what that string was. This file displays the string
 354         and address for the string allowing tools to map the pointers to what
 355         the strings were.
 356
 357   saved_cmdlines:
 358
 359         Only the pid of the task is recorded in a trace event unless
 360         the event specifically saves the task comm as well. Ftrace
 361         makes a cache of pid mappings to comms to try to display
 362         comms for events. If a pid for a comm is not listed, then
 363         "<...>" is displayed in the output.
 364
 365         If the option "record-cmd" is set to "0", then comms of tasks
 366         will not be saved during recording. By default, it is enabled.
 367
 368   saved_cmdlines_size:
 369
 370         By default, 128 comms are saved (see "saved_cmdlines" above). To
 371         increase or decrease the amount of comms that are cached, echo
 372         in a the number of comms to cache, into this file.
 373
 374   saved_tgids:
 375
 376         If the option "record-tgid" is set, on each scheduling context switch
 377         the Task Group ID of a task is saved in a table mapping the PID of
 378         the thread to its TGID. By default, the "record-tgid" option is
 379         disabled.
 380
 381   snapshot:
 382
 383         This displays the "snapshot" buffer and also lets the user
 384         take a snapshot of the current running trace.
 385         See the "Snapshot" section below for more details.
 386
 387   stack_max_size:
 388
 389         When the stack tracer is activated, this will display the
 390         maximum stack size it has encountered.
 391         See the "Stack Trace" section below.
 392
 393   stack_trace:
 394
 395         This displays the stack back trace of the largest stack
 396         that was encountered when the stack tracer is activated.
 397         See the "Stack Trace" section below.
 398
 399   stack_trace_filter:
 400
 401         This is similar to "set_ftrace_filter" but it limits what
 402         functions the stack tracer will check.
 403
 404   trace_clock:
 405
 406         Whenever an event is recorded into the ring buffer, a
 407         "timestamp" is added. This stamp comes from a specified
 408         clock. By default, ftrace uses the "local" clock. This
 409         clock is very fast and strictly per cpu, but on some
 410         systems it may not be monotonic with respect to other
 411         CPUs. In other words, the local clocks may not be in sync
 412         with local clocks on other CPUs.
 413
 414         Usual clocks for tracing::
 415
 416           # cat trace_clock
 417           [local] global counter x86-tsc
 418
 419         The clock with the square brackets around it is the one in effect.
 420
 421         local:
 422                 Default clock, but may not be in sync across CPUs
 423
 424         global:
 425                 This clock is in sync with all CPUs but may
 426                 be a bit slower than the local clock.
 427
 428         counter:
 429                 This is not a clock at all, but literally an atomic
 430                 counter. It counts up one by one, but is in sync
 431                 with all CPUs. This is useful when you need to
 432                 know exactly the order events occurred with respect to
 433                 each other on different CPUs.
 434
 435         uptime:
 436                 This uses the jiffies counter and the time stamp
 437                 is relative to the time since boot up.
 438
 439         perf:
 440                 This makes ftrace use the same clock that perf uses.
 441                 Eventually perf will be able to read ftrace buffers
 442                 and this will help out in interleaving the data.
 443
 444         x86-tsc:
 445                 Architectures may define their own clocks. For
 446                 example, x86 uses its own TSC cycle clock here.
 447
 448         ppc-tb:
 449                 This uses the powerpc timebase register value.
 450                 This is in sync across CPUs and can also be used
 451                 to correlate events across hypervisor/guest if
 452                 tb_offset is known.
 453
 454         mono:
 455                 This uses the fast monotonic clock (CLOCK_MONOTONIC)
 456                 which is monotonic and is subject to NTP rate adjustments.
 457
 458         mono_raw:
 459                 This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
 460                 which is montonic but is not subject to any rate adjustments
 461                 and ticks at the same rate as the hardware clocksource.
 462
 463         boot:
 464                 Same as mono. Used to be a separate clock which accounted
 465                 for the time spent in suspend while CLOCK_MONOTONIC did
 466                 not.
 467
 468         To set a clock, simply echo the clock name into this file::
 469
 470           # echo global > trace_clock
 471
 472   trace_marker:
 473
 474         This is a very useful file for synchronizing user space
 475         with events happening in the kernel. Writing strings into
 476         this file will be written into the ftrace buffer.
 477
 478         It is useful in applications to open this file at the start
 479         of the application and just reference the file descriptor
 480         for the file::
 481
 482                 void trace_write(const char *fmt, ...)
 483                 {
 484                         va_list ap;
 485                         char buf[256];
 486                         int n;
 487
 488                         if (trace_fd < 0)
 489                                 return;
 490
 491                         va_start(ap, fmt);
 492                         n = vsnprintf(buf, 256, fmt, ap);
 493                         va_end(ap);
 494
 495                         write(trace_fd, buf, n);
 496                 }
 497
 498         start::
 499
 500                 trace_fd = open("trace_marker", WR_ONLY);
 501
 502   trace_marker_raw:
 503
 504         This is similar to trace_marker above, but is meant for for binary data
 505         to be written to it, where a tool can be used to parse the data
 506         from trace_pipe_raw.
 507
 508   uprobe_events:
 509
 510         Add dynamic tracepoints in programs.
 511         See uprobetracer.txt
 512
 513   uprobe_profile:
 514
 515         Uprobe statistics. See uprobetrace.txt
 516
 517   instances:
 518
 519         This is a way to make multiple trace buffers where different
 520         events can be recorded in different buffers.
 521         See "Instances" section below.
 522
 523   events:
 524
 525         This is the trace event directory. It holds event tracepoints
 526         (also known as static tracepoints) that have been compiled
 527         into the kernel. It shows what event tracepoints exist
 528         and how they are grouped by system. There are "enable"
 529         files at various levels that can enable the tracepoints
 530         when a "1" is written to them.
 531
 532         See events.txt for more information.
 533
 534   set_event:
 535
 536         By echoing in the event into this file, will enable that event.
 537
 538         See events.txt for more information.
 539
 540   available_events:
 541
 542         A list of events that can be enabled in tracing.
 543
 544         See events.txt for more information.
 545
 546   hwlat_detector:
 547
 548         Directory for the Hardware Latency Detector.
 549         See "Hardware Latency Detector" section below.
 550
 551   per_cpu:
 552
 553         This is a directory that contains the trace per_cpu information.
 554
 555   per_cpu/cpu0/buffer_size_kb:
 556
 557         The ftrace buffer is defined per_cpu. That is, there's a separate
 558         buffer for each CPU to allow writes to be done atomically,
 559         and free from cache bouncing. These buffers may have different
 560         size buffers. This file is similar to the buffer_size_kb
 561         file, but it only displays or sets the buffer size for the
 562         specific CPU. (here cpu0).
 563
 564   per_cpu/cpu0/trace:
 565
 566         This is similar to the "trace" file, but it will only display
 567         the data specific for the CPU. If written to, it only clears
 568         the specific CPU buffer.
 569
 570   per_cpu/cpu0/trace_pipe
 571
 572         This is similar to the "trace_pipe" file, and is a consuming
 573         read, but it will only display (and consume) the data specific
 574         for the CPU.
 575
 576   per_cpu/cpu0/trace_pipe_raw
 577
 578         For tools that can parse the ftrace ring buffer binary format,
 579         the trace_pipe_raw file can be used to extract the data
 580         from the ring buffer directly. With the use of the splice()
 581         system call, the buffer data can be quickly transferred to
 582         a file or to the network where a server is collecting the
 583         data.
 584
 585         Like trace_pipe, this is a consuming reader, where multiple
 586         reads will always produce different data.
 587
 588   per_cpu/cpu0/snapshot:
 589
 590         This is similar to the main "snapshot" file, but will only
 591         snapshot the current CPU (if supported). It only displays
 592         the content of the snapshot for a given CPU, and if
 593         written to, only clears this CPU buffer.
 594
 595   per_cpu/cpu0/snapshot_raw:
 596
 597         Similar to the trace_pipe_raw, but will read the binary format
 598         from the snapshot buffer for the given CPU.
 599
 600   per_cpu/cpu0/stats:
 601
 602         This displays certain stats about the ring buffer:
 603
 604         entries:
 605                 The number of events that are still in the buffer.
 606
 607         overrun:
 608                 The number of lost events due to overwriting when
 609                 the buffer was full.
 610
 611         commit overrun:
 612                 Should always be zero.
 613                 This gets set if so many events happened within a nested
 614                 event (ring buffer is re-entrant), that it fills the
 615                 buffer and starts dropping events.
 616
 617         bytes:
 618                 Bytes actually read (not overwritten).
 619
 620         oldest event ts:
 621                 The oldest timestamp in the buffer
 622
 623         now ts:
 624                 The current timestamp
 625
 626         dropped events:
 627                 Events lost due to overwrite option being off.
 628
 629         read events:
 630                 The number of events read.
 631
 632 The Tracers
 633 -----------
 634
 635 Here is the list of current tracers that may be configured.
 636
 637   "function"
 638
 639         Function call tracer to trace all kernel functions.
 640
 641   "function_graph"
 642
 643         Similar to the function tracer except that the
 644         function tracer probes the functions on their entry
 645         whereas the function graph tracer traces on both entry
 646         and exit of the functions. It then provides the ability
 647         to draw a graph of function calls similar to C code
 648         source.
 649
 650   "blk"
 651
 652         The block tracer. The tracer used by the blktrace user
 653         application.
 654
 655   "hwlat"
 656
 657         The Hardware Latency tracer is used to detect if the hardware
 658         produces any latency. See "Hardware Latency Detector" section
 659         below.
 660
 661   "irqsoff"
 662
 663         Traces the areas that disable interrupts and saves
 664         the trace with the longest max latency.
 665         See tracing_max_latency. When a new max is recorded,
 666         it replaces the old trace. It is best to view this
 667         trace with the latency-format option enabled, which
 668         happens automatically when the tracer is selected.
 669
 670   "preemptoff"
 671
 672         Similar to irqsoff but traces and records the amount of
 673         time for which preemption is disabled.
 674
 675   "preemptirqsoff"
 676
 677         Similar to irqsoff and preemptoff, but traces and
 678         records the largest time for which irqs and/or preemption
 679         is disabled.
 680
 681   "wakeup"
 682
 683         Traces and records the max latency that it takes for
 684         the highest priority task to get scheduled after
 685         it has been woken up.
 686         Traces all tasks as an average developer would expect.
 687
 688   "wakeup_rt"
 689
 690         Traces and records the max latency that it takes for just
 691         RT tasks (as the current "wakeup" does). This is useful
 692         for those interested in wake up timings of RT tasks.
 693
 694   "wakeup_dl"
 695
 696         Traces and records the max latency that it takes for
 697         a SCHED_DEADLINE task to be woken (as the "wakeup" and
 698         "wakeup_rt" does).
 699
 700   "mmiotrace"
 701
 702         A special tracer that is used to trace binary module.
 703         It will trace all the calls that a module makes to the
 704         hardware. Everything it writes and reads from the I/O
 705         as well.
 706
 707   "branch"
 708
 709         This tracer can be configured when tracing likely/unlikely
 710         calls within the kernel. It will trace when a likely and
 711         unlikely branch is hit and if it was correct in its prediction
 712         of being correct.
 713
 714   "nop"
 715
 716         This is the "trace nothing" tracer. To remove all
 717         tracers from tracing simply echo "nop" into
 718         current_tracer.
 719
 720
 721 Examples of using the tracer
 722 ----------------------------
 723
 724 Here are typical examples of using the tracers when controlling
 725 them only with the tracefs interface (without using any
 726 user-land utilities).
 727
 728 Output format:
 729 --------------
 730
 731 Here is an example of the output format of the file "trace"::
 732
 733   # tracer: function
 734   #
 735   # entries-in-buffer/entries-written: 140080/250280   #P:4
 736   #
 737   #                              _-----=> irqs-off
 738   #                             / _----=> need-resched
 739   #                            | / _---=> hardirq/softirq
 740   #                            || / _--=> preempt-depth
 741   #                            ||| /     delay
 742   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 743   #              | |       |   ||||       |         |
 744               bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
 745               bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
 746               bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
 747               sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
 748               bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
 749               bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
 750               bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
 751               bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
 752               bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
 753               sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
 754               ....
 755
 756 A header is printed with the tracer name that is represented by
 757 the trace. In this case the tracer is "function". Then it shows the
 758 number of events in the buffer as well as the total number of entries
 759 that were written. The difference is the number of entries that were
 760 lost due to the buffer filling up (250280 - 140080 = 110200 events
 761 lost).
 762
 763 The header explains the content of the events. Task name "bash", the task
 764 PID "1977", the CPU that it was running on "000", the latency format
 765 (explained below), the timestamp in <secs>.<usecs> format, the
 766 function name that was traced "sys_close" and the parent function that
 767 called this function "system_call_fastpath". The timestamp is the time
 768 at which the function was entered.
 769
 770 Latency trace format
 771 --------------------
 772
 773 When the latency-format option is enabled or when one of the latency
 774 tracers is set, the trace file gives somewhat more information to see
 775 why a latency happened. Here is a typical trace::
 776
 777   # tracer: irqsoff
 778   #
 779   # irqsoff latency trace v1.1.5 on 3.8.0-test+
 780   # --------------------------------------------------------------------
 781   # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
 782   #    -----------------
 783   #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
 784   #    -----------------
 785   #  => started at: __lock_task_sighand
 786   #  => ended at:   _raw_spin_unlock_irqrestore
 787   #
 788   #
 789   #                  _------=> CPU#
 790   #                 / _-----=> irqs-off
 791   #                | / _----=> need-resched
 792   #                || / _---=> hardirq/softirq
 793   #                ||| / _--=> preempt-depth
 794   #                |||| /     delay
 795   #  cmd     pid   ||||| time  |   caller
 796   #     \   /      |||||  \    |   /
 797         ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
 798         ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
 799         ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
 800         ps-6143    2d..1  306us : <stack trace>
 801    => trace_hardirqs_on_caller
 802    => trace_hardirqs_on
 803    => _raw_spin_unlock_irqrestore
 804    => do_task_stat
 805    => proc_tgid_stat
 806    => proc_single_show
 807    => seq_read
 808    => vfs_read
 809    => sys_read
 810    => system_call_fastpath
 811
 812
 813 This shows that the current tracer is "irqsoff" tracing the time
 814 for which interrupts were disabled. It gives the trace version (which
 815 never changes) and the version of the kernel upon which this was executed on
 816 (3.8). Then it displays the max latency in microseconds (259 us). The number
 817 of trace entries displayed and the total number (both are four: #4/4).
 818 VP, KP, SP, and HP are always zero and are reserved for later use.
 819 #P is the number of online CPUs (#P:4).
 820
 821 The task is the process that was running when the latency
 822 occurred. (ps pid: 6143).
 823
 824 The start and stop (the functions in which the interrupts were
 825 disabled and enabled respectively) that caused the latencies:
 826
 827   - __lock_task_sighand is where the interrupts were disabled.
 828   - _raw_spin_unlock_irqrestore is where they were enabled again.
 829
 830 The next lines after the header are the trace itself. The header
 831 explains which is which.
 832
 833   cmd: The name of the process in the trace.
 834
 835   pid: The PID of that process.
 836
 837   CPU#: The CPU which the process was running on.
 838
 839   irqs-off: 'd' interrupts are disabled. '.' otherwise.
 840         .. caution:: If the architecture does not support a way to
 841                 read the irq flags variable, an 'X' will always
 842                 be printed here.
 843
 844   need-resched:
 845         - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
 846         - 'n' only TIF_NEED_RESCHED is set,
 847         - 'p' only PREEMPT_NEED_RESCHED is set,
 848         - '.' otherwise.
 849
 850   hardirq/softirq:
 851         - 'Z' - NMI occurred inside a hardirq
 852         - 'z' - NMI is running
 853         - 'H' - hard irq occurred inside a softirq.
 854         - 'h' - hard irq is running
 855         - 's' - soft irq is running
 856         - '.' - normal context.
 857
 858   preempt-depth: The level of preempt_disabled
 859
 860 The above is mostly meaningful for kernel developers.
 861
 862   time:
 863         When the latency-format option is enabled, the trace file
 864         output includes a timestamp relative to the start of the
 865         trace. This differs from the output when latency-format
 866         is disabled, which includes an absolute timestamp.
 867
 868   delay:
 869         This is just to help catch your eye a bit better. And
 870         needs to be fixed to be only relative to the same CPU.
 871         The marks are determined by the difference between this
 872         current trace and the next trace.
 873
 874           - '$' - greater than 1 second
 875           - '@' - greater than 100 milisecond
 876           - '*' - greater than 10 milisecond
 877           - '#' - greater than 1000 microsecond
 878           - '!' - greater than 100 microsecond
 879           - '+' - greater than 10 microsecond
 880           - ' ' - less than or equal to 10 microsecond.
 881
 882   The rest is the same as the 'trace' file.
 883
 884   Note, the latency tracers will usually end with a back trace
 885   to easily find where the latency occurred.
 886
 887 trace_options
 888 -------------
 889
 890 The trace_options file (or the options directory) is used to control
 891 what gets printed in the trace output, or manipulate the tracers.
 892 To see what is available, simply cat the file::
 893
 894   cat trace_options
 895         print-parent
 896         nosym-offset
 897         nosym-addr
 898         noverbose
 899         noraw
 900         nohex
 901         nobin
 902         noblock
 903         trace_printk
 904         annotate
 905         nouserstacktrace
 906         nosym-userobj
 907         noprintk-msg-only
 908         context-info
 909         nolatency-format
 910         record-cmd
 911         norecord-tgid
 912         overwrite
 913         nodisable_on_free
 914         irq-info
 915         markers
 916         noevent-fork
 917         function-trace
 918         nofunction-fork
 919         nodisplay-graph
 920         nostacktrace
 921         nobranch
 922
 923 To disable one of the options, echo in the option prepended with
 924 "no"::
 925
 926   echo noprint-parent > trace_options
 927
 928 To enable an option, leave off the "no"::
 929
 930   echo sym-offset > trace_options
 931
 932 Here are the available options:
 933
 934   print-parent
 935         On function traces, display the calling (parent)
 936         function as well as the function being traced.
 937         ::
 938
 939           print-parent:
 940            bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
 941
 942           noprint-parent:
 943            bash-4000  [01]  1477.606694: simple_strtoul
 944
 945
 946   sym-offset
 947         Display not only the function name, but also the
 948         offset in the function. For example, instead of
 949         seeing just "ktime_get", you will see
 950         "ktime_get+0xb/0x20".
 951         ::
 952
 953           sym-offset:
 954            bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
 955
 956   sym-addr
 957         This will also display the function address as well
 958         as the function name.
 959         ::
 960
 961           sym-addr:
 962            bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
 963
 964   verbose
 965         This deals with the trace file when the
 966         latency-format option is enabled.
 967         ::
 968
 969             bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
 970             (+0.000ms): simple_strtoul (kstrtoul)
 971
 972   raw
 973         This will display raw numbers. This option is best for
 974         use with user applications that can translate the raw
 975         numbers better than having it done in the kernel.
 976
 977   hex
 978         Similar to raw, but the numbers will be in a hexadecimal format.
 979
 980   bin
 981         This will print out the formats in raw binary.
 982
 983   block
 984         When set, reading trace_pipe will not block when polled.
 985
 986   trace_printk
 987         Can disable trace_printk() from writing into the buffer.
 988
 989   annotate
 990         It is sometimes confusing when the CPU buffers are full
 991         and one CPU buffer had a lot of events recently, thus
 992         a shorter time frame, were another CPU may have only had
 993         a few events, which lets it have older events. When
 994         the trace is reported, it shows the oldest events first,
 995         and it may look like only one CPU ran (the one with the
 996         oldest events). When the annotate option is set, it will
 997         display when a new CPU buffer started::
 998
 999                           <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
1000                           <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
1001                           <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
1002                 ##### CPU 2 buffer started ####
1003                           <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
1004                           <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
1005                           <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
1006
1007   userstacktrace
1008         This option changes the trace. It records a
1009         stacktrace of the current user space thread after
1010         each trace event.
1011
1012   sym-userobj
1013         when user stacktrace are enabled, look up which
1014         object the address belongs to, and print a
1015         relative address. This is especially useful when
1016         ASLR is on, otherwise you don't get a chance to
1017         resolve the address to object/file/line after
1018         the app is no longer running
1019
1020         The lookup is performed when you read
1021         trace,trace_pipe. Example::
1022
1023                   a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
1024                   x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
1025
1026
1027   printk-msg-only
1028         When set, trace_printk()s will only show the format
1029         and not their parameters (if trace_bprintk() or
1030         trace_bputs() was used to save the trace_printk()).
1031
1032   context-info
1033         Show only the event data. Hides the comm, PID,
1034         timestamp, CPU, and other useful data.
1035
1036   latency-format
1037         This option changes the trace output. When it is enabled,
1038         the trace displays additional information about the
1039         latency, as described in "Latency trace format".
1040
1041   record-cmd
1042         When any event or tracer is enabled, a hook is enabled
1043         in the sched_switch trace point to fill comm cache
1044         with mapped pids and comms. But this may cause some
1045         overhead, and if you only care about pids, and not the
1046         name of the task, disabling this option can lower the
1047         impact of tracing. See "saved_cmdlines".
1048
1049   record-tgid
1050         When any event or tracer is enabled, a hook is enabled
1051         in the sched_switch trace point to fill the cache of
1052         mapped Thread Group IDs (TGID) mapping to pids. See
1053         "saved_tgids".
1054
1055   overwrite
1056         This controls what happens when the trace buffer is
1057         full. If "1" (default), the oldest events are
1058         discarded and overwritten. If "0", then the newest
1059         events are discarded.
1060         (see per_cpu/cpu0/stats for overrun and dropped)
1061
1062   disable_on_free
1063         When the free_buffer is closed, tracing will
1064         stop (tracing_on set to 0).
1065
1066   irq-info
1067         Shows the interrupt, preempt count, need resched data.
1068         When disabled, the trace looks like::
1069
1070                 # tracer: function
1071                 #
1072                 # entries-in-buffer/entries-written: 144405/9452052   #P:4
1073                 #
1074                 #           TASK-PID   CPU#      TIMESTAMP  FUNCTION
1075                 #              | |       |          |         |
1076                           <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1077                           <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
1078                           <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
1079
1080
1081   markers
1082         When set, the trace_marker is writable (only by root).
1083         When disabled, the trace_marker will error with EINVAL
1084         on write.
1085
1086   event-fork
1087         When set, tasks with PIDs listed in set_event_pid will have
1088         the PIDs of their children added to set_event_pid when those
1089         tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1090         their PIDs will be removed from the file.
1091
1092   function-trace
1093         The latency tracers will enable function tracing
1094         if this option is enabled (default it is). When
1095         it is disabled, the latency tracers do not trace
1096         functions. This keeps the overhead of the tracer down
1097         when performing latency tests.
1098
1099   function-fork
1100         When set, tasks with PIDs listed in set_ftrace_pid will
1101         have the PIDs of their children added to set_ftrace_pid
1102         when those tasks fork. Also, when tasks with PIDs in
1103         set_ftrace_pid exit, their PIDs will be removed from the
1104         file.
1105
1106   display-graph
1107         When set, the latency tracers (irqsoff, wakeup, etc) will
1108         use function graph tracing instead of function tracing.
1109
1110   stacktrace
1111         When set, a stack trace is recorded after any trace event
1112         is recorded.
1113
1114   branch
1115         Enable branch tracing with the tracer. This enables branch
1116         tracer along with the currently set tracer. Enabling this
1117         with the "nop" tracer is the same as just enabling the
1118         "branch" tracer.
1119
1120 .. tip:: Some tracers have their own options. They only appear in this
1121        file when the tracer is active. They always appear in the
1122        options directory.
1123
1124
1125 Here are the per tracer options:
1126
1127 Options for function tracer:
1128
1129   func_stack_trace
1130         When set, a stack trace is recorded after every
1131         function that is recorded. NOTE! Limit the functions
1132         that are recorded before enabling this, with
1133         "set_ftrace_filter" otherwise the system performance
1134         will be critically degraded. Remember to disable
1135         this option before clearing the function filter.
1136
1137 Options for function_graph tracer:
1138
1139  Since the function_graph tracer has a slightly different output
1140  it has its own options to control what is displayed.
1141
1142   funcgraph-overrun
1143         When set, the "overrun" of the graph stack is
1144         displayed after each function traced. The
1145         overrun, is when the stack depth of the calls
1146         is greater than what is reserved for each task.
1147         Each task has a fixed array of functions to
1148         trace in the call graph. If the depth of the
1149         calls exceeds that, the function is not traced.
1150         The overrun is the number of functions missed
1151         due to exceeding this array.
1152
1153   funcgraph-cpu
1154         When set, the CPU number of the CPU where the trace
1155         occurred is displayed.
1156
1157   funcgraph-overhead
1158         When set, if the function takes longer than
1159         A certain amount, then a delay marker is
1160         displayed. See "delay" above, under the
1161         header description.
1162
1163   funcgraph-proc
1164         Unlike other tracers, the process' command line
1165         is not displayed by default, but instead only
1166         when a task is traced in and out during a context
1167         switch. Enabling this options has the command
1168         of each process displayed at every line.
1169
1170   funcgraph-duration
1171         At the end of each function (the return)
1172         the duration of the amount of time in the
1173         function is displayed in microseconds.
1174
1175   funcgraph-abstime
1176         When set, the timestamp is displayed at each line.
1177
1178   funcgraph-irqs
1179         When disabled, functions that happen inside an
1180         interrupt will not be traced.
1181
1182   funcgraph-tail
1183         When set, the return event will include the function
1184         that it represents. By default this is off, and
1185         only a closing curly bracket "}" is displayed for
1186         the return of a function.
1187
1188   sleep-time
1189         When running function graph tracer, to include
1190         the time a task schedules out in its function.
1191         When enabled, it will account time the task has been
1192         scheduled out as part of the function call.
1193
1194   graph-time
1195         When running function profiler with function graph tracer,
1196         to include the time to call nested functions. When this is
1197         not set, the time reported for the function will only
1198         include the time the function itself executed for, not the
1199         time for functions that it called.
1200
1201 Options for blk tracer:
1202
1203   blk_classic
1204         Shows a more minimalistic output.
1205
1206
1207 irqsoff
1208 -------
1209
1210 When interrupts are disabled, the CPU can not react to any other
1211 external event (besides NMIs and SMIs). This prevents the timer
1212 interrupt from triggering or the mouse interrupt from letting
1213 the kernel know of a new mouse event. The result is a latency
1214 with the reaction time.
1215
1216 The irqsoff tracer tracks the time for which interrupts are
1217 disabled. When a new maximum latency is hit, the tracer saves
1218 the trace leading up to that latency point so that every time a
1219 new maximum is reached, the old saved trace is discarded and the
1220 new trace is saved.
1221
1222 To reset the maximum, echo 0 into tracing_max_latency. Here is
1223 an example::
1224
1225   # echo 0 > options/function-trace
1226   # echo irqsoff > current_tracer
1227   # echo 1 > tracing_on
1228   # echo 0 > tracing_max_latency
1229   # ls -ltr
1230   [...]
1231   # echo 0 > tracing_on
1232   # cat trace
1233   # tracer: irqsoff
1234   #
1235   # irqsoff latency trace v1.1.5 on 3.8.0-test+
1236   # --------------------------------------------------------------------
1237   # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1238   #    -----------------
1239   #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1240   #    -----------------
1241   #  => started at: run_timer_softirq
1242   #  => ended at:   run_timer_softirq
1243   #
1244   #
1245   #                  _------=> CPU#
1246   #                 / _-----=> irqs-off
1247   #                | / _----=> need-resched
1248   #                || / _---=> hardirq/softirq
1249   #                ||| / _--=> preempt-depth
1250   #                |||| /     delay
1251   #  cmd     pid   ||||| time  |   caller
1252   #     \   /      |||||  \    |   /
1253     <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
1254     <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
1255     <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
1256     <idle>-0       0dNs3   25us : <stack trace>
1257    => _raw_spin_unlock_irq
1258    => run_timer_softirq
1259    => __do_softirq
1260    => call_softirq
1261    => do_softirq
1262    => irq_exit
1263    => smp_apic_timer_interrupt
1264    => apic_timer_interrupt
1265    => rcu_idle_exit
1266    => cpu_idle
1267    => rest_init
1268    => start_kernel
1269    => x86_64_start_reservations
1270    => x86_64_start_kernel
1271
1272 Here we see that that we had a latency of 16 microseconds (which is
1273 very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1274 interrupts. The difference between the 16 and the displayed
1275 timestamp 25us occurred because the clock was incremented
1276 between the time of recording the max latency and the time of
1277 recording the function that had that latency.
1278
1279 Note the above example had function-trace not set. If we set
1280 function-trace, we get a much larger output::
1281
1282  with echo 1 > options/function-trace
1283
1284   # tracer: irqsoff
1285   #
1286   # irqsoff latency trace v1.1.5 on 3.8.0-test+
1287   # --------------------------------------------------------------------
1288   # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1289   #    -----------------
1290   #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1291   #    -----------------
1292   #  => started at: ata_scsi_queuecmd
1293   #  => ended at:   ata_scsi_queuecmd
1294   #
1295   #
1296   #                  _------=> CPU#
1297   #                 / _-----=> irqs-off
1298   #                | / _----=> need-resched
1299   #                || / _---=> hardirq/softirq
1300   #                ||| / _--=> preempt-depth
1301   #                |||| /     delay
1302   #  cmd     pid   ||||| time  |   caller
1303   #     \   /      |||||  \    |   /
1304       bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1305       bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
1306       bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1307       bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1308       bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1309       bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
1310       bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
1311       bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1312       bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1313   [...]
1314       bash-2042    3d..1   67us : delay_tsc <-__delay
1315       bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1316       bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
1317       bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1318       bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
1319       bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1320       bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1321       bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1322       bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1323       bash-2042    3d..1  120us : <stack trace>
1324    => _raw_spin_unlock_irqrestore
1325    => ata_scsi_queuecmd
1326    => scsi_dispatch_cmd
1327    => scsi_request_fn
1328    => __blk_run_queue_uncond
1329    => __blk_run_queue
1330    => blk_queue_bio
1331    => generic_make_request
1332    => submit_bio
1333    => submit_bh
1334    => __ext3_get_inode_loc
1335    => ext3_iget
1336    => ext3_lookup
1337    => lookup_real
1338    => __lookup_hash
1339    => walk_component
1340    => lookup_last
1341    => path_lookupat
1342    => filename_lookup
1343    => user_path_at_empty
1344    => user_path_at
1345    => vfs_fstatat
1346    => vfs_stat
1347    => sys_newstat
1348    => system_call_fastpath
1349
1350
1351 Here we traced a 71 microsecond latency. But we also see all the
1352 functions that were called during that time. Note that by
1353 enabling function tracing, we incur an added overhead. This
1354 overhead may extend the latency times. But nevertheless, this
1355 trace has provided some very helpful debugging information.
1356
1357
1358 preemptoff
1359 ----------
1360
1361 When preemption is disabled, we may be able to receive
1362 interrupts but the task cannot be preempted and a higher
1363 priority task must wait for preemption to be enabled again
1364 before it can preempt a lower priority task.
1365
1366 The preemptoff tracer traces the places that disable preemption.
1367 Like the irqsoff tracer, it records the maximum latency for
1368 which preemption was disabled. The control of preemptoff tracer
1369 is much like the irqsoff tracer.
1370 ::
1371
1372   # echo 0 > options/function-trace
1373   # echo preemptoff > current_tracer
1374   # echo 1 > tracing_on
1375   # echo 0 > tracing_max_latency
1376   # ls -ltr
1377   [...]
1378   # echo 0 > tracing_on
1379   # cat trace
1380   # tracer: preemptoff
1381   #
1382   # preemptoff latency trace v1.1.5 on 3.8.0-test+
1383   # --------------------------------------------------------------------
1384   # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1385   #    -----------------
1386   #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1387   #    -----------------
1388   #  => started at: do_IRQ
1389   #  => ended at:   do_IRQ
1390   #
1391   #
1392   #                  _------=> CPU#
1393   #                 / _-----=> irqs-off
1394   #                | / _----=> need-resched
1395   #                || / _---=> hardirq/softirq
1396   #                ||| / _--=> preempt-depth
1397   #                |||| /     delay
1398   #  cmd     pid   ||||| time  |   caller
1399   #     \   /      |||||  \    |   /
1400       sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
1401       sshd-1991    1d..1   46us : irq_exit <-do_IRQ
1402       sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
1403       sshd-1991    1d..1   52us : <stack trace>
1404    => sub_preempt_count
1405    => irq_exit
1406    => do_IRQ
1407    => ret_from_intr
1408
1409
1410 This has some more changes. Preemption was disabled when an
1411 interrupt came in (notice the 'h'), and was enabled on exit.
1412 But we also see that interrupts have been disabled when entering
1413 the preempt off section and leaving it (the 'd'). We do not know if
1414 interrupts were enabled in the mean time or shortly after this
1415 was over.
1416 ::
1417
1418   # tracer: preemptoff
1419   #
1420   # preemptoff latency trace v1.1.5 on 3.8.0-test+
1421   # --------------------------------------------------------------------
1422   # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1423   #    -----------------
1424   #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1425   #    -----------------
1426   #  => started at: wake_up_new_task
1427   #  => ended at:   task_rq_unlock
1428   #
1429   #
1430   #                  _------=> CPU#
1431   #                 / _-----=> irqs-off
1432   #                | / _----=> need-resched
1433   #                || / _---=> hardirq/softirq
1434   #                ||| / _--=> preempt-depth
1435   #                |||| /     delay
1436   #  cmd     pid   ||||| time  |   caller
1437   #     \   /      |||||  \    |   /
1438       bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
1439       bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
1440       bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
1441       bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1442       bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1443   [...]
1444       bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
1445       bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
1446       bash-1994    1d..1   13us : add_preempt_count <-irq_enter
1447       bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
1448       bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1449       bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
1450       bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
1451       bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
1452   [...]
1453       bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
1454       bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
1455       bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
1456       bash-1994    1d..2   36us : do_softirq <-irq_exit
1457       bash-1994    1d..2   36us : __do_softirq <-call_softirq
1458       bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
1459       bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
1460       bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
1461       bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
1462       bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
1463   [...]
1464       bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
1465       bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
1466       bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
1467       bash-1994    1dN.2   82us : idle_cpu <-irq_exit
1468       bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
1469       bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
1470       bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1471       bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
1472       bash-1994    1.N.1  104us : <stack trace>
1473    => sub_preempt_count
1474    => _raw_spin_unlock_irqrestore
1475    => task_rq_unlock
1476    => wake_up_new_task
1477    => do_fork
1478    => sys_clone
1479    => stub_clone
1480
1481
1482 The above is an example of the preemptoff trace with
1483 function-trace set. Here we see that interrupts were not disabled
1484 the entire time. The irq_enter code lets us know that we entered
1485 an interrupt 'h'. Before that, the functions being traced still
1486 show that it is not in an interrupt, but we can see from the
1487 functions themselves that this is not the case.
1488
1489 preemptirqsoff
1490 --------------
1491
1492 Knowing the locations that have interrupts disabled or
1493 preemption disabled for the longest times is helpful. But
1494 sometimes we would like to know when either preemption and/or
1495 interrupts are disabled.
1496
1497 Consider the following code::
1498
1499     local_irq_disable();
1500     call_function_with_irqs_off();
1501     preempt_disable();
1502     call_function_with_irqs_and_preemption_off();
1503     local_irq_enable();
1504     call_function_with_preemption_off();
1505     preempt_enable();
1506
1507 The irqsoff tracer will record the total length of
1508 call_function_with_irqs_off() and
1509 call_function_with_irqs_and_preemption_off().
1510
1511 The preemptoff tracer will record the total length of
1512 call_function_with_irqs_and_preemption_off() and
1513 call_function_with_preemption_off().
1514
1515 But neither will trace the time that interrupts and/or
1516 preemption is disabled. This total time is the time that we can
1517 not schedule. To record this time, use the preemptirqsoff
1518 tracer.
1519
1520 Again, using this trace is much like the irqsoff and preemptoff
1521 tracers.
1522 ::
1523
1524   # echo 0 > options/function-trace
1525   # echo preemptirqsoff > current_tracer
1526   # echo 1 > tracing_on
1527   # echo 0 > tracing_max_latency
1528   # ls -ltr
1529   [...]
1530   # echo 0 > tracing_on
1531   # cat trace
1532   # tracer: preemptirqsoff
1533   #
1534   # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1535   # --------------------------------------------------------------------
1536   # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1537   #    -----------------
1538   #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1539   #    -----------------
1540   #  => started at: ata_scsi_queuecmd
1541   #  => ended at:   ata_scsi_queuecmd
1542   #
1543   #
1544   #                  _------=> CPU#
1545   #                 / _-----=> irqs-off
1546   #                | / _----=> need-resched
1547   #                || / _---=> hardirq/softirq
1548   #                ||| / _--=> preempt-depth
1549   #                |||| /     delay
1550   #  cmd     pid   ||||| time  |   caller
1551   #     \   /      |||||  \    |   /
1552         ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1553         ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1554         ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
1555         ls-2230    3...1  111us : <stack trace>
1556    => sub_preempt_count
1557    => _raw_spin_unlock_irqrestore
1558    => ata_scsi_queuecmd
1559    => scsi_dispatch_cmd
1560    => scsi_request_fn
1561    => __blk_run_queue_uncond
1562    => __blk_run_queue
1563    => blk_queue_bio
1564    => generic_make_request
1565    => submit_bio
1566    => submit_bh
1567    => ext3_bread
1568    => ext3_dir_bread
1569    => htree_dirblock_to_tree
1570    => ext3_htree_fill_tree
1571    => ext3_readdir
1572    => vfs_readdir
1573    => sys_getdents
1574    => system_call_fastpath
1575
1576
1577 The trace_hardirqs_off_thunk is called from assembly on x86 when
1578 interrupts are disabled in the assembly code. Without the
1579 function tracing, we do not know if interrupts were enabled
1580 within the preemption points. We do see that it started with
1581 preemption enabled.
1582
1583 Here is a trace with function-trace set::
1584
1585   # tracer: preemptirqsoff
1586   #
1587   # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1588   # --------------------------------------------------------------------
1589   # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1590   #    -----------------
1591   #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1592   #    -----------------
1593   #  => started at: schedule
1594   #  => ended at:   mutex_unlock
1595   #
1596   #
1597   #                  _------=> CPU#
1598   #                 / _-----=> irqs-off
1599   #                | / _----=> need-resched
1600   #                || / _---=> hardirq/softirq
1601   #                ||| / _--=> preempt-depth
1602   #                |||| /     delay
1603   #  cmd     pid   ||||| time  |   caller
1604   #     \   /      |||||  \    |   /
1605   kworker/-59      3...1    0us : __schedule <-schedule
1606   kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
1607   kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
1608   kworker/-59      3d..2    1us : deactivate_task <-__schedule
1609   kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
1610   kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
1611   kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
1612   kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
1613   kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
1614   kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
1615   kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
1616   kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
1617   kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1618   kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
1619   kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
1620   kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
1621   kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
1622   kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
1623   kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
1624   kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
1625   kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
1626   kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
1627   kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
1628   kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
1629   kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
1630         ls-2269    3d..2    7us : finish_task_switch <-__schedule
1631         ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
1632         ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
1633         ls-2269    3d..2    8us : irq_enter <-do_IRQ
1634         ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
1635         ls-2269    3d..2    9us : add_preempt_count <-irq_enter
1636         ls-2269    3d.h2    9us : exit_idle <-do_IRQ
1637   [...]
1638         ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
1639         ls-2269    3d.h2   20us : irq_exit <-do_IRQ
1640         ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
1641         ls-2269    3d..3   21us : do_softirq <-irq_exit
1642         ls-2269    3d..3   21us : __do_softirq <-call_softirq
1643         ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
1644         ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
1645         ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
1646         ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
1647         ls-2269    3d.s5   31us : irq_enter <-do_IRQ
1648         ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1649   [...]
1650         ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1651         ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
1652         ls-2269    3d.H5   32us : exit_idle <-do_IRQ
1653         ls-2269    3d.H5   32us : handle_irq <-do_IRQ
1654         ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
1655         ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
1656   [...]
1657         ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1658         ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1659         ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
1660         ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
1661         ls-2269    3d..3  159us : idle_cpu <-irq_exit
1662         ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
1663         ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
1664         ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
1665         ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
1666         ls-2269    3d...  186us : <stack trace>
1667    => __mutex_unlock_slowpath
1668    => mutex_unlock
1669    => process_output
1670    => n_tty_write
1671    => tty_write
1672    => vfs_write
1673    => sys_write
1674    => system_call_fastpath
1675
1676 This is an interesting trace. It started with kworker running and
1677 scheduling out and ls taking over. But as soon as ls released the
1678 rq lock and enabled interrupts (but not preemption) an interrupt
1679 triggered. When the interrupt finished, it started running softirqs.
1680 But while the softirq was running, another interrupt triggered.
1681 When an interrupt is running inside a softirq, the annotation is 'H'.
1682
1683
1684 wakeup
1685 ------
1686
1687 One common case that people are interested in tracing is the
1688 time it takes for a task that is woken to actually wake up.
1689 Now for non Real-Time tasks, this can be arbitrary. But tracing
1690 it none the less can be interesting.
1691
1692 Without function tracing::
1693
1694   # echo 0 > options/function-trace
1695   # echo wakeup > current_tracer
1696   # echo 1 > tracing_on
1697   # echo 0 > tracing_max_latency
1698   # chrt -f 5 sleep 1
1699   # echo 0 > tracing_on
1700   # cat trace
1701   # tracer: wakeup
1702   #
1703   # wakeup latency trace v1.1.5 on 3.8.0-test+
1704   # --------------------------------------------------------------------
1705   # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1706   #    -----------------
1707   #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1708   #    -----------------
1709   #
1710   #                  _------=> CPU#
1711   #                 / _-----=> irqs-off
1712   #                | / _----=> need-resched
1713   #                || / _---=> hardirq/softirq
1714   #                ||| / _--=> preempt-depth
1715   #                |||| /     delay
1716   #  cmd     pid   ||||| time  |   caller
1717   #     \   /      |||||  \    |   /
1718     <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
1719     <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1720     <idle>-0       3d..3   15us : __schedule <-schedule
1721     <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
1722
1723 The tracer only traces the highest priority task in the system
1724 to avoid tracing the normal circumstances. Here we see that
1725 the kworker with a nice priority of -20 (not very nice), took
1726 just 15 microseconds from the time it woke up, to the time it
1727 ran.
1728
1729 Non Real-Time tasks are not that interesting. A more interesting
1730 trace is to concentrate only on Real-Time tasks.
1731
1732 wakeup_rt
1733 ---------
1734
1735 In a Real-Time environment it is very important to know the
1736 wakeup time it takes for the highest priority task that is woken
1737 up to the time that it executes. This is also known as "schedule
1738 latency". I stress the point that this is about RT tasks. It is
1739 also important to know the scheduling latency of non-RT tasks,
1740 but the average schedule latency is better for non-RT tasks.
1741 Tools like LatencyTop are more appropriate for such
1742 measurements.
1743
1744 Real-Time environments are interested in the worst case latency.
1745 That is the longest latency it takes for something to happen,
1746 and not the average. We can have a very fast scheduler that may
1747 only have a large latency once in a while, but that would not
1748 work well with Real-Time tasks.  The wakeup_rt tracer was designed
1749 to record the worst case wakeups of RT tasks. Non-RT tasks are
1750 not recorded because the tracer only records one worst case and
1751 tracing non-RT tasks that are unpredictable will overwrite the
1752 worst case latency of RT tasks (just run the normal wakeup
1753 tracer for a while to see that effect).
1754
1755 Since this tracer only deals with RT tasks, we will run this
1756 slightly differently than we did with the previous tracers.
1757 Instead of performing an 'ls', we will run 'sleep 1' under
1758 'chrt' which changes the priority of the task.
1759 ::
1760
1761   # echo 0 > options/function-trace
1762   # echo wakeup_rt > current_tracer
1763   # echo 1 > tracing_on
1764   # echo 0 > tracing_max_latency
1765   # chrt -f 5 sleep 1
1766   # echo 0 > tracing_on
1767   # cat trace
1768   # tracer: wakeup
1769   #
1770   # tracer: wakeup_rt
1771   #
1772   # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1773   # --------------------------------------------------------------------
1774   # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1775   #    -----------------
1776   #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1777   #    -----------------
1778   #
1779   #                  _------=> CPU#
1780   #                 / _-----=> irqs-off
1781   #                | / _----=> need-resched
1782   #                || / _---=> hardirq/softirq
1783   #                ||| / _--=> preempt-depth
1784   #                |||| /     delay
1785   #  cmd     pid   ||||| time  |   caller
1786   #     \   /      |||||  \    |   /
1787     <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
1788     <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1789     <idle>-0       3d..3    5us : __schedule <-schedule
1790     <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1791
1792
1793 Running this on an idle system, we see that it only took 5 microseconds
1794 to perform the task switch.  Note, since the trace point in the schedule
1795 is before the actual "switch", we stop the tracing when the recorded task
1796 is about to schedule in. This may change if we add a new marker at the
1797 end of the scheduler.
1798
1799 Notice that the recorded task is 'sleep' with the PID of 2389
1800 and it has an rt_prio of 5. This priority is user-space priority
1801 and not the internal kernel priority. The policy is 1 for
1802 SCHED_FIFO and 2 for SCHED_RR.
1803
1804 Note, that the trace data shows the internal priority (99 - rtprio).
1805 ::
1806
1807   <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1808
1809 The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
1810 and in the running state 'R'. The sleep task was scheduled in with
1811 2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1812 and it too is in the running state.
1813
1814 Doing the same with chrt -r 5 and function-trace set.
1815 ::
1816
1817   echo 1 > options/function-trace
1818
1819   # tracer: wakeup_rt
1820   #
1821   # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1822   # --------------------------------------------------------------------
1823   # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1824   #    -----------------
1825   #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
1826   #    -----------------
1827   #
1828   #                  _------=> CPU#
1829   #                 / _-----=> irqs-off
1830   #                | / _----=> need-resched
1831   #                || / _---=> hardirq/softirq
1832   #                ||| / _--=> preempt-depth
1833   #                |||| /     delay
1834   #  cmd     pid   ||||| time  |   caller
1835   #     \   /      |||||  \    |   /
1836     <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
1837     <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1838     <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
1839     <idle>-0       3d.h3    3us : resched_curr <-check_preempt_curr
1840     <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
1841     <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
1842     <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
1843     <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
1844     <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
1845     <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1846     <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
1847     <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
1848     <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
1849     <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
1850     <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
1851     <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
1852     <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
1853     <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
1854     <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
1855     <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
1856     <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
1857     <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
1858     <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
1859     <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
1860     <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
1861     <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
1862     <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
1863     <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
1864     <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
1865     <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
1866     <idle>-0       3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
1867     <idle>-0       3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
1868     <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
1869     <idle>-0       3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
1870     <idle>-0       3dN.2   14us : sched_avg_update <-__cpu_load_update
1871     <idle>-0       3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
1872     <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
1873     <idle>-0       3dN.1   15us : calc_load_nohz_stop <-tick_nohz_idle_exit
1874     <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
1875     <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
1876     <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
1877     <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
1878     <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1879     <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
1880     <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
1881     <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
1882     <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
1883     <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
1884     <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
1885     <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
1886     <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
1887     <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1888     <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
1889     <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
1890     <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
1891     <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
1892     <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
1893     <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
1894     <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1895     <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
1896     <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
1897     <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
1898     <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
1899     <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
1900     <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
1901     <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
1902     <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
1903     <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1904     <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
1905     <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
1906     <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
1907     <idle>-0       3.N..   25us : schedule <-cpu_idle
1908     <idle>-0       3.N..   25us : __schedule <-preempt_schedule
1909     <idle>-0       3.N..   26us : add_preempt_count <-__schedule
1910     <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
1911     <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
1912     <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
1913     <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
1914     <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
1915     <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
1916     <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
1917     <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
1918     <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
1919     <idle>-0       3d..3   29us : __schedule <-preempt_schedule
1920     <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
1921
1922 This isn't that big of a trace, even with function tracing enabled,
1923 so I included the entire trace.
1924
1925 The interrupt went off while when the system was idle. Somewhere
1926 before task_woken_rt() was called, the NEED_RESCHED flag was set,
1927 this is indicated by the first occurrence of the 'N' flag.
1928
1929 Latency tracing and events
1930 --------------------------
1931 As function tracing can induce a much larger latency, but without
1932 seeing what happens within the latency it is hard to know what
1933 caused it. There is a middle ground, and that is with enabling
1934 events.
1935 ::
1936
1937   # echo 0 > options/function-trace
1938   # echo wakeup_rt > current_tracer
1939   # echo 1 > events/enable
1940   # echo 1 > tracing_on
1941   # echo 0 > tracing_max_latency
1942   # chrt -f 5 sleep 1
1943   # echo 0 > tracing_on
1944   # cat trace
1945   # tracer: wakeup_rt
1946   #
1947   # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1948   # --------------------------------------------------------------------
1949   # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1950   #    -----------------
1951   #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
1952   #    -----------------
1953   #
1954   #                  _------=> CPU#
1955   #                 / _-----=> irqs-off
1956   #                | / _----=> need-resched
1957   #                || / _---=> hardirq/softirq
1958   #                ||| / _--=> preempt-depth
1959   #                |||| /     delay
1960   #  cmd     pid   ||||| time  |   caller
1961   #     \   /      |||||  \    |   /
1962     <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
1963     <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1964     <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
1965     <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
1966     <idle>-0       2.N.2    2us : power_end: cpu_id=2
1967     <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
1968     <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
1969     <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
1970     <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
1971     <idle>-0       2.N.2    5us : rcu_utilization: End context switch
1972     <idle>-0       2d..3    6us : __schedule <-schedule
1973     <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
1974
1975
1976 Hardware Latency Detector
1977 -------------------------
1978
1979 The hardware latency detector is executed by enabling the "hwlat" tracer.
1980
1981 NOTE, this tracer will affect the performance of the system as it will
1982 periodically make a CPU constantly busy with interrupts disabled.
1983 ::
1984
1985   # echo hwlat > current_tracer
1986   # sleep 100
1987   # cat trace
1988   # tracer: hwlat
1989   #
1990   #                              _-----=> irqs-off
1991   #                             / _----=> need-resched
1992   #                            | / _---=> hardirq/softirq
1993   #                            || / _--=> preempt-depth
1994   #                            ||| /     delay
1995   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
1996   #              | |       |   ||||       |         |
1997              <...>-3638  [001] d... 19452.055471: #1     inner/outer(us):   12/14    ts:1499801089.066141940
1998              <...>-3638  [003] d... 19454.071354: #2     inner/outer(us):   11/9     ts:1499801091.082164365
1999              <...>-3638  [002] dn.. 19461.126852: #3     inner/outer(us):   12/9     ts:1499801098.138150062
2000              <...>-3638  [001] d... 19488.340960: #4     inner/outer(us):    8/12    ts:1499801125.354139633
2001              <...>-3638  [003] d... 19494.388553: #5     inner/outer(us):    8/12    ts:1499801131.402150961
2002              <...>-3638  [003] d... 19501.283419: #6     inner/outer(us):    0/12    ts:1499801138.297435289 nmi-total:4 nmi-count:1
2003
2004
2005 The above output is somewhat the same in the header. All events will have
2006 interrupts disabled 'd'. Under the FUNCTION title there is:
2007
2008  #1
2009         This is the count of events recorded that were greater than the
2010         tracing_threshold (See below).
2011
2012  inner/outer(us):   12/14
2013
2014       This shows two numbers as "inner latency" and "outer latency". The test
2015       runs in a loop checking a timestamp twice. The latency detected within
2016       the two timestamps is the "inner latency" and the latency detected
2017       after the previous timestamp and the next timestamp in the loop is
2018       the "outer latency".
2019
2020  ts:1499801089.066141940
2021
2022       The absolute timestamp that the event happened.
2023
2024  nmi-total:4 nmi-count:1
2025
2026       On architectures that support it, if an NMI comes in during the
2027       test, the time spent in NMI is reported in "nmi-total" (in
2028       microseconds).
2029
2030       All architectures that have NMIs will show the "nmi-count" if an
2031       NMI comes in during the test.
2032
2033 hwlat files:
2034
2035   tracing_threshold
2036         This gets automatically set to "10" to represent 10
2037         microseconds. This is the threshold of latency that
2038         needs to be detected before the trace will be recorded.
2039
2040         Note, when hwlat tracer is finished (another tracer is
2041         written into "current_tracer"), the original value for
2042         tracing_threshold is placed back into this file.
2043
2044   hwlat_detector/width
2045         The length of time the test runs with interrupts disabled.
2046
2047   hwlat_detector/window
2048         The length of time of the window which the test
2049         runs. That is, the test will run for "width"
2050         microseconds per "window" microseconds
2051
2052   tracing_cpumask
2053         When the test is started. A kernel thread is created that
2054         runs the test. This thread will alternate between CPUs
2055         listed in the tracing_cpumask between each period
2056         (one "window"). To limit the test to specific CPUs
2057         set the mask in this file to only the CPUs that the test
2058         should run on.
2059
2060 function
2061 --------
2062
2063 This tracer is the function tracer. Enabling the function tracer
2064 can be done from the debug file system. Make sure the
2065 ftrace_enabled is set; otherwise this tracer is a nop.
2066 See the "ftrace_enabled" section below.
2067 ::
2068
2069   # sysctl kernel.ftrace_enabled=1
2070   # echo function > current_tracer
2071   # echo 1 > tracing_on
2072   # usleep 1
2073   # echo 0 > tracing_on
2074   # cat trace
2075   # tracer: function
2076   #
2077   # entries-in-buffer/entries-written: 24799/24799   #P:4
2078   #
2079   #                              _-----=> irqs-off
2080   #                             / _----=> need-resched
2081   #                            | / _---=> hardirq/softirq
2082   #                            || / _--=> preempt-depth
2083   #                            ||| /     delay
2084   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2085   #              | |       |   ||||       |         |
2086               bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
2087               bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2088               bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
2089               bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
2090               bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
2091               bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
2092               bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
2093               bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
2094   [...]
2095
2096
2097 Note: function tracer uses ring buffers to store the above
2098 entries. The newest data may overwrite the oldest data.
2099 Sometimes using echo to stop the trace is not sufficient because
2100 the tracing could have overwritten the data that you wanted to
2101 record. For this reason, it is sometimes better to disable
2102 tracing directly from a program. This allows you to stop the
2103 tracing at the point that you hit the part that you are
2104 interested in. To disable the tracing directly from a C program,
2105 something like following code snippet can be used::
2106
2107         int trace_fd;
2108         [...]
2109         int main(int argc, char *argv[]) {
2110                 [...]
2111                 trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2112                 [...]
2113                 if (condition_hit()) {
2114                         write(trace_fd, "0", 1);
2115                 }
2116                 [...]
2117         }
2118
2119
2120 Single thread tracing
2121 ---------------------
2122
2123 By writing into set_ftrace_pid you can trace a
2124 single thread. For example::
2125
2126   # cat set_ftrace_pid
2127   no pid
2128   # echo 3111 > set_ftrace_pid
2129   # cat set_ftrace_pid
2130   3111
2131   # echo function > current_tracer
2132   # cat trace | head
2133   # tracer: function
2134   #
2135   #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2136   #              | |       |          |         |
2137       yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
2138       yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2139       yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2140       yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2141       yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
2142       yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
2143   # echo > set_ftrace_pid
2144   # cat trace |head
2145   # tracer: function
2146   #
2147   #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2148   #              | |       |          |         |
2149   ##### CPU 3 buffer started ####
2150       yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
2151       yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
2152       yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
2153       yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
2154       yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
2155
2156 If you want to trace a function when executing, you could use
2157 something like this simple program.
2158 ::
2159
2160         #include <stdio.h>
2161         #include <stdlib.h>
2162         #include <sys/types.h>
2163         #include <sys/stat.h>
2164         #include <fcntl.h>
2165         #include <unistd.h>
2166         #include <string.h>
2167
2168         #define _STR(x) #x
2169         #define STR(x) _STR(x)
2170         #define MAX_PATH 256
2171
2172         const char *find_tracefs(void)
2173         {
2174                static char tracefs[MAX_PATH+1];
2175                static int tracefs_found;
2176                char type[100];
2177                FILE *fp;
2178
2179                if (tracefs_found)
2180                        return tracefs;
2181
2182                if ((fp = fopen("/proc/mounts","r")) == NULL) {
2183                        perror("/proc/mounts");
2184                        return NULL;
2185                }
2186
2187                while (fscanf(fp, "%*s %"
2188                              STR(MAX_PATH)
2189                              "s %99s %*s %*d %*d\n",
2190                              tracefs, type) == 2) {
2191                        if (strcmp(type, "tracefs") == 0)
2192                                break;
2193                }
2194                fclose(fp);
2195
2196                if (strcmp(type, "tracefs") != 0) {
2197                        fprintf(stderr, "tracefs not mounted");
2198                        return NULL;
2199                }
2200
2201                strcat(tracefs, "/tracing/");
2202                tracefs_found = 1;
2203
2204                return tracefs;
2205         }
2206
2207         const char *tracing_file(const char *file_name)
2208         {
2209                static char trace_file[MAX_PATH+1];
2210                snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2211                return trace_file;
2212         }
2213
2214         int main (int argc, char **argv)
2215         {
2216                 if (argc < 1)
2217                         exit(-1);
2218
2219                 if (fork() > 0) {
2220                         int fd, ffd;
2221                         char line[64];
2222                         int s;
2223
2224                         ffd = open(tracing_file("current_tracer"), O_WRONLY);
2225                         if (ffd < 0)
2226                                 exit(-1);
2227                         write(ffd, "nop", 3);
2228
2229                         fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2230                         s = sprintf(line, "%d\n", getpid());
2231                         write(fd, line, s);
2232
2233                         write(ffd, "function", 8);
2234
2235                         close(fd);
2236                         close(ffd);
2237
2238                         execvp(argv[1], argv+1);
2239                 }
2240
2241                 return 0;
2242         }
2243
2244 Or this simple script!
2245 ::
2246
2247   #!/bin/bash
2248
2249   tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
2250   echo nop > $tracefs/tracing/current_tracer
2251   echo 0 > $tracefs/tracing/tracing_on
2252   echo $$ > $tracefs/tracing/set_ftrace_pid
2253   echo function > $tracefs/tracing/current_tracer
2254   echo 1 > $tracefs/tracing/tracing_on
2255   exec "$@"
2256
2257
2258 function graph tracer
2259 ---------------------------
2260
2261 This tracer is similar to the function tracer except that it
2262 probes a function on its entry and its exit. This is done by
2263 using a dynamically allocated stack of return addresses in each
2264 task_struct. On function entry the tracer overwrites the return
2265 address of each function traced to set a custom probe. Thus the
2266 original return address is stored on the stack of return address
2267 in the task_struct.
2268
2269 Probing on both ends of a function leads to special features
2270 such as:
2271
2272 - measure of a function's time execution
2273 - having a reliable call stack to draw function calls graph
2274
2275 This tracer is useful in several situations:
2276
2277 - you want to find the reason of a strange kernel behavior and
2278   need to see what happens in detail on any areas (or specific
2279   ones).
2280
2281 - you are experiencing weird latencies but it's difficult to
2282   find its origin.
2283
2284 - you want to find quickly which path is taken by a specific
2285   function
2286
2287 - you just want to peek inside a working kernel and want to see
2288   what happens there.
2289
2290 ::
2291
2292   # tracer: function_graph
2293   #
2294   # CPU  DURATION                  FUNCTION CALLS
2295   # |     |   |                     |   |   |   |
2296
2297    0)               |  sys_open() {
2298    0)               |    do_sys_open() {
2299    0)               |      getname() {
2300    0)               |        kmem_cache_alloc() {
2301    0)   1.382 us    |          __might_sleep();
2302    0)   2.478 us    |        }
2303    0)               |        strncpy_from_user() {
2304    0)               |          might_fault() {
2305    0)   1.389 us    |            __might_sleep();
2306    0)   2.553 us    |          }
2307    0)   3.807 us    |        }
2308    0)   7.876 us    |      }
2309    0)               |      alloc_fd() {
2310    0)   0.668 us    |        _spin_lock();
2311    0)   0.570 us    |        expand_files();
2312    0)   0.586 us    |        _spin_unlock();
2313
2314
2315 There are several columns that can be dynamically
2316 enabled/disabled. You can use every combination of options you
2317 want, depending on your needs.
2318
2319 - The cpu number on which the function executed is default
2320   enabled.  It is sometimes better to only trace one cpu (see
2321   tracing_cpu_mask file) or you might sometimes see unordered
2322   function calls while cpu tracing switch.
2323
2324         - hide: echo nofuncgraph-cpu > trace_options
2325         - show: echo funcgraph-cpu > trace_options
2326
2327 - The duration (function's time of execution) is displayed on
2328   the closing bracket line of a function or on the same line
2329   than the current function in case of a leaf one. It is default
2330   enabled.
2331
2332         - hide: echo nofuncgraph-duration > trace_options
2333         - show: echo funcgraph-duration > trace_options
2334
2335 - The overhead field precedes the duration field in case of
2336   reached duration thresholds.
2337
2338         - hide: echo nofuncgraph-overhead > trace_options
2339         - show: echo funcgraph-overhead > trace_options
2340         - depends on: funcgraph-duration
2341
2342   ie::
2343
2344     3) # 1837.709 us |          } /* __switch_to */
2345     3)               |          finish_task_switch() {
2346     3)   0.313 us    |            _raw_spin_unlock_irq();
2347     3)   3.177 us    |          }
2348     3) # 1889.063 us |        } /* __schedule */
2349     3) ! 140.417 us  |      } /* __schedule */
2350     3) # 2034.948 us |    } /* schedule */
2351     3) * 33998.59 us |  } /* schedule_preempt_disabled */
2352
2353     [...]
2354
2355     1)   0.260 us    |              msecs_to_jiffies();
2356     1)   0.313 us    |              __rcu_read_unlock();
2357     1) + 61.770 us   |            }
2358     1) + 64.479 us   |          }
2359     1)   0.313 us    |          rcu_bh_qs();
2360     1)   0.313 us    |          __local_bh_enable();
2361     1) ! 217.240 us  |        }
2362     1)   0.365 us    |        idle_cpu();
2363     1)               |        rcu_irq_exit() {
2364     1)   0.417 us    |          rcu_eqs_enter_common.isra.47();
2365     1)   3.125 us    |        }
2366     1) ! 227.812 us  |      }
2367     1) ! 457.395 us  |    }
2368     1) @ 119760.2 us |  }
2369
2370     [...]
2371
2372     2)               |    handle_IPI() {
2373     1)   6.979 us    |                  }
2374     2)   0.417 us    |      scheduler_ipi();
2375     1)   9.791 us    |                }
2376     1) + 12.917 us   |              }
2377     2)   3.490 us    |    }
2378     1) + 15.729 us   |            }
2379     1) + 18.542 us   |          }
2380     2) $ 3594274 us  |  }
2381
2382 Flags::
2383
2384   + means that the function exceeded 10 usecs.
2385   ! means that the function exceeded 100 usecs.
2386   # means that the function exceeded 1000 usecs.
2387   * means that the function exceeded 10 msecs.
2388   @ means that the function exceeded 100 msecs.
2389   $ means that the function exceeded 1 sec.
2390
2391
2392 - The task/pid field displays the thread cmdline and pid which
2393   executed the function. It is default disabled.
2394
2395         - hide: echo nofuncgraph-proc > trace_options
2396         - show: echo funcgraph-proc > trace_options
2397
2398   ie::
2399
2400     # tracer: function_graph
2401     #
2402     # CPU  TASK/PID        DURATION                  FUNCTION CALLS
2403     # |    |    |           |   |                     |   |   |   |
2404     0)    sh-4802     |               |                  d_free() {
2405     0)    sh-4802     |               |                    call_rcu() {
2406     0)    sh-4802     |               |                      __call_rcu() {
2407     0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
2408     0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
2409     0)    sh-4802     |   2.899 us    |                      }
2410     0)    sh-4802     |   4.040 us    |                    }
2411     0)    sh-4802     |   5.151 us    |                  }
2412     0)    sh-4802     | + 49.370 us   |                }
2413
2414
2415 - The absolute time field is an absolute timestamp given by the
2416   system clock since it started. A snapshot of this time is
2417   given on each entry/exit of functions
2418
2419         - hide: echo nofuncgraph-abstime > trace_options
2420         - show: echo funcgraph-abstime > trace_options
2421
2422   ie::
2423
2424     #
2425     #      TIME       CPU  DURATION                  FUNCTION CALLS
2426     #       |         |     |   |                     |   |   |   |
2427     360.774522 |   1)   0.541 us    |                                          }
2428     360.774522 |   1)   4.663 us    |                                        }
2429     360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
2430     360.774524 |   1)   6.796 us    |                                      }
2431     360.774524 |   1)   7.952 us    |                                    }
2432     360.774525 |   1)   9.063 us    |                                  }
2433     360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
2434     360.774527 |   1)   0.578 us    |                                  __brelse();
2435     360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
2436     360.774528 |   1)               |                                    unlock_buffer() {
2437     360.774529 |   1)               |                                      wake_up_bit() {
2438     360.774529 |   1)               |                                        bit_waitqueue() {
2439     360.774530 |   1)   0.594 us    |                                          __phys_addr();
2440
2441
2442 The function name is always displayed after the closing bracket
2443 for a function if the start of that function is not in the
2444 trace buffer.
2445
2446 Display of the function name after the closing bracket may be
2447 enabled for functions whose start is in the trace buffer,
2448 allowing easier searching with grep for function durations.
2449 It is default disabled.
2450
2451         - hide: echo nofuncgraph-tail > trace_options
2452         - show: echo funcgraph-tail > trace_options
2453
2454   Example with nofuncgraph-tail (default)::
2455
2456     0)               |      putname() {
2457     0)               |        kmem_cache_free() {
2458     0)   0.518 us    |          __phys_addr();
2459     0)   1.757 us    |        }
2460     0)   2.861 us    |      }
2461
2462   Example with funcgraph-tail::
2463
2464     0)               |      putname() {
2465     0)               |        kmem_cache_free() {
2466     0)   0.518 us    |          __phys_addr();
2467     0)   1.757 us    |        } /* kmem_cache_free() */
2468     0)   2.861 us    |      } /* putname() */
2469
2470 You can put some comments on specific functions by using
2471 trace_printk() For example, if you want to put a comment inside
2472 the __might_sleep() function, you just have to include
2473 <linux/ftrace.h> and call trace_printk() inside __might_sleep()::
2474
2475         trace_printk("I'm a comment!\n")
2476
2477 will produce::
2478
2479    1)               |             __might_sleep() {
2480    1)               |                /* I'm a comment! */
2481    1)   1.449 us    |             }
2482
2483
2484 You might find other useful features for this tracer in the
2485 following "dynamic ftrace" section such as tracing only specific
2486 functions or tasks.
2487
2488 dynamic ftrace
2489 --------------
2490
2491 If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2492 virtually no overhead when function tracing is disabled. The way
2493 this works is the mcount function call (placed at the start of
2494 every kernel function, produced by the -pg switch in gcc),
2495 starts of pointing to a simple return. (Enabling FTRACE will
2496 include the -pg switch in the compiling of the kernel.)
2497
2498 At compile time every C file object is run through the
2499 recordmcount program (located in the scripts directory). This
2500 program will parse the ELF headers in the C object to find all
2501 the locations in the .text section that call mcount. Starting
2502 with gcc verson 4.6, the -mfentry has been added for x86, which
2503 calls "__fentry__" instead of "mcount". Which is called before
2504 the creation of the stack frame.
2505
2506 Note, not all sections are traced. They may be prevented by either
2507 a notrace, or blocked another way and all inline functions are not
2508 traced. Check the "available_filter_functions" file to see what functions
2509 can be traced.
2510
2511 A section called "__mcount_loc" is created that holds
2512 references to all the mcount/fentry call sites in the .text section.
2513 The recordmcount program re-links this section back into the
2514 original object. The final linking stage of the kernel will add all these
2515 references into a single table.
2516
2517 On boot up, before SMP is initialized, the dynamic ftrace code
2518 scans this table and updates all the locations into nops. It
2519 also records the locations, which are added to the
2520 available_filter_functions list.  Modules are processed as they
2521 are loaded and before they are executed.  When a module is
2522 unloaded, it also removes its functions from the ftrace function
2523 list. This is automatic in the module unload code, and the
2524 module author does not need to worry about it.
2525
2526 When tracing is enabled, the process of modifying the function
2527 tracepoints is dependent on architecture. The old method is to use
2528 kstop_machine to prevent races with the CPUs executing code being
2529 modified (which can cause the CPU to do undesirable things, especially
2530 if the modified code crosses cache (or page) boundaries), and the nops are
2531 patched back to calls. But this time, they do not call mcount
2532 (which is just a function stub). They now call into the ftrace
2533 infrastructure.
2534
2535 The new method of modifying the function tracepoints is to place
2536 a breakpoint at the location to be modified, sync all CPUs, modify
2537 the rest of the instruction not covered by the breakpoint. Sync
2538 all CPUs again, and then remove the breakpoint with the finished
2539 version to the ftrace call site.
2540
2541 Some archs do not even need to monkey around with the synchronization,
2542 and can just slap the new code on top of the old without any
2543 problems with other CPUs executing it at the same time.
2544
2545 One special side-effect to the recording of the functions being
2546 traced is that we can now selectively choose which functions we
2547 wish to trace and which ones we want the mcount calls to remain
2548 as nops.
2549
2550 Two files are used, one for enabling and one for disabling the
2551 tracing of specified functions. They are:
2552
2553   set_ftrace_filter
2554
2555 and
2556
2557   set_ftrace_notrace
2558
2559 A list of available functions that you can add to these files is
2560 listed in:
2561
2562    available_filter_functions
2563
2564 ::
2565
2566   # cat available_filter_functions
2567   put_prev_task_idle
2568   kmem_cache_create
2569   pick_next_task_rt
2570   get_online_cpus
2571   pick_next_task_fair
2572   mutex_lock
2573   [...]
2574
2575 If I am only interested in sys_nanosleep and hrtimer_interrupt::
2576
2577   # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2578   # echo function > current_tracer
2579   # echo 1 > tracing_on
2580   # usleep 1
2581   # echo 0 > tracing_on
2582   # cat trace
2583   # tracer: function
2584   #
2585   # entries-in-buffer/entries-written: 5/5   #P:4
2586   #
2587   #                              _-----=> irqs-off
2588   #                             / _----=> need-resched
2589   #                            | / _---=> hardirq/softirq
2590   #                            || / _--=> preempt-depth
2591   #                            ||| /     delay
2592   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2593   #              | |       |   ||||       |         |
2594             usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
2595             <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2596             usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2597             <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2598             <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2599
2600 To see which functions are being traced, you can cat the file:
2601 ::
2602
2603   # cat set_ftrace_filter
2604   hrtimer_interrupt
2605   sys_nanosleep
2606
2607
2608 Perhaps this is not enough. The filters also allow glob(7) matching.
2609
2610   ``<match>*``
2611         will match functions that begin with <match>
2612   ``*<match>``
2613         will match functions that end with <match>
2614   ``*<match>*``
2615         will match functions that have <match> in it
2616   ``<match1>*<match2>``
2617         will match functions that begin with <match1> and end with <match2>
2618
2619 .. note::
2620       It is better to use quotes to enclose the wild cards,
2621       otherwise the shell may expand the parameters into names
2622       of files in the local directory.
2623
2624 ::
2625
2626   # echo 'hrtimer_*' > set_ftrace_filter
2627
2628 Produces::
2629
2630   # tracer: function
2631   #
2632   # entries-in-buffer/entries-written: 897/897   #P:4
2633   #
2634   #                              _-----=> irqs-off
2635   #                             / _----=> need-resched
2636   #                            | / _---=> hardirq/softirq
2637   #                            || / _--=> preempt-depth
2638   #                            ||| /     delay
2639   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2640   #              | |       |   ||||       |         |
2641             <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2642             <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2643             <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2644             <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2645             <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2646             <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2647             <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2648             <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
2649
2650 Notice that we lost the sys_nanosleep.
2651 ::
2652
2653   # cat set_ftrace_filter
2654   hrtimer_run_queues
2655   hrtimer_run_pending
2656   hrtimer_init
2657   hrtimer_cancel
2658   hrtimer_try_to_cancel
2659   hrtimer_forward
2660   hrtimer_start
2661   hrtimer_reprogram
2662   hrtimer_force_reprogram
2663   hrtimer_get_next_event
2664   hrtimer_interrupt
2665   hrtimer_nanosleep
2666   hrtimer_wakeup
2667   hrtimer_get_remaining
2668   hrtimer_get_res
2669   hrtimer_init_sleeper
2670
2671
2672 This is because the '>' and '>>' act just like they do in bash.
2673 To rewrite the filters, use '>'
2674 To append to the filters, use '>>'
2675
2676 To clear out a filter so that all functions will be recorded
2677 again::
2678
2679  # echo > set_ftrace_filter
2680  # cat set_ftrace_filter
2681  #
2682
2683 Again, now we want to append.
2684
2685 ::
2686
2687   # echo sys_nanosleep > set_ftrace_filter
2688   # cat set_ftrace_filter
2689   sys_nanosleep
2690   # echo 'hrtimer_*' >> set_ftrace_filter
2691   # cat set_ftrace_filter
2692   hrtimer_run_queues
2693   hrtimer_run_pending
2694   hrtimer_init
2695   hrtimer_cancel
2696   hrtimer_try_to_cancel
2697   hrtimer_forward
2698   hrtimer_start
2699   hrtimer_reprogram
2700   hrtimer_force_reprogram
2701   hrtimer_get_next_event
2702   hrtimer_interrupt
2703   sys_nanosleep
2704   hrtimer_nanosleep
2705   hrtimer_wakeup
2706   hrtimer_get_remaining
2707   hrtimer_get_res
2708   hrtimer_init_sleeper
2709
2710
2711 The set_ftrace_notrace prevents those functions from being
2712 traced.
2713 ::
2714
2715   # echo '*preempt*' '*lock*' > set_ftrace_notrace
2716
2717 Produces::
2718
2719   # tracer: function
2720   #
2721   # entries-in-buffer/entries-written: 39608/39608   #P:4
2722   #
2723   #                              _-----=> irqs-off
2724   #                             / _----=> need-resched
2725   #                            | / _---=> hardirq/softirq
2726   #                            || / _--=> preempt-depth
2727   #                            ||| /     delay
2728   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2729   #              | |       |   ||||       |         |
2730               bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
2731               bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
2732               bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
2733               bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
2734               bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
2735               bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
2736               bash-1994  [000] ....  4342.324899: do_truncate <-do_last
2737               bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
2738               bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
2739               bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
2740               bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
2741               bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
2742
2743 We can see that there's no more lock or preempt tracing.
2744
2745
2746 Dynamic ftrace with the function graph tracer
2747 ---------------------------------------------
2748
2749 Although what has been explained above concerns both the
2750 function tracer and the function-graph-tracer, there are some
2751 special features only available in the function-graph tracer.
2752
2753 If you want to trace only one function and all of its children,
2754 you just have to echo its name into set_graph_function::
2755
2756  echo __do_fault > set_graph_function
2757
2758 will produce the following "expanded" trace of the __do_fault()
2759 function::
2760
2761    0)               |  __do_fault() {
2762    0)               |    filemap_fault() {
2763    0)               |      find_lock_page() {
2764    0)   0.804 us    |        find_get_page();
2765    0)               |        __might_sleep() {
2766    0)   1.329 us    |        }
2767    0)   3.904 us    |      }
2768    0)   4.979 us    |    }
2769    0)   0.653 us    |    _spin_lock();
2770    0)   0.578 us    |    page_add_file_rmap();
2771    0)   0.525 us    |    native_set_pte_at();
2772    0)   0.585 us    |    _spin_unlock();
2773    0)               |    unlock_page() {
2774    0)   0.541 us    |      page_waitqueue();
2775    0)   0.639 us    |      __wake_up_bit();
2776    0)   2.786 us    |    }
2777    0) + 14.237 us   |  }
2778    0)               |  __do_fault() {
2779    0)               |    filemap_fault() {
2780    0)               |      find_lock_page() {
2781    0)   0.698 us    |        find_get_page();
2782    0)               |        __might_sleep() {
2783    0)   1.412 us    |        }
2784    0)   3.950 us    |      }
2785    0)   5.098 us    |    }
2786    0)   0.631 us    |    _spin_lock();
2787    0)   0.571 us    |    page_add_file_rmap();
2788    0)   0.526 us    |    native_set_pte_at();
2789    0)   0.586 us    |    _spin_unlock();
2790    0)               |    unlock_page() {
2791    0)   0.533 us    |      page_waitqueue();
2792    0)   0.638 us    |      __wake_up_bit();
2793    0)   2.793 us    |    }
2794    0) + 14.012 us   |  }
2795
2796 You can also expand several functions at once::
2797
2798  echo sys_open > set_graph_function
2799  echo sys_close >> set_graph_function
2800
2801 Now if you want to go back to trace all functions you can clear
2802 this special filter via::
2803
2804  echo > set_graph_function
2805
2806
2807 ftrace_enabled
2808 --------------
2809
2810 Note, the proc sysctl ftrace_enable is a big on/off switch for the
2811 function tracer. By default it is enabled (when function tracing is
2812 enabled in the kernel). If it is disabled, all function tracing is
2813 disabled. This includes not only the function tracers for ftrace, but
2814 also for any other uses (perf, kprobes, stack tracing, profiling, etc).
2815
2816 Please disable this with care.
2817
2818 This can be disable (and enabled) with::
2819
2820   sysctl kernel.ftrace_enabled=0
2821   sysctl kernel.ftrace_enabled=1
2822
2823  or
2824
2825   echo 0 > /proc/sys/kernel/ftrace_enabled
2826   echo 1 > /proc/sys/kernel/ftrace_enabled
2827
2828
2829 Filter commands
2830 ---------------
2831
2832 A few commands are supported by the set_ftrace_filter interface.
2833 Trace commands have the following format::
2834
2835   <function>:<command>:<parameter>
2836
2837 The following commands are supported:
2838
2839 - mod:
2840   This command enables function filtering per module. The
2841   parameter defines the module. For example, if only the write*
2842   functions in the ext3 module are desired, run:
2843
2844    echo 'write*:mod:ext3' > set_ftrace_filter
2845
2846   This command interacts with the filter in the same way as
2847   filtering based on function names. Thus, adding more functions
2848   in a different module is accomplished by appending (>>) to the
2849   filter file. Remove specific module functions by prepending
2850   '!'::
2851
2852    echo '!writeback*:mod:ext3' >> set_ftrace_filter
2853
2854   Mod command supports module globbing. Disable tracing for all
2855   functions except a specific module::
2856
2857    echo '!*:mod:!ext3' >> set_ftrace_filter
2858
2859   Disable tracing for all modules, but still trace kernel::
2860
2861    echo '!*:mod:*' >> set_ftrace_filter
2862
2863   Enable filter only for kernel::
2864
2865    echo '*write*:mod:!*' >> set_ftrace_filter
2866
2867   Enable filter for module globbing::
2868
2869    echo '*write*:mod:*snd*' >> set_ftrace_filter
2870
2871 - traceon/traceoff:
2872   These commands turn tracing on and off when the specified
2873   functions are hit. The parameter determines how many times the
2874   tracing system is turned on and off. If unspecified, there is
2875   no limit. For example, to disable tracing when a schedule bug
2876   is hit the first 5 times, run::
2877
2878    echo '__schedule_bug:traceoff:5' > set_ftrace_filter
2879
2880   To always disable tracing when __schedule_bug is hit::
2881
2882    echo '__schedule_bug:traceoff' > set_ftrace_filter
2883
2884   These commands are cumulative whether or not they are appended
2885   to set_ftrace_filter. To remove a command, prepend it by '!'
2886   and drop the parameter::
2887
2888    echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
2889
2890   The above removes the traceoff command for __schedule_bug
2891   that have a counter. To remove commands without counters::
2892
2893    echo '!__schedule_bug:traceoff' > set_ftrace_filter
2894
2895 - snapshot:
2896   Will cause a snapshot to be triggered when the function is hit.
2897   ::
2898
2899    echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
2900
2901   To only snapshot once:
2902   ::
2903
2904    echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
2905
2906   To remove the above commands::
2907
2908    echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
2909    echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
2910
2911 - enable_event/disable_event:
2912   These commands can enable or disable a trace event. Note, because
2913   function tracing callbacks are very sensitive, when these commands
2914   are registered, the trace point is activated, but disabled in
2915   a "soft" mode. That is, the tracepoint will be called, but
2916   just will not be traced. The event tracepoint stays in this mode
2917   as long as there's a command that triggers it.
2918   ::
2919
2920    echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
2921          set_ftrace_filter
2922
2923   The format is::
2924
2925     <function>:enable_event:<system>:<event>[:count]
2926     <function>:disable_event:<system>:<event>[:count]
2927
2928   To remove the events commands::
2929
2930    echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
2931          set_ftrace_filter
2932    echo '!schedule:disable_event:sched:sched_switch' > \
2933          set_ftrace_filter
2934
2935 - dump:
2936   When the function is hit, it will dump the contents of the ftrace
2937   ring buffer to the console. This is useful if you need to debug
2938   something, and want to dump the trace when a certain function
2939   is hit. Perhaps its a function that is called before a tripple
2940   fault happens and does not allow you to get a regular dump.
2941
2942 - cpudump:
2943   When the function is hit, it will dump the contents of the ftrace
2944   ring buffer for the current CPU to the console. Unlike the "dump"
2945   command, it only prints out the contents of the ring buffer for the
2946   CPU that executed the function that triggered the dump.
2947
2948 trace_pipe
2949 ----------
2950
2951 The trace_pipe outputs the same content as the trace file, but
2952 the effect on the tracing is different. Every read from
2953 trace_pipe is consumed. This means that subsequent reads will be
2954 different. The trace is live.
2955 ::
2956
2957   # echo function > current_tracer
2958   # cat trace_pipe > /tmp/trace.out &
2959   [1] 4153
2960   # echo 1 > tracing_on
2961   # usleep 1
2962   # echo 0 > tracing_on
2963   # cat trace
2964   # tracer: function
2965   #
2966   # entries-in-buffer/entries-written: 0/0   #P:4
2967   #
2968   #                              _-----=> irqs-off
2969   #                             / _----=> need-resched
2970   #                            | / _---=> hardirq/softirq
2971   #                            || / _--=> preempt-depth
2972   #                            ||| /     delay
2973   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2974   #              | |       |   ||||       |         |
2975
2976   #
2977   # cat /tmp/trace.out
2978              bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
2979              bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
2980              bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
2981              bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
2982              bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
2983              bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
2984              bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
2985              bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
2986              bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
2987
2988
2989 Note, reading the trace_pipe file will block until more input is
2990 added.
2991
2992 trace entries
2993 -------------
2994
2995 Having too much or not enough data can be troublesome in
2996 diagnosing an issue in the kernel. The file buffer_size_kb is
2997 used to modify the size of the internal trace buffers. The
2998 number listed is the number of entries that can be recorded per
2999 CPU. To know the full size, multiply the number of possible CPUs
3000 with the number of entries.
3001 ::
3002
3003   # cat buffer_size_kb
3004   1408 (units kilobytes)
3005
3006 Or simply read buffer_total_size_kb
3007 ::
3008
3009   # cat buffer_total_size_kb
3010   5632
3011
3012 To modify the buffer, simple echo in a number (in 1024 byte segments).
3013 ::
3014
3015   # echo 10000 > buffer_size_kb
3016   # cat buffer_size_kb
3017   10000 (units kilobytes)
3018
3019 It will try to allocate as much as possible. If you allocate too
3020 much, it can cause Out-Of-Memory to trigger.
3021 ::
3022
3023   # echo 1000000000000 > buffer_size_kb
3024   -bash: echo: write error: Cannot allocate memory
3025   # cat buffer_size_kb
3026   85
3027
3028 The per_cpu buffers can be changed individually as well:
3029 ::
3030
3031   # echo 10000 > per_cpu/cpu0/buffer_size_kb
3032   # echo 100 > per_cpu/cpu1/buffer_size_kb
3033
3034 When the per_cpu buffers are not the same, the buffer_size_kb
3035 at the top level will just show an X
3036 ::
3037
3038   # cat buffer_size_kb
3039   X
3040
3041 This is where the buffer_total_size_kb is useful:
3042 ::
3043
3044   # cat buffer_total_size_kb
3045   12916
3046
3047 Writing to the top level buffer_size_kb will reset all the buffers
3048 to be the same again.
3049
3050 Snapshot
3051 --------
3052 CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
3053 available to all non latency tracers. (Latency tracers which
3054 record max latency, such as "irqsoff" or "wakeup", can't use
3055 this feature, since those are already using the snapshot
3056 mechanism internally.)
3057
3058 Snapshot preserves a current trace buffer at a particular point
3059 in time without stopping tracing. Ftrace swaps the current
3060 buffer with a spare buffer, and tracing continues in the new
3061 current (=previous spare) buffer.
3062
3063 The following tracefs files in "tracing" are related to this
3064 feature:
3065
3066   snapshot:
3067
3068         This is used to take a snapshot and to read the output
3069         of the snapshot. Echo 1 into this file to allocate a
3070         spare buffer and to take a snapshot (swap), then read
3071         the snapshot from this file in the same format as
3072         "trace" (described above in the section "The File
3073         System"). Both reads snapshot and tracing are executable
3074         in parallel. When the spare buffer is allocated, echoing
3075         0 frees it, and echoing else (positive) values clear the
3076         snapshot contents.
3077         More details are shown in the table below.
3078
3079         +--------------+------------+------------+------------+
3080         |status\\input |     0      |     1      |    else    |
3081         +==============+============+============+============+
3082         |not allocated |(do nothing)| alloc+swap |(do nothing)|
3083         +--------------+------------+------------+------------+
3084         |allocated     |    free    |    swap    |   clear    |
3085         +--------------+------------+------------+------------+
3086
3087 Here is an example of using the snapshot feature.
3088 ::
3089
3090   # echo 1 > events/sched/enable
3091   # echo 1 > snapshot
3092   # cat snapshot
3093   # tracer: nop
3094   #
3095   # entries-in-buffer/entries-written: 71/71   #P:8
3096   #
3097   #                              _-----=> irqs-off
3098   #                             / _----=> need-resched
3099   #                            | / _---=> hardirq/softirq
3100   #                            || / _--=> preempt-depth
3101   #                            ||| /     delay
3102   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3103   #              | |       |   ||||       |         |
3104             <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120   prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3105              sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120   prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3106   [...]
3107           <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3108
3109   # cat trace
3110   # tracer: nop
3111   #
3112   # entries-in-buffer/entries-written: 77/77   #P:8
3113   #
3114   #                              _-----=> irqs-off
3115   #                             / _----=> need-resched
3116   #                            | / _---=> hardirq/softirq
3117   #                            || / _--=> preempt-depth
3118   #                            ||| /     delay
3119   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3120   #              | |       |   ||||       |         |
3121             <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3122    snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3123   [...]
3124
3125
3126 If you try to use this snapshot feature when current tracer is
3127 one of the latency tracers, you will get the following results.
3128 ::
3129
3130   # echo wakeup > current_tracer
3131   # echo 1 > snapshot
3132   bash: echo: write error: Device or resource busy
3133   # cat snapshot
3134   cat: snapshot: Device or resource busy
3135
3136
3137 Instances
3138 ---------
3139 In the tracefs tracing directory is a directory called "instances".
3140 This directory can have new directories created inside of it using
3141 mkdir, and removing directories with rmdir. The directory created
3142 with mkdir in this directory will already contain files and other
3143 directories after it is created.
3144 ::
3145
3146   # mkdir instances/foo
3147   # ls instances/foo
3148   buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
3149   set_event  snapshot  trace  trace_clock  trace_marker  trace_options
3150   trace_pipe  tracing_on
3151
3152 As you can see, the new directory looks similar to the tracing directory
3153 itself. In fact, it is very similar, except that the buffer and
3154 events are agnostic from the main director, or from any other
3155 instances that are created.
3156
3157 The files in the new directory work just like the files with the
3158 same name in the tracing directory except the buffer that is used
3159 is a separate and new buffer. The files affect that buffer but do not
3160 affect the main buffer with the exception of trace_options. Currently,
3161 the trace_options affect all instances and the top level buffer
3162 the same, but this may change in future releases. That is, options
3163 may become specific to the instance they reside in.
3164
3165 Notice that none of the function tracer files are there, nor is
3166 current_tracer and available_tracers. This is because the buffers
3167 can currently only have events enabled for them.
3168 ::
3169
3170   # mkdir instances/foo
3171   # mkdir instances/bar
3172   # mkdir instances/zoot
3173   # echo 100000 > buffer_size_kb
3174   # echo 1000 > instances/foo/buffer_size_kb
3175   # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
3176   # echo function > current_trace
3177   # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3178   # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3179   # echo 1 > instances/foo/events/sched/sched_switch/enable
3180   # echo 1 > instances/bar/events/irq/enable
3181   # echo 1 > instances/zoot/events/syscalls/enable
3182   # cat trace_pipe
3183   CPU:2 [LOST 11745 EVENTS]
3184               bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3185               bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3186               bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3187               bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3188               bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3189               bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3190               bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3191               bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3192               bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3193               bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3194               bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3195   [...]
3196
3197   # cat instances/foo/trace_pipe
3198               bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3199               bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3200             <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3201             <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3202        rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3203               bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3204               bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3205               bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3206        kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3207        kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3208   [...]
3209
3210   # cat instances/bar/trace_pipe
3211        migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
3212             <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
3213               bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
3214               bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
3215               bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
3216               bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
3217               bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
3218               bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
3219               sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3220               sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
3221               sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
3222               sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
3223   [...]
3224
3225   # cat instances/zoot/trace
3226   # tracer: nop
3227   #
3228   # entries-in-buffer/entries-written: 18996/18996   #P:4
3229   #
3230   #                              _-----=> irqs-off
3231   #                             / _----=> need-resched
3232   #                            | / _---=> hardirq/softirq
3233   #                            || / _--=> preempt-depth
3234   #                            ||| /     delay
3235   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3236   #              | |       |   ||||       |         |
3237               bash-1998  [000] d...   140.733501: sys_write -> 0x2
3238               bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
3239               bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
3240               bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3241               bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
3242               bash-1998  [000] d...   140.733510: sys_close(fd: a)
3243               bash-1998  [000] d...   140.733510: sys_close -> 0x0
3244               bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3245               bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
3246               bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3247               bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
3248
3249 You can see that the trace of the top most trace buffer shows only
3250 the function tracing. The foo instance displays wakeups and task
3251 switches.
3252
3253 To remove the instances, simply delete their directories:
3254 ::
3255
3256   # rmdir instances/foo
3257   # rmdir instances/bar
3258   # rmdir instances/zoot
3259
3260 Note, if a process has a trace file open in one of the instance
3261 directories, the rmdir will fail with EBUSY.
3262
3263
3264 Stack trace
3265 -----------
3266 Since the kernel has a fixed sized stack, it is important not to
3267 waste it in functions. A kernel developer must be conscience of
3268 what they allocate on the stack. If they add too much, the system
3269 can be in danger of a stack overflow, and corruption will occur,
3270 usually leading to a system panic.
3271
3272 There are some tools that check this, usually with interrupts
3273 periodically checking usage. But if you can perform a check
3274 at every function call that will become very useful. As ftrace provides
3275 a function tracer, it makes it convenient to check the stack size
3276 at every function call. This is enabled via the stack tracer.
3277
3278 CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3279 To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3280 ::
3281
3282  # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3283
3284 You can also enable it from the kernel command line to trace
3285 the stack size of the kernel during boot up, by adding "stacktrace"
3286 to the kernel command line parameter.
3287
3288 After running it for a few minutes, the output looks like:
3289 ::
3290
3291   # cat stack_max_size
3292   2928
3293
3294   # cat stack_trace
3295           Depth    Size   Location    (18 entries)
3296           -----    ----   --------
3297     0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
3298     1)     2704     160   find_busiest_group+0x31/0x1f1
3299     2)     2544     256   load_balance+0xd9/0x662
3300     3)     2288      80   idle_balance+0xbb/0x130
3301     4)     2208     128   __schedule+0x26e/0x5b9
3302     5)     2080      16   schedule+0x64/0x66
3303     6)     2064     128   schedule_timeout+0x34/0xe0
3304     7)     1936     112   wait_for_common+0x97/0xf1
3305     8)     1824      16   wait_for_completion+0x1d/0x1f
3306     9)     1808     128   flush_work+0xfe/0x119
3307    10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
3308    11)     1664      48   input_available_p+0x1d/0x5c
3309    12)     1616      48   n_tty_poll+0x6d/0x134
3310    13)     1568      64   tty_poll+0x64/0x7f
3311    14)     1504     880   do_select+0x31e/0x511
3312    15)      624     400   core_sys_select+0x177/0x216
3313    16)      224      96   sys_select+0x91/0xb9
3314    17)      128     128   system_call_fastpath+0x16/0x1b
3315
3316 Note, if -mfentry is being used by gcc, functions get traced before
3317 they set up the stack frame. This means that leaf level functions
3318 are not tested by the stack tracer when -mfentry is used.
3319
3320 Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3321
3322 More
3323 ----
3324 More details can be found in the source code, in the `kernel/trace/*.c` files.