=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader: a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).
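
Before any command can be sent, a generic netlink client normally has to look
up the numeric id of the family it wants to talk to. The following is a minimal
sketch of that step, not a reference implementation. It assumes the family is
registered under the name "TASKSTATS" and uses constant values as defined in
<linux/netlink.h> and <linux/genetlink.h>; the helper names are made up for
this illustration, and resolve_taskstats_family() only works on a kernel built
with taskstats support.

```python
import socket
import struct

# Values as defined in <linux/netlink.h> and <linux/genetlink.h>.
NETLINK_GENERIC = 16
NLM_F_REQUEST = 1
NLMSG_ERROR = 2
GENL_ID_CTRL = 0x10
CTRL_CMD_GETFAMILY = 3
CTRL_ATTR_FAMILY_ID = 1
CTRL_ATTR_FAMILY_NAME = 2


def nlattr(attr_type, payload):
    """Pack one netlink attribute and pad it to a 4-byte boundary."""
    length = 4 + len(payload)
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * (-length % 4)


def getfamily_request(name, seq=1):
    """Build a CTRL_CMD_GETFAMILY request for a generic netlink family name."""
    body = struct.pack("=BBH", CTRL_CMD_GETFAMILY, 1, 0)
    body += nlattr(CTRL_ATTR_FAMILY_NAME, name.encode("ascii") + b"\0")
    header = struct.pack("=IHHII", 16 + len(body), GENL_ID_CTRL,
                         NLM_F_REQUEST, seq, 0)
    return header + body


def family_id_from_reply(reply):
    """Extract CTRL_ATTR_FAMILY_ID from a controller reply."""
    msg_len, msg_type = struct.unpack_from("=IH", reply, 0)
    if msg_type == NLMSG_ERROR:
        raise OSError("netlink error reply; is taskstats enabled in this kernel?")
    offset = 20  # 16-byte nlmsghdr + 4-byte genlmsghdr
    while offset + 4 <= min(msg_len, len(reply)):
        length, attr_type = struct.unpack_from("=HH", reply, offset)
        if length < 4:
            break
        if attr_type == CTRL_ATTR_FAMILY_ID:
            return struct.unpack_from("=H", reply, offset + 4)[0]
        offset += (length + 3) & ~3
    raise LookupError("no family id in reply")


def resolve_taskstats_family():
    """Open a unicast NETLINK_GENERIC socket and look up "TASKSTATS"."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_GENERIC) as sock:
        sock.bind((0, 0))  # port id 0: kernel assigns one; no multicast groups
        sock.send(getfamily_request("TASKSTATS"))
        return family_id_from_reply(sock.recv(8192))
```

The returned id is what later goes into the nlmsg_type field of every
taskstats command sent on the same kind of socket.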

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface, as
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.
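
A register command could be assembled roughly as follows. This is a sketch
under stated assumptions: the constant values are those found in taskstats.h,
the cpumask is sent as a NUL-terminated ascii string in the format described in
the Interface section, and family_id is a placeholder for the id obtained from
the generic netlink controller. The function name is hypothetical.

```python
import struct

# Values as defined in taskstats.h and <linux/netlink.h>.
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK = 3
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK = 4
NLM_F_REQUEST = 1


def cpumask_command(family_id, cpumask, register=True, seq=1):
    """Build a command that registers (or deregisters) interest in exit data."""
    attr_type = (TASKSTATS_CMD_ATTR_REGISTER_CPUMASK if register
                 else TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK)
    payload = cpumask.encode("ascii") + b"\0"
    length = 4 + len(payload)                      # nlattr header + string
    attr = struct.pack("=HH", length, attr_type) + payload + b"\0" * (-length % 4)
    body = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0) + attr
    header = struct.pack("=IHHII", 16 + len(body), family_id,
                         NLM_F_REQUEST, seq, 0)
    return header + body
```

Sending the same mask with register=False before closing the socket gives the
explicit deregistration that the interface description recommends.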

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+

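
To make the layout concrete, the sketch below splits one received message into
the parts shown in the diagram. It assumes the usual header layouts (a 16-byte
nlmsghdr and a 4-byte genlmsghdr, each rounded up to a 4-byte boundary), so the
pad works out to zero bytes here; split_message is a name invented for this
example.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")  # nlmsg_len, nlmsg_type, flags, seq, pid
GENL_HDR = struct.Struct("=BBH")     # cmd, version, reserved


def align4(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def split_message(data):
    """Split one netlink message into header fields and the taskstats payload."""
    nl_len, nl_type, _flags, nl_seq, _pid = NLMSG_HDR.unpack_from(data, 0)
    genl_off = align4(NLMSG_HDR.size)            # skips the "Pad" in the diagram
    cmd, version, _reserved = GENL_HDR.unpack_from(data, genl_off)
    payload_off = genl_off + align4(GENL_HDR.size)
    return {
        "family": nl_type,                       # generic netlink family id
        "seq": nl_seq,
        "cmd": cmd,
        "version": version,
        "payload": bytes(data[payload_off:nl_len]),
    }
```

The payload returned here is the sequence of netlink attributes whose three
kinds are described next.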

The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, and contain a cpumask in
   the attribute payload. The cpumask is specified as an ascii string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus
   1,2,3,5,7,8 the cpumask would be "1-3,5,7-8". If userspace forgets to
   deregister interest in cpus before closing the listening socket, the kernel
   cleans up its interest set over time. However, for the sake of efficiency,
   an explicit deregistration is advisable.

2. Response for a command: Sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload of its
      own; it indicates that a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following type:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process

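
The three payload kinds can be exercised with a short sketch: one function
builds a command for a pid or tgid, another walks the attributes of a reply or
exit message. Two assumptions are worth flagging. First, the constant values
are those found in taskstats.h. Second, the parser treats the pid/tgid and
stats attributes as nested inside the AGGR attribute, which is how getdelays.c
walks them; attribute types it does not recognise are skipped. family_id and
all function names are placeholders for illustration.

```python
import struct

# Values as defined in taskstats.h and <linux/netlink.h>.
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2
TASKSTATS_TYPE_PID = 1
TASKSTATS_TYPE_TGID = 2
TASKSTATS_TYPE_STATS = 3
TASKSTATS_TYPE_AGGR_PID = 4
TASKSTATS_TYPE_AGGR_TGID = 5
NLM_F_REQUEST = 1
NLA_TYPE_MASK = 0x3FFF


def nlattr(attr_type, payload):
    """Pack one netlink attribute and pad it to a 4-byte boundary."""
    length = 4 + len(payload)
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * (-length % 4)


def get_command(family_id, ident, per_tgid=False, seq=1):
    """Kind 1: ask for the stats of one pid, or of one tgid if per_tgid is set."""
    attr_type = TASKSTATS_CMD_ATTR_TGID if per_tgid else TASKSTATS_CMD_ATTR_PID
    body = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    body += nlattr(attr_type, struct.pack("=I", ident))
    return struct.pack("=IHHII", 16 + len(body), family_id,
                       NLM_F_REQUEST, seq, 0) + body


def attributes(buf):
    """Yield (type, payload) for each netlink attribute in buf."""
    offset = 0
    while offset + 4 <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < 4 or offset + length > len(buf):
            break
        yield attr_type & NLA_TYPE_MASK, buf[offset + 4:offset + length]
        offset += (length + 3) & ~3


def parse_payload(payload):
    """Kinds 2 and 3: return a list of (kind, id, raw struct taskstats bytes)."""
    records = []
    for aggr_type, nested in attributes(payload):
        if aggr_type == TASKSTATS_TYPE_AGGR_PID:
            kind = "pid"
        elif aggr_type == TASKSTATS_TYPE_AGGR_TGID:
            kind = "tgid"
        else:
            continue  # not an aggregate this sketch knows about
        ident, stats = None, None
        for attr_type, data in attributes(nested):
            if attr_type in (TASKSTATS_TYPE_PID, TASKSTATS_TYPE_TGID):
                ident = struct.unpack_from("=I", data)[0]
            elif attr_type == TASKSTATS_TYPE_STATS:
                stats = data
        records.append((kind, ident, stats))
    return records
```

A reply to a command yields one record; an exit message yields a "pid" record
and, when the last thread of the group exits, a "tgid" record as well.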

Per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries to get per-tgid data, the stats of all live threads in
the group are summed and added to the accumulated total for previously exited
threads of the same thread group.
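
The accounting scheme described above can be modelled in a few lines. This is
a toy illustration of the bookkeeping only, not the kernel's data structures;
the class, its methods and the cpu_ns field are all hypothetical.

```python
from collections import Counter


class ThreadGroup:
    """Toy model of per-tgid accounting for one process."""

    def __init__(self):
        self.live = {}            # tid -> Counter of stats for running threads
        self.exited = Counter()   # accumulated stats of threads that have exited

    def account(self, tid, **stats):
        """Charge some resource usage to a live thread."""
        self.live.setdefault(tid, Counter()).update(stats)

    def exit_thread(self, tid):
        """Fold an exiting thread into the process-wide total.

        Returns (per-tid stats, per-tgid stats); the second item is None
        unless this was the last thread of the group.
        """
        stats = self.live.pop(tid)
        self.exited.update(stats)
        tgid_stats = None if self.live else Counter(self.exited)
        return stats, tgid_stats

    def query_tgid(self):
        """Sum the live threads and add the total of already exited threads."""
        total = Counter(self.exited)
        for stats in self.live.values():
            total.update(stats)
        return total
```

The per-tgid total stays the same across a thread exit because the exiting
thread's contribution moves from the live set into the accumulated total.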

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).

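
From the userspace side, both approaches come down to defensive decoding. The
sketch below shows one way a tool might do that. The field table is purely
hypothetical (real offsets and the version in which each field appeared must
come from taskstats.h); the only layout fact it relies on is that the version
number is the first field of struct taskstats.

```python
import struct

# Hypothetical field table, for illustration only.
# name: (byte offset, struct format, first version that has the field)
FIELDS = {
    "old_counter": (8, "=Q", 1),
    "new_counter": (16, "=Q", 2),
}


def decode_stats(blob):
    """Approach 1: use only fields that exist in the version the kernel reports."""
    version = struct.unpack_from("=H", blob, 0)[0]
    out = {"version": version}
    for name, (offset, fmt, since) in FIELDS.items():
        if version >= since and offset + struct.calcsize(fmt) <= len(blob):
            out[name] = struct.unpack_from(fmt, blob, offset)[0]
    return out   # bytes appended by newer kernels are simply never read


def pick_known(attrs, handlers):
    """Approach 2: dispatch attributes by type and skip types with no handler."""
    return {t: handlers[t](payload) for t, payload in attrs if t in handlers}
```

A tool written this way keeps working when the struct grows at the end or when
new attribute types start to appear in messages.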

Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.
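
Splitting cpus across listeners mostly means producing one cpumask string per
listener. A minimal sketch, using the range format described in the Interface
section; both function names are invented for this example, and the even,
contiguous split is just one possible policy.

```python
def cpumask_string(cpus):
    """Format cpu numbers as comma-separated ranges, e.g. "1-3,5,7-8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cpus(cpus, listeners):
    """Deal cpus out in contiguous chunks, one cpumask string per listener."""
    cpus = sorted(set(cpus))
    if not cpus or listeners < 1:
        return []
    size = -(-len(cpus) // listeners)    # ceiling division
    return [cpumask_string(cpus[i:i + size]) for i in range(0, len(cpus), size)]
```

Each listener would then register its own mask on its own socket. In Python,
the receive buffer can be enlarged with
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size) and the listener
pinned with os.sched_setaffinity(0, cpus); the kernel may cap the buffer size
actually granted.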

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
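
What handling the loss means depends on the tool. As a minimal sketch, a
receive loop can at least notice and count overflows instead of aborting. The
recv and handle callables are placeholders; with a real socket, recv could be
lambda: sock.recv(65536).

```python
import errno


def drain(recv, handle, max_messages):
    """Receive up to max_messages datagrams; return the number of overflows.

    An ENOBUFS error means the kernel dropped exit records for this
    listener, so any totals built from the stream are incomplete.
    """
    overflows = 0
    for _ in range(max_messages):
        try:
            data = recv()
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise                     # unrelated error: let the caller see it
            overflows += 1
            continue
        handle(data)
    return overflows
```

A non-zero return value is a signal to mark the collected data as lossy and,
possibly, to enlarge buffers or add listeners before continuing.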