The health mechanism is targeted at real-time alerting, so that the user knows
when something bad has happened to a PCI device. Its goals are:
- Provide alert debug information
- Self healing
- If a problem needs vendor support, provide a way to gather all needed
  debugging information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All health report handling is done by devlink.
A device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.
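
As a rough illustration of per-reporter callbacks, the following userspace C
sketch models a reporter as a name plus a table of optional handlers. It is
not the kernel's devlink API: struct reporter_ops, struct reporter,
reporter_recover(), reporter_diagnose() and the "tx" handlers are hypothetical
names chosen for this example, and returning -EOPNOTSUPP for a missing
callback is an assumption about reasonable behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model, not the kernel API: every callback is optional. */
struct reporter;

struct reporter_ops {
	const char *name;                     /* e.g. "tx" or "fw" */
	int (*recover)(struct reporter *r);   /* recovery procedure */
	int (*diagnose)(struct reporter *r);  /* diagnostics procedure */
	int (*dump)(struct reporter *r);      /* object dump procedure */
};

struct reporter {
	const struct reporter_ops *ops;
	void *priv;                           /* driver private data */
};

/* Run the recovery callback, or report that the reporter has none. */
static int reporter_recover(struct reporter *r)
{
	assert(r != NULL && r->ops != NULL);
	if (!r->ops->recover)
		return -EOPNOTSUPP;
	return r->ops->recover(r);
}

/* Run the diagnostics callback, or report that the reporter has none. */
static int reporter_diagnose(struct reporter *r)
{
	assert(r != NULL && r->ops != NULL);
	if (!r->ops->diagnose)
		return -EOPNOTSUPP;
	return r->ops->diagnose(r);
}

/* Driver-side handler for a made-up "tx" reporter: count the resets. */
static int tx_recover(struct reporter *r)
{
	int *resets = r->priv;

	(*resets)++;
	return 0;
}

static const struct reporter_ops tx_ops = {
	.name = "tx",
	.recover = tx_recover,
	/* this reporter provides no diagnose or dump callback */
};
```

Another part of the same driver could define a second ops table (for example
for firmware errors) with a different set of handlers, which is the point of
keeping the callbacks per reporter.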

Once an error is reported, devlink health performs the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:
    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

The user interface:
The user can access/change each reporter's parameters and driver-specific
callbacks via devlink, e.g. per error type (per health reporter):
 - Configure the reporter's generic parameters (like: disable/enable auto
   recovery)
 - Invoke the recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health saves a single dump.
  If a dump is not already stored by devlink for this reporter, devlink
  generates a new dump. The dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.
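
The single-dump behaviour of DUMP_GET and DUMP_CLEAR can be modelled in a few
lines of userspace C. The names struct dump_slot, generate_dump(), dump_get()
and dump_clear() are hypothetical, and the "dump #N" string merely stands in
for the reporter-defined dump output; the real netlink handlers are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model: a reporter keeps at most one stored dump. */
struct dump_slot {
	bool has_dump;
	char data[64];
	unsigned generation;   /* number of dumps generated so far */
};

/* Stand-in for the reporter-defined dump callback. */
static void generate_dump(struct dump_slot *s)
{
	s->generation++;
	snprintf(s->data, sizeof(s->data), "dump #%u", s->generation);
	s->has_dump = true;
}

/* DUMP_GET: return the stored dump, generating one first if none exists. */
static const char *dump_get(struct dump_slot *s)
{
	assert(s != NULL);
	if (!s->has_dump)
		generate_dump(s);
	return s->data;
}

/* DUMP_CLEAR: drop the stored dump so the next get generates a new one. */
static void dump_clear(struct dump_slot *s)
{
	assert(s != NULL);
	s->has_dump = false;
	s->data[0] = '\0';
}
```

Calling dump_get() twice in a row returns the same stored dump; only after
dump_clear() does the next dump_get() produce a fresh one.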

                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+