Documentation/ww-mutex-design.txt

   1 Wait/Wound Deadlock-Proof Mutex Design
   2 ======================================
   3
   4 Please read mutex-design.txt first, as it applies to wait/wound mutexes too.
   5
   6 Motivation for WW-Mutexes
   7 -------------------------
   8
   9 GPU's do operations that commonly involve many buffers.  Those buffers
  10 can be shared across contexts/processes, exist in different memory
  11 domains (for example VRAM vs system memory), and so on.  And with
  12 PRIME / dmabuf, they can even be shared across devices.  So there are
  13 a handful of situations where the driver needs to wait for buffers to
  14 become ready.  If you think about this in terms of waiting on a buffer
  15 mutex for it to become available, this presents a problem because
  16 there is no way to guarantee that buffers appear in a execbuf/batch in
  17 the same order in all contexts.  That is directly under control of
  18 userspace, and a result of the sequence of GL calls that an application
  19 makes.  Which results in the potential for deadlock.  The problem gets
  20 more complex when you consider that the kernel may need to migrate the
  21 buffer(s) into VRAM before the GPU operates on the buffer(s), which
  22 may in turn require evicting some other buffers (and you don't want to
  23 evict other buffers which are already queued up to the GPU), but for a
  24 simplified understanding of the problem you can ignore this.
  25
  26 The algorithm that the TTM graphics subsystem came up with for dealing with
  27 this problem is quite simple.  For each group of buffers (execbuf) that need
  28 to be locked, the caller would be assigned a unique reservation id/ticket,
  29 from a global counter.  In case of deadlock while locking all the buffers
  30 associated with a execbuf, the one with the lowest reservation ticket (i.e.
  31 the oldest task) wins, and the one with the higher reservation id (i.e. the
  32 younger task) unlocks all of the buffers that it has already locked, and then
  33 tries again.
  34
  35 In the RDBMS literature this deadlock handling approach is called wait/wound:
  36 The older tasks waits until it can acquire the contended lock. The younger tasks
  37 needs to back off and drop all the locks it is currently holding, i.e. the
  38 younger task is wounded.
  39
  40 Concepts
  41 --------
  42
  43 Compared to normal mutexes two additional concepts/objects show up in the lock
  44 interface for w/w mutexes:
  45
  46 Acquire context: To ensure eventual forward progress it is important the a task
  47 trying to acquire locks doesn't grab a new reservation id, but keeps the one it
  48 acquired when starting the lock acquisition. This ticket is stored in the
  49 acquire context. Furthermore the acquire context keeps track of debugging state
  50 to catch w/w mutex interface abuse.
  51
  52 W/w class: In contrast to normal mutexes the lock class needs to be explicit for
  53 w/w mutexes, since it is required to initialize the acquire context.
  54
  55 Furthermore there are three different class of w/w lock acquire functions:
  56
  57 * Normal lock acquisition with a context, using ww_mutex_lock.
  58
  59 * Slowpath lock acquisition on the contending lock, used by the wounded task
  60   after having dropped all already acquired locks. These functions have the
  61   _slow postfix.
  62
  63   From a simple semantics point-of-view the _slow functions are not strictly
  64   required, since simply calling the normal ww_mutex_lock functions on the
  65   contending lock (after having dropped all other already acquired locks) will
  66   work correctly. After all if no other ww mutex has been acquired yet there's
  67   no deadlock potential and hence the ww_mutex_lock call will block and not
  68   prematurely return -EDEADLK. The advantage of the _slow functions is in
  69   interface safety:
  70   - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
  71     has a void return type. Note that since ww mutex code needs loops/retries
  72     anyway the __must_check doesn't result in spurious warnings, even though the
  73     very first lock operation can never fail.
  74   - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
  75     ww mutex have been released (preventing deadlocks) and makes sure that we
  76     block on the contending lock (preventing spinning through the -EDEADLK
  77     slowpath until the contended lock can be acquired).
  78
  79 * Functions to only acquire a single w/w mutex, which results in the exact same
  80   semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  81   context.
  82
  83   Again this is not strictly required. But often you only want to acquire a
  84   single lock in which case it's pointless to set up an acquire context (and so
  85   better to avoid grabbing a deadlock avoidance ticket).
  86
  87 Of course, all the usual variants for handling wake-ups due to signals are also
  88 provided.
  89
  90 Usage
  91 -----
  92
  93 Three different ways to acquire locks within the same w/w class. Common
  94 definitions for methods #1 and #2:
  95
  96 static DEFINE_WW_CLASS(ww_class);
  97
  98 struct obj {
  99         struct ww_mutex lock;
 100         /* obj data */
 101 };
 102
 103 struct obj_entry {
 104         struct list_head head;
 105         struct obj *obj;
 106 };
 107
 108 Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
 109 This is useful if a list of required objects is already tracked somewhere.
 110 Furthermore the lock helper can use propagate the -EALREADY return code back to
 111 the caller as a signal that an object is twice on the list. This is useful if
 112 the list is constructed from userspace input and the ABI requires userspace to
 113 not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).
 114
 115 int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
 116 {
 117         struct obj *res_obj = NULL;
 118         struct obj_entry *contended_entry = NULL;
 119         struct obj_entry *entry;
 120
 121         ww_acquire_init(ctx, &ww_class);
 122
 123 retry:
 124         list_for_each_entry (entry, list, head) {
 125                 if (entry->obj == res_obj) {
 126                         res_obj = NULL;
 127                         continue;
 128                 }
 129                 ret = ww_mutex_lock(&entry->obj->lock, ctx);
 130                 if (ret < 0) {
 131                         contended_entry = entry;
 132                         goto err;
 133                 }
 134         }
 135
 136         ww_acquire_done(ctx);
 137         return 0;
 138
 139 err:
 140         list_for_each_entry_continue_reverse (entry, list, head)
 141                 ww_mutex_unlock(&entry->obj->lock);
 142
 143         if (res_obj)
 144                 ww_mutex_unlock(&res_obj->lock);
 145
 146         if (ret == -EDEADLK) {
 147                 /* we lost out in a seqno race, lock and retry.. */
 148                 ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
 149                 res_obj = contended_entry->obj;
 150                 goto retry;
 151         }
 152         ww_acquire_fini(ctx);
 153
 154         return ret;
 155 }
 156
 157 Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
 158 of duplicate entry detection using -EALREADY as method 1 above. But the
 159 list-reordering allows for a bit more idiomatic code.
 160
 161 int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
 162 {
 163         struct obj_entry *entry, *entry2;
 164
 165         ww_acquire_init(ctx, &ww_class);
 166
 167         list_for_each_entry (entry, list, head) {
 168                 ret = ww_mutex_lock(&entry->obj->lock, ctx);
 169                 if (ret < 0) {
 170                         entry2 = entry;
 171
 172                         list_for_each_entry_continue_reverse (entry2, list, head)
 173                                 ww_mutex_unlock(&entry2->obj->lock);
 174
 175                         if (ret != -EDEADLK) {
 176                                 ww_acquire_fini(ctx);
 177                                 return ret;
 178                         }
 179
 180                         /* we lost out in a seqno race, lock and retry.. */
 181                         ww_mutex_lock_slow(&entry->obj->lock, ctx);
 182
 183                         /*
 184                          * Move buf to head of the list, this will point
 185                          * buf->next to the first unlocked entry,
 186                          * restarting the for loop.
 187                          */
 188                         list_del(&entry->head);
 189                         list_add(&entry->head, list);
 190                 }
 191         }
 192
 193         ww_acquire_done(ctx);
 194         return 0;
 195 }
 196
 197 Unlocking works the same way for both methods #1 and #2:
 198
 199 void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
 200 {
 201         struct obj_entry *entry;
 202
 203         list_for_each_entry (entry, list, head)
 204                 ww_mutex_unlock(&entry->obj->lock);
 205
 206         ww_acquire_fini(ctx);
 207 }
 208
 209 Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
 210 e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
 211 and edges can only be changed when holding the locks of all involved nodes. w/w
 212 mutexes are a natural fit for such a case for two reasons:
 213 - They can handle lock-acquisition in any order which allows us to start walking
 214   a graph from a starting point and then iteratively discovering new edges and
 215   locking down the nodes those edges connect to.
 216 - Due to the -EALREADY return code signalling that a given objects is already
 217   held there's no need for additional book-keeping to break cycles in the graph
 218   or keep track off which looks are already held (when using more than one node
 219   as a starting point).
 220
 221 Note that this approach differs in two important ways from the above methods:
 222 - Since the list of objects is dynamically constructed (and might very well be
 223   different when retrying due to hitting the -EDEADLK wound condition) there's
 224   no need to keep any object on a persistent list when it's not locked. We can
 225   therefore move the list_head into the object itself.
 226 - On the other hand the dynamic object list construction also means that the -EALREADY return
 227   code can't be propagated.
 228
 229 Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a
 230 list of starting nodes (passed in from userspace) using one of the above
 231 methods. And then lock any additional objects affected by the operations using
 232 method #3 below. The backoff/retry procedure will be a bit more involved, since
 233 when the dynamic locking step hits -EDEADLK we also need to unlock all the
 234 objects acquired with the fixed list. But the w/w mutex debug checks will catch
 235 any interface misuse for these cases.
 236
 237 Also, method 3 can't fail the lock acquisition step since it doesn't return
 238 -EALREADY. Of course this would be different when using the _interruptible
 239 variants, but that's outside of the scope of these examples here.
 240
 241 struct obj {
 242         struct ww_mutex ww_mutex;
 243         struct list_head locked_list;
 244 };
 245
 246 static DEFINE_WW_CLASS(ww_class);
 247
 248 void __unlock_objs(struct list_head *list)
 249 {
 250         struct obj *entry, *temp;
 251
 252         list_for_each_entry_safe (entry, temp, list, locked_list) {
 253                 /* need to do that before unlocking, since only the current lock holder is
 254                 allowed to use object */
 255                 list_del(&entry->locked_list);
 256                 ww_mutex_unlock(entry->ww_mutex)
 257         }
 258 }
 259
 260 void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
 261 {
 262         struct obj *obj;
 263
 264         ww_acquire_init(ctx, &ww_class);
 265
 266 retry:
 267         /* re-init loop start state */
 268         loop {
 269                 /* magic code which walks over a graph and decides which objects
 270                  * to lock */
 271
 272                 ret = ww_mutex_lock(obj->ww_mutex, ctx);
 273                 if (ret == -EALREADY) {
 274                         /* we have that one already, get to the next object */
 275                         continue;
 276                 }
 277                 if (ret == -EDEADLK) {
 278                         __unlock_objs(list);
 279
 280                         ww_mutex_lock_slow(obj, ctx);
 281                         list_add(&entry->locked_list, list);
 282                         goto retry;
 283                 }
 284
 285                 /* locked a new object, add it to the list */
 286                 list_add_tail(&entry->locked_list, list);
 287         }
 288
 289         ww_acquire_done(ctx);
 290         return 0;
 291 }
 292
 293 void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
 294 {
 295         __unlock_objs(list);
 296         ww_acquire_fini(ctx);
 297 }
 298
 299 Method 4: Only lock one single objects. In that case deadlock detection and
 300 prevention is obviously overkill, since with grabbing just one lock you can't
 301 produce a deadlock within just one class. To simplify this case the w/w mutex
 302 api can be used with a NULL context.
 303
 304 Implementation Details
 305 ----------------------
 306
 307 Design:
 308   ww_mutex currently encapsulates a struct mutex, this means no extra overhead for
 309   normal mutex locks, which are far more common. As such there is only a small
 310   increase in code size if wait/wound mutexes are not used.
 311
 312   In general, not much contention is expected. The locks are typically used to
 313   serialize access to resources for devices. The only way to make wakeups
 314   smarter would be at the cost of adding a field to struct mutex_waiter. This
 315   would add overhead to all cases where normal mutexes are used, and
 316   ww_mutexes are generally less performance sensitive.
 317
 318 Lockdep:
 319   Special care has been taken to warn for as many cases of api abuse
 320   as possible. Some common api abuses will be caught with
 321   CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.
 322
 323   Some of the errors which will be warned about:
 324    - Forgetting to call ww_acquire_fini or ww_acquire_init.
 325    - Attempting to lock more mutexes after ww_acquire_done.
 326    - Attempting to lock the wrong mutex after -EDEADLK and
 327      unlocking all mutexes.
 328    - Attempting to lock the right mutex after -EDEADLK,
 329      before unlocking all mutexes.
 330
 331    - Calling ww_mutex_lock_slow before -EDEADLK was returned.
 332
 333    - Unlocking mutexes with the wrong unlock function.
 334    - Calling one of the ww_acquire_* twice on the same context.
 335    - Using a different ww_class for the mutex than for the ww_acquire_ctx.
 336    - Normal lockdep errors that can result in deadlocks.
 337
 338   Some of the lockdep errors that can result in deadlocks:
 339    - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
 340      having called ww_acquire_fini on the first.
 341    - 'normal' deadlocks that can occur.
 342
 343 FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
 344 implemented.