Documentation/locking/ww-mutex-design.txt

Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wound/wait mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers.  Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on.  And with
PRIME / dmabuf, they can even be shared across devices.  So there are
a handful of situations where the driver needs to wait for buffers to
become ready.  If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts.  The order is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock.  The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple.  For each group of buffers (execbuf) that needs
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter.  In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.
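
The ticket scheme above can be modelled in a few lines of userspace C. This
is only an illustrative sketch: struct txn, txn_begin() and
txn_must_back_off() are hypothetical names, not kernel API, and the plain
integer comparison ignores counter wraparound.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct txn {
        unsigned long ticket;   /* reservation ticket; lower means older */
};

static unsigned long next_ticket;       /* global ticket counter */

/* Hand out one ticket per transaction; it is kept across retries. */
static void txn_begin(struct txn *t)
{
        t->ticket = ++next_ticket;
}

/* On deadlock, true if 'a' is the younger transaction and must back off. */
static bool txn_must_back_off(const struct txn *a, const struct txn *b)
{
        return a->ticket > b->ticket;
}
```

The transaction that backs off keeps its ticket when it retries, so it grows
relatively older over time and eventually wins.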

In the RDBMS literature, a reservation ticket is associated with a transaction,
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, the locking transaction waits
for the other transaction. Hence Wound-Wait.
The two algorithms are both fair in that a transaction will eventually succeed.
The Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.
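
The two rules can be written down as a small decision function. The sketch
below is a userspace illustration only; the enum and function names are
made up for this example and do not exist in the kernel.

```c
/* Hypothetical names for illustration; not the kernel implementation. */
enum ww_algo { WAIT_DIE, WOUND_WAIT };
enum ww_action { ACT_WAIT, ACT_DIE, ACT_WOUND };

/*
 * What a locking transaction does when it finds the mutex already held.
 * A lower stamp means an older transaction.
 */
static enum ww_action on_contention(enum ww_algo algo,
                                    unsigned long locker_stamp,
                                    unsigned long holder_stamp)
{
        int locker_is_older = locker_stamp < holder_stamp;

        if (algo == WAIT_DIE)
                return locker_is_older ? ACT_WAIT : ACT_DIE;

        /* Wound-Wait: an older locker wounds the holder, a younger one waits. */
        return locker_is_older ? ACT_WOUND : ACT_WAIT;
}
```

In this model ACT_WOUND only names the extra step: after wounding the holder,
the locking transaction still has to wait until the lock is released.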

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow suffix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:
  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);

struct obj {
        struct ww_mutex lock;
        /* obj data */
};

struct obj_entry {
        struct list_head head;
        struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is twice on the list. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        list_for_each_entry(entry, list, head) {
                if (entry->obj == res_obj) {
                        /* already locked via the slowpath below */
                        res_obj = NULL;
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }

        ww_acquire_done(ctx);
        return 0;

err:
        list_for_each_entry_continue_reverse(entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }
        ww_acquire_fini(ctx);

        return ret;
}

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code.

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry(entry, list, head) {
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        entry2 = entry;

                        list_for_each_entry_continue_reverse(entry2, list, head)
                                ww_mutex_unlock(&entry2->obj->lock);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* we lost out in a seqno race, lock and retry.. */
                        ww_mutex_lock_slow(&entry->obj->lock, ctx);

                        /*
                         * Move entry to the head of the list, this will point
                         * entry->next to the first unlocked entry,
                         * restarting the for loop.
                         */
                        list_del(&entry->head);
                        list_add(&entry->head, list);
                }
        }

        ww_acquire_done(ctx);
        return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj_entry *entry;

        list_for_each_entry(entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
}

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:
- They can handle lock-acquisition in any order which allows us to start walking
  a graph from a starting point and then iteratively discover new edges and
  lock down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:
- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't propagate
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here.

struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
        struct obj *entry, *temp;

        list_for_each_entry_safe(entry, temp, list, locked_list) {
                /*
                 * Need to do that before unlocking, since only the current
                 * lock holder is allowed to use the object.
                 */
                list_del(&entry->locked_list);
                ww_mutex_unlock(&entry->ww_mutex);
        }
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

retry:
        /* re-init loop start state */
        loop {  /* pseudocode: repeat until the graph walk is complete */
                /*
                 * magic code which walks over a graph and decides which
                 * objects to lock, setting obj to the next candidate
                 */

                ret = ww_mutex_lock(&obj->ww_mutex, ctx);
                if (ret == -EALREADY) {
                        /* we have that one already, get to the next object */
                        continue;
                }
                if (ret == -EDEADLK) {
                        __unlock_objs(list);

                        ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                        list_add(&obj->locked_list, list);
                        goto retry;
                }

                /* locked a new object, add it to the list */
                list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        __unlock_objs(list);
        ww_acquire_fini(ctx);
}

Method 4: Only lock a single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
API can be used with a NULL context.

Implementation Details
----------------------

Design:
  ww_mutex currently encapsulates a struct mutex; this means no extra overhead
  for normal mutex locks, which are far more common. As such there is only a
  small increase in code size if wound/wait mutexes are not used.

  We maintain the following invariants for the wait list:
  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.
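
  Invariant (1) can be illustrated with a small userspace model of the wait
  list. This is a sketch under simplifying assumptions, not the kernel's
  insertion code: struct waiter, struct wait_list and wait_list_add() are
  hypothetical, the list is a fixed-size array, and stamps are compared
  without handling counter wraparound.

```c
#include <stddef.h>
#include <string.h>

#define MAX_WAITERS 16

/* Hypothetical model of a wait-list entry; not the kernel's struct. */
struct waiter {
        int has_ctx;            /* 0 for a waiter without acquire context */
        unsigned long stamp;    /* only meaningful when has_ctx is set */
};

struct wait_list {
        struct waiter w[MAX_WAITERS];
        size_t n;
};

/*
 * Insert 'nw' so that waiters with a context stay sorted by stamp
 * (oldest first), while waiters without one simply queue at the tail.
 * Returns the index used, or -1 if the list is full.
 */
static int wait_list_add(struct wait_list *wl, struct waiter nw)
{
        size_t pos = wl->n;     /* default: tail (FIFO) */
        size_t i;

        if (wl->n == MAX_WAITERS)
                return -1;

        if (nw.has_ctx) {
                /* find the first waiter with a context and a later stamp */
                for (i = 0; i < wl->n; i++) {
                        if (wl->w[i].has_ctx && wl->w[i].stamp > nw.stamp) {
                                pos = i;
                                break;
                        }
                }
        }

        /* shift the tail up by one slot and place the new waiter */
        memmove(&wl->w[pos + 1], &wl->w[pos],
                (wl->n - pos) * sizeof(wl->w[0]));
        wl->w[pos] = nw;
        wl->n++;
        return (int)pos;
}
```

  Waiters without a context never move relative to each other in this model,
  which is what keeps them in FIFO order.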

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  The wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.
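
  The lazy check can be sketched as a userspace model. All names below are
  hypothetical and the model is single-threaded: whether a lock is contended
  is passed in as a parameter, and actually waiting for a contended lock is
  not simulated.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a Wound-Wait transaction; not kernel code. */
struct txn {
        bool wounded;   /* set by an older transaction that wants our lock */
        int acquired;   /* number of locks currently held */
};

/* Called on behalf of an older transaction; only marks the victim. */
static void txn_wound(struct txn *victim)
{
        victim->wounded = true;
}

/*
 * Lock attempt as seen by the transaction. The wounded flag is looked
 * at only when the lock is contended; an uncontended lock is taken
 * even if the transaction has been wounded in the meantime.
 */
static int txn_lock(struct txn *t, bool contended)
{
        if (contended && t->wounded)
                return -EDEADLK;        /* caller must back off */

        /* a real implementation would wait here if contended */
        t->acquired++;
        return 0;
}

/* Back off: drop all locks and clear the wounded status before retrying. */
static void txn_back_off(struct txn *t)
{
        t->acquired = 0;
        t->wounded = false;
}
```

  Because the -EDEADLK is reported on a specific contended lock, the caller
  knows which lock to wait for before restarting.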

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
  Special care has been taken to warn for as many cases of API abuse
  as possible. Some common API abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:
   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.
   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
implemented.