Documentation/arm/vlocks.txt

   1 vlocks for Bare-Metal Mutual Exclusion
   2 ======================================
   3
   4 Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
   5 mechanism, with reasonable but minimal requirements on the memory
   6 system.
   7
   8 These are intended to be used to coordinate critical activity among CPUs
   9 which are otherwise non-coherent, in situations where the hardware
  10 provides no other mechanism to support this and ordinary spinlocks
  11 cannot be used.
  12
  13
  14 vlocks make use of the atomicity provided by the memory system for
  15 writes to a single memory location.  To arbitrate, every CPU "votes for
  16 itself", by storing a unique number to a common memory location.  The
  17 final value seen in that memory location when all the votes have been
  18 cast identifies the winner.
  19
  20 In order to make sure that the election produces an unambiguous result
  21 in finite time, a CPU will only enter the election in the first place if
  22 no winner has been chosen and the election does not appear to have
  23 started yet.
  24
  25
  26 Algorithm
  27 ---------
  28
  29 The easiest way to explain the vlocks algorithm is with some pseudo-code:
  30
  31
  32         int currently_voting[NR_CPUS] = { 0, };
  33         int last_vote = -1; /* no votes yet */
  34
  35         bool vlock_trylock(int this_cpu)
  36         {
  37                 /* signal our desire to vote */
  38                 currently_voting[this_cpu] = 1;
  39                 if (last_vote != -1) {
  40                         /* someone already volunteered himself */
  41                         currently_voting[this_cpu] = 0;
  42                         return false; /* not ourself */
  43                 }
  44
  45                 /* let's suggest ourself */
  46                 last_vote = this_cpu;
  47                 currently_voting[this_cpu] = 0;
  48
  49                 /* then wait until everyone else is done voting */
  50                 for_each_cpu(i) {
  51                         while (currently_voting[i] != 0)
  52                                 /* wait */;
  53                 }
  54
  55                 /* result */
  56                 if (last_vote == this_cpu)
  57                         return true; /* we won */
  58                 return false;
  59         }
  60
  61         bool vlock_unlock(void)
  62         {
  63                 last_vote = -1;
  64         }
  65
  66
  67 The currently_voting[] array provides a way for the CPUs to determine
  68 whether an election is in progress, and plays a role analogous to the
  69 "entering" array in Lamport's bakery algorithm [1].
  70
  71 However, once the election has started, the underlying memory system
  72 atomicity is used to pick the winner.  This avoids the need for a static
  73 priority rule to act as a tie-breaker, or any counters which could
  74 overflow.
  75
  76 As long as the last_vote variable is globally visible to all CPUs, it
  77 will contain only one value that won't change once every CPU has cleared
  78 its currently_voting flag.
  79
  80
  81 Features and limitations
  82 ------------------------
  83
  84  * vlocks are not intended to be fair.  In the contended case, it is the
  85    _last_ CPU which attempts to get the lock which will be most likely
  86    to win.
  87
  88    vlocks are therefore best suited to situations where it is necessary
  89    to pick a unique winner, but it does not matter which CPU actually
  90    wins.
  91
  92  * Like other similar mechanisms, vlocks will not scale well to a large
  93    number of CPUs.
  94
  95    vlocks can be cascaded in a voting hierarchy to permit better scaling
  96    if necessary, as in the following hypothetical example for 4096 CPUs:
  97
  98         /* first level: local election */
  99         my_town = towns[(this_cpu >> 4) & 0xf];
 100         I_won = vlock_trylock(my_town, this_cpu & 0xf);
 101         if (I_won) {
 102                 /* we won the town election, let's go for the state */
 103                 my_state = states[(this_cpu >> 8) & 0xf];
 104                 I_won = vlock_lock(my_state, this_cpu & 0xf));
 105                 if (I_won) {
 106                         /* and so on */
 107                         I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
 108                         if (I_won) {
 109                                 /* ... */
 110                         }
 111                         vlock_unlock(the_whole_country);
 112                 }
 113                 vlock_unlock(my_state);
 114         }
 115         vlock_unlock(my_town);
 116
 117
 118 ARM implementation
 119 ------------------
 120
 121 The current ARM implementation [2] contains some optimisations beyond
 122 the basic algorithm:
 123
 124  * By packing the members of the currently_voting array close together,
 125    we can read the whole array in one transaction (providing the number
 126    of CPUs potentially contending the lock is small enough).  This
 127    reduces the number of round-trips required to external memory.
 128
 129    In the ARM implementation, this means that we can use a single load
 130    and comparison:
 131
 132         LDR     Rt, [Rn]
 133         CMP     Rt, #0
 134
 135    ...in place of code equivalent to:
 136
 137         LDRB    Rt, [Rn]
 138         CMP     Rt, #0
 139         LDRBEQ  Rt, [Rn, #1]
 140         CMPEQ   Rt, #0
 141         LDRBEQ  Rt, [Rn, #2]
 142         CMPEQ   Rt, #0
 143         LDRBEQ  Rt, [Rn, #3]
 144         CMPEQ   Rt, #0
 145
 146    This cuts down on the fast-path latency, as well as potentially
 147    reducing bus contention in contended cases.
 148
 149    The optimisation relies on the fact that the ARM memory system
 150    guarantees coherency between overlapping memory accesses of
 151    different sizes, similarly to many other architectures.  Note that
 152    we do not care which element of currently_voting appears in which
 153    bits of Rt, so there is no need to worry about endianness in this
 154    optimisation.
 155
 156    If there are too many CPUs to read the currently_voting array in
 157    one transaction then multiple transations are still required.  The
 158    implementation uses a simple loop of word-sized loads for this
 159    case.  The number of transactions is still fewer than would be
 160    required if bytes were loaded individually.
 161
 162
 163    In principle, we could aggregate further by using LDRD or LDM, but
 164    to keep the code simple this was not attempted in the initial
 165    implementation.
 166
 167
 168  * vlocks are currently only used to coordinate between CPUs which are
 169    unable to enable their caches yet.  This means that the
 170    implementation removes many of the barriers which would be required
 171    when executing the algorithm in cached memory.
 172
 173    packing of the currently_voting array does not work with cached
 174    memory unless all CPUs contending the lock are cache-coherent, due
 175    to cache writebacks from one CPU clobbering values written by other
 176    CPUs.  (Though if all the CPUs are cache-coherent, you should be
 177    probably be using proper spinlocks instead anyway).
 178
 179
 180  * The "no votes yet" value used for the last_vote variable is 0 (not
 181    -1 as in the pseudocode).  This allows statically-allocated vlocks
 182    to be implicitly initialised to an unlocked state simply by putting
 183    them in .bss.
 184
 185    An offset is added to each CPU's ID for the purpose of setting this
 186    variable, so that no CPU uses the value 0 for its ID.
 187
 188
 189 Colophon
 190 --------
 191
 192 Originally created and documented by Dave Martin for Linaro Limited, for
 193 use in ARM-based big.LITTLE platforms, with review and input gratefully
 194 received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
 195 grabbing most of this text out of the relevant mail thread and writing
 196 up the pseudocode.
 197
 198 Copyright (C) 2012-2013  Linaro Limited
 199 Distributed under the terms of Version 2 of the GNU General Public
 200 License, as defined in linux/COPYING.
 201
 202
 203 References
 204 ----------
 205
 206 [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
 207     Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
 208
 209     https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
 210
 211 [2] linux/arch/arm/common/vlock.S, www.kernel.org.