ctdb/doc/readonlyrecords.txt

   1 Read-Only locks in CTDB
   2 =======================
   3
   4 Problem
   5 =======
   6 CTDB currently only supports exclusive Read-Write locks for clients (samba) accessing
   7 the TDB databases.
   8 This mostly works well, but when very many clients access the same file at the same
   9 time, the exclusive lock, as well as the record itself, bounces rapidly between
  10 nodes, which limits scalability.
  11
  12 This primarily affects locking.tdb and brlock.tdb, two databases where record access is
  13 read-mostly and where writes are semi-rare.
  14
  15 For the common case, if CTDB provided shared non-exclusive Read-Only lock semantics
  16 this would greatly improve scaling for these workloads.
  17
  18
  19 Desired properties
  20 ==================
  21 We can not make backward incompatible changes to the ctdb_ltdb header for the records.
  22
  23 A Read-Only lock enabled ctdb daemon must be able to interoperate with a non-Read-Only
  24 lock enabled daemon.
  25
  26 Getting a Read-Only lock should not be slower than getting a Read-Write lock.
  27
  28 When revoking Read-Only locks for a record, this should involve only those nodes that
  29 currently hold a Read-Only lock and should avoid broadcasting opportunistic revocations.
  30 (must track which nodes are delegated to)
  31
  32 When a Read-Write lock is requested, if there are Read-Only locks delegated to other
  33 nodes, the DMASTER will defer the record migration until all read-only locks are first
  34 revoked (synchronous revoke).
  35
  36 Because revoking Read-Only locks adds cost to getting a Read-Write lock, the
  37 implementation should try to avoid creating Read-Only locks unless there is some indication
  38 that there is contention. This may mean that even if a client requests a Read-Only lock
  39 we might still provide a full Read-Write lock in order to avoid the cost of revoking
  40 the locks in some cases.
  41
  42 Read-Only locks require additional state to be stored in a separate database, containing
  43 information about which nodes have been delegated Read-Only locks.
  44 This database should be kept at minimal size.
  45
  46 Read-Only locks should not significantly complicate the normal record
  47 create/migration/deletion cycle for normal records.
  48
  49 Read-Only locks should not complicate the recovery process.
  50
  51 Read-Only locks should not complicate the vacuuming process.
  52
  53 As far as possible, we should avoid forking new child processes from the main daemon.
  54
  55 Client-side implementations (samba, libctdb, others) should be minimally impacted when
  56 Read-Only locks are implemented.
  57 Client-side implementation must be possible with only minor conditionals added to the
  58 existing lock-check-fetch-unlock loop that clients use today for Read-Write locks, so
  59 that clients need only a single loop that can handle both Read-Write locking and
  60 Read-Only locking. Clients should not need two nearly identical loops.
  61
  62
  63 Implementation
  64 ==============
  65
  66 Four new flags are allocated in the ctdb_ltdb record header.
  67 HAVE_DELEGATIONS, HAVE_READONLY_LOCK, REVOKING_READONLY and REVOKE_COMPLETE
  68
  69 HAVE_DELEGATIONS is a flag that can only be set on the node that is currently the
  70 DMASTER for the record. When set, this flag indicates that there are Read-Only locks
  71 delegated to other nodes in the cluster for this record.
  72
  73 HAVE_READONLY_LOCK is a flag that is only set on nodes that are NOT the DMASTER for the
  74 record. If set, this flag indicates that the local copy is an up-to-date Read-Only
  75 version of this record. A client that only needs to read, but not to write, the record
  76 can safely use the content of this record as is, regardless of the value of the DMASTER
  77 field of the record.
  78
  79 REVOKING_READONLY is a flag that is used while a set of read-only delegations is being
  80 revoked.
  81 This flag is only set when HAVE_DELEGATIONS is also set, and is cleared at the same time
  82 as HAVE_DELEGATIONS is cleared.
  83 In normal operation, the HAVE_DELEGATIONS flag is first set when the first
  84 delegation is generated. When the delegations are about to be revoked, the
  85 REVOKING_READONLY flag is set too.
  86 Once all delegations are revoked, both flags are cleared at the same time.
  87 While REVOKING_READONLY is set, any request for the record, either a normal request or
  88 a request for a readonly copy, will be deferred.
  89 Deferred requests are linked on a list of deferred requests until the
  90 revocation is completed.
  91 This flag is set by the main ctdb daemon when it starts revoking this record.
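
The ordering of the HAVE_DELEGATIONS and REVOKING_READONLY flags described above can be sketched as a small state model. This is only an illustration in Python, not the daemon's code: the bit values and the helper names (add_delegation, start_revoke, finish_revoke, must_defer) are assumptions made for the example.

```python
# Hypothetical bit values; the real ctdb_ltdb header constants may differ.
HAVE_DELEGATIONS = 0x1
REVOKING_READONLY = 0x4


def add_delegation(flags):
    """DMASTER hands out a read-only copy: HAVE_DELEGATIONS becomes set."""
    return flags | HAVE_DELEGATIONS


def start_revoke(flags):
    """REVOKING_READONLY may only be set while HAVE_DELEGATIONS is set."""
    if not flags & HAVE_DELEGATIONS:
        raise ValueError("no delegations to revoke")
    return flags | REVOKING_READONLY


def finish_revoke(flags):
    """Both flags are cleared in one step once all delegations are revoked."""
    return flags & ~(HAVE_DELEGATIONS | REVOKING_READONLY)


def must_defer(flags):
    """Any request, normal or readonly, is deferred while a revoke runs."""
    return bool(flags & REVOKING_READONLY)
```

In this model a record never has REVOKING_READONLY set without HAVE_DELEGATIONS, which matches the rule that the two flags are cleared together.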
  92
  93 REVOKE_COMPLETE
  94 The actual revoke of records is done by a child process, spawned from the main ctdb
  95 daemon when it starts the process to revoke the records.
  96 Once the child process has finished revoking all delegations it will set the flag
  97 REVOKE_COMPLETE for this record to signal to the main daemon that the record has been
  98 successfully revoked.
  99 At this stage the child process will also trigger an event in the main daemon, signalling that
 100 the revoke is complete and that the main daemon should start re-processing all deferred
 101 requests.
 102
 103
 104
 105 Once the revoke process is completed there will be at least one deferred request to
 106 access this record: the initial call for an exclusive fetch_lock() that
 107 triggered the revoke process.
 108 In addition to this deferred request there may be additional requests that
 109 became deferred while the revoke was in progress. These can be either exclusive
 110 fetch_lock() requests or readonly lock requests.
 111 Once the revoke is completed the main daemon will reprocess all exclusive fetch_lock()
 112 requests immediately and respond to these clients.
 113 Any readonly lock requests will be deferred for an additional period of
 114 time before they are re-processed.
 115 This gives the client that needs a fetch_lock() to update the record some
 116 time to access and work on the record without having to compete with the possibly
 117 very many readonly requests.
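
The reprocessing order described above could be modelled as follows. This is a hedged sketch: the Request tuple and split_deferred() are hypothetical names, and the length of the extra delay for readonly requests is not specified in this document, so the sketch only separates the two groups.

```python
from collections import namedtuple

# Hypothetical representation of a deferred ctdb_call.
Request = namedtuple("Request", ["client", "want_readonly"])


def split_deferred(deferred):
    """Split the deferred list once a revoke has completed.

    Returns (process_now, defer_again). Exclusive fetch_lock() requests
    are reprocessed immediately; readonly requests are held back for an
    additional period. Arrival order is preserved within each group.
    """
    process_now = [r for r in deferred if not r.want_readonly]
    defer_again = [r for r in deferred if r.want_readonly]
    return process_now, defer_again
```

The request that triggered the revoke is always an exclusive one, so process_now is never empty after a revoke in this model.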
 118
 119
 120
 121
 122
 123 The ctdb_db structure is expanded so that it contains one extra TDB database for each
 124 normal, non-persistent database.
 125 This new database is used for tracking delegations for the records.
 126 A record in the normal database that has HAVE_DELEGATIONS set will always have a
 127 corresponding record at the same key in the tracking database. This record contains the set of all nodes that
 128 the record is delegated to.
 129 This tracking database is lockless, using TDB_NOLOCK, and is only ever accessed by
 130 the main ctdbd daemon.
 131 The lockless nature and the fact that no other process ever accesses this TDB mean we
 132 are guaranteed non-blocking access to records in the tracking database.
 133
 134 The ctdb_call PDU is extended with a new flag WANT_READONLY and possibly also a new
 135 callid: CTDB_FETCH_WITH_HEADER_FUNC.
 136 This new function returns not only the record, as CTDB_FETCH_FUNC does, but also
 137 returns the full ctdb_ltdb record HEADER prepended to the record.
 138 This function is optional; clients that do not care what the header is can continue
 139 using just CTDB_FETCH_FUNC.
 140
 141
 142 The WANT_READONLY flag is used to request a read-only record from the DMASTER/LMASTER.
 143 If the record does not yet exist, this is returned as an error to the client and the
 144 client will retry the request loop.
 145
 146 A new control is added to make remote nodes remove the HAVE_READONLY_LOCK flag from a record
 147 and to invalidate any delegated readonly copies from the databases.
 148
 149
 150
 151 Client implementation
 152 =====================
 153 Clients today use a loop for record fetch lock that looks like this
 154     try_again:
 155         lock record in tdb
 156
 157         if record does not exist in tdb,
 158             unlock record
 159             ask ctdb to migrate record onto the node
 160             goto try_again
 161
 162         if record dmaster != this node pnn
 163             unlock record
 164             ask ctdb to migrate record onto the node
 165             goto try_again
 166
 167     finished:
 168
 169 where we basically spin, until the record is migrated onto the node and we have managed
 170 to pin it down.
 171
 172 This will change to something like
 173
 174     try_again:
 175         lock record in tdb
 176
 177         if record does not exist in tdb,
 178             unlock record
 179             ask ctdb to migrate record onto the node
 180             goto try_again
 181
 182         if record dmaster == current node pnn
 183             goto finished
 184
 185         if read-only lock
 186             if HAVE_READONLY_LOCK or HAVE_DELEGATIONS is set
 187                 goto finished
 188             else
 189                 unlock record
 190                 ask ctdb for read-only copy (WANT_READONLY[|WITH_HEADER])
 191                 if failed to get read-only copy (*A)
 192                     ask ctdb to migrate the record onto the node
 193                     goto try_again
 194                 lock record in tdb
 195                 goto finished
 196
 197         unlock record
 198         ask ctdb to migrate record onto the node
 199         goto try_again
 200
 201     finished:
 202
 203 If the record does not yet exist in the local TDB, we always perform a full fetch for a
 204 Read-Write lock even if only a Read-Only lock was requested.
 205 This means that for first access we always grab a Read-Write lock and thus upgrade any
 206 requests for Read-Only locks into a Read-Write request.
 207 This creates the record, migrates it onto the node and makes the local node become
 208 the DMASTER for the record.
 209
 210 Future references to this same record by the local samba daemons will still access/lock
 211 the record locally without triggering the creation of a Read-Only delegation, since the
 212 record is already hosted on the local node as DMASTER.
 213
 214 Only if the record is contended, i.e. it has been created and migrated onto the node but
 215 we are no longer the DMASTER for this record, will we create a
 216 Read-Only delegation.
 217 This heuristic provides a mechanism where we will not create Read-Only delegations until
 218 we have some indication that the record may be contended.
 219
 220 This avoids creating and revoking Read-Only delegations when only a single client is
 221 repeatedly accessing the same set of records.
 222 This also aims to limit the size of the tracking tdb.
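
The decision made on each pass through the new client loop can be sketched as a pure function. This is a model of the pseudocode above, not samba or libctdb code: the dictionary record layout, the flag bit values and next_action() are assumptions for illustration.

```python
# Hypothetical bit values; the real ctdb_ltdb header constants may differ.
HAVE_DELEGATIONS = 0x1
HAVE_READONLY_LOCK = 0x2


def next_action(record, my_pnn, want_readonly):
    """Decide what the client does after locking the record locally.

    record is None when the key is absent from the local tdb, otherwise a
    dict with 'dmaster' and 'flags'. Returns 'finished', 'migrate' or
    'request_readonly'.
    """
    if record is None:
        # First access always performs a full Read-Write fetch.
        return "migrate"
    if record["dmaster"] == my_pnn:
        return "finished"
    if want_readonly:
        if record["flags"] & (HAVE_READONLY_LOCK | HAVE_DELEGATIONS):
            return "finished"
        # If this request fails (*A), the client falls back to 'migrate'.
        return "request_readonly"
    return "migrate"
```

A single function covers both lock modes, in line with the goal that clients should not need two nearly identical loops.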
 223
 224
 225 Server implementation
 226 =====================
 227 When receiving a ctdb_call with the WANT_READONLY flag:
 228
 229 If this is the LMASTER for the record and the record does not yet exist, LMASTER will
 230 return an error back to the client (*A above) and the client will try to recover.
 231 In particular, LMASTER will not create a new record for this case.
 232
 233 If this is the LMASTER for the record and the record exists, the PDU will be forwarded to
 234 the DMASTER for the record.
 235
 236 If this node is not the DMASTER for this record, we forward the PDU back to the
 237 LMASTER, just as we always do today.
 238
 239 If this is the DMASTER for the record, we need to create a Read-Only delegation.
 240 This is done by
 241      lock record
 242      increase the RSN by one for this record
 243      set the HAVE_DELEGATIONS flag for the record
 244      write the updated record to the TDB
 245      create/update the tracking TDB and add this new node to the set of delegations
 246      send a modified copy of the record back to the requesting client.
 247          modifications are that RSN is decremented by one, so delegated records are "older" than on the DMASTER,
 248          it has HAVE_DELEGATIONS flag stripped off, and has HAVE_READONLY_LOCK added.
 249      unlock record
 250
 251 Important to note is that this does not trigger a record migration.
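
The delegation steps on the DMASTER can be modelled as below. This is a single-threaded sketch under stated assumptions: plain dictionaries stand in for the record TDB and the tracking TDB, record locking is omitted, and delegate_readonly() and the flag bit values are hypothetical.

```python
import copy

# Hypothetical bit values; the real ctdb_ltdb header constants may differ.
HAVE_DELEGATIONS = 0x1
HAVE_READONLY_LOCK = 0x2


def delegate_readonly(db, tracking, key, requester_pnn):
    """Create a Read-Only delegation for requester_pnn and return the reply."""
    rec = db[key]
    rec["rsn"] += 1                      # DMASTER copy becomes the newest
    rec["flags"] |= HAVE_DELEGATIONS     # updated record stays in the TDB
    tracking.setdefault(key, set()).add(requester_pnn)

    # The copy sent back is one RSN "older" than the DMASTER copy, has
    # HAVE_DELEGATIONS stripped and HAVE_READONLY_LOCK added.
    reply = copy.deepcopy(rec)
    reply["rsn"] -= 1
    reply["flags"] = (reply["flags"] & ~HAVE_DELEGATIONS) | HAVE_READONLY_LOCK
    return reply
```

The dmaster field of the stored record is left untouched, which reflects that a delegation does not migrate the record.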
 252
 253
 254 When receiving a ctdb_call without the WANT_READONLY flag:
 255
 256 If this is the DMASTER for the record, this might trigger a migration. If there exist
 257 delegations we must first revoke these before allowing the Read-Write request to
 258 proceed. So,
 259 if the record has HAVE_DELEGATIONS set, we create a child process and defer processing
 260 of this PDU until the child process has completed.
 261
 262 From the child process we will call out to all nodes that have delegations for this
 263 record and tell them to invalidate this record by clearing the HAVE_READONLY_LOCK flag from
 264 the record.
 265 Once all delegated nodes respond back, the child process signals back to the main daemon
 266 that the revoke has completed. (The child process may not access the tracking tdb since it is
 267 lockless.)
 268
 269 Main process is triggered to re-process the PDU once the child process has finished.
 270 Main daemon deletes the corresponding record in the tracking database, clears the
 271 HAVE_DELEGATIONS flag for the record and then proceeds to perform the migration as usual.
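
The division of work between the main daemon and the child process might look like the following model. It is a sketch only: no real process is forked, dictionaries stand in for the databases on each node, and it assumes that the main daemon reads the node list from the tracking database and hands it to the child, since the child may not access that database itself. The function names and flag bit values are hypothetical.

```python
# Hypothetical bit values; the real ctdb_ltdb header constants may differ.
HAVE_DELEGATIONS = 0x1
HAVE_READONLY_LOCK = 0x2


def child_revoke(remote_dbs, key, nodes):
    """Stand-in for the child: tell every delegated node to invalidate."""
    for pnn in nodes:
        rec = remote_dbs[pnn].get(key)
        if rec is not None:
            rec["flags"] &= ~HAVE_READONLY_LOCK
    return True  # signals "revoke complete" back to the main daemon


def handle_write_request(local_db, tracking, remote_dbs, key):
    """Main daemon: revoke delegations before a Read-Write request proceeds."""
    rec = local_db[key]
    if rec["flags"] & HAVE_DELEGATIONS:
        # Assumption: only the main daemon touches the tracking database.
        nodes = sorted(tracking.get(key, ()))
        child_revoke(remote_dbs, key, nodes)
        tracking.pop(key, None)
        rec["flags"] &= ~HAVE_DELEGATIONS
    return rec  # the usual migration handling would continue from here
```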
 272
 273 When receiving a ctdb_call without the flag we want all delegations to be revoked,
 274 so we must take care that the delegations are revoked unconditionally before we even
 275 check if we are already the DMASTER (in which case this ctdb_call would normally just
 276 be a no-op (*B below)).
 277
 278
 279
 280 Recovery process changes
 281 ========================
 282 A recovery implicitly clears/revokes any read only records and delegations from all
 283 databases.
 284
 285 During delegation of Read-Only locks, the RSN is handled in such a way that
 286 delegated records always have an RSN that is smaller than the RSN of the copy
 287 held by the DMASTER.
 288
 289 During recoveries we do not need to take any special action other than always picking
 290 the copy of the record that has the highest RSN, which is what we already do today.
 291
 292 During the recovery process, we strip all flags off all records while writing the new
 293 content of the database during the PUSH_DB control.
 294
 295 During processing of the PUSH_DB control and once the new database has been written we
 296 then also wipe the tracking database.
 297
 298 This makes changes to the recovery process minimal and nonintrusive.
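
The effect of the RSN rule on recovery can be illustrated with a small model. This is a sketch, not the recovery daemon's logic: recover() is a hypothetical helper, dictionaries stand in for the databases, and only the rsn, data and flags fields are modelled.

```python
def recover(copies_by_key, tracking):
    """Rebuild a database from the copies of each record found on the nodes.

    copies_by_key maps a key to the list of copies collected from all
    nodes. The copy with the highest RSN wins, all flags are stripped,
    and the tracking database is wiped afterwards.
    """
    new_db = {}
    for key, copies in copies_by_key.items():
        winner = max(copies, key=lambda rec: rec["rsn"])
        new_db[key] = {"rsn": winner["rsn"], "data": winner["data"], "flags": 0}
    tracking.clear()
    return new_db
```

Because a delegated copy is handed out with an RSN one lower than the DMASTER copy, it can never be selected over the DMASTER copy in this model.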
 299
 300
 301
 302 Vacuuming process
 303 =================
 304 Vacuuming needs only minimal changes.
 305
 306
 307 When vacuuming runs, it will do a fetch_lock to migrate any remote records back onto the
 308 LMASTER before the record can be purged. This will automatically force all delegations
 309 for that record to be revoked before the record is migrated back onto the LMASTER.
 310 This handles the case where LMASTER is not the DMASTER for the record that will be
 311 purged.
 312 The migration in this case does force any delegations to be revoked before the
 313 vacuuming takes place.
 314
 315 Missing is the case when delegations exist and the LMASTER is also the DMASTER.
 316 For this case we need to change the vacuuming to unconditionally do a
 317 fetch_lock when HAVE_DELEGATIONS is set, even if the record is already stored locally.
 318 (*B)
 319 This fetch_lock does not have the WANT_READONLY flag set, so it will still force
 320 the delegations to be revoked, but it will not cause the ctdb daemon to perform
 321 any migration.
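
The changed vacuuming condition can be sketched as a single predicate. The function name, record layout and flag bit value are assumptions for illustration only.

```python
# Hypothetical bit value; the real ctdb_ltdb header constant may differ.
HAVE_DELEGATIONS = 0x1


def vacuum_needs_fetch_lock(record, lmaster_pnn):
    """Return True if vacuuming must do a fetch_lock before purging.

    A fetch_lock is needed when the record lives on another node, and
    also when the LMASTER is the DMASTER but delegations still exist (*B).
    """
    if record["dmaster"] != lmaster_pnn:
        return True
    return bool(record["flags"] & HAVE_DELEGATIONS)
```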
 322
 323
 324 Traversal process
 325 =================
 326 Traversal process is changed to ignore any records with the HAVE_READONLY_LOCK flag set.
 327
 328
 329 Forward/Backward Compatibility
 330 ==============================
 331 Non-readonly locking daemons must be able to interoperate with readonly locking enabled daemons.
 332
 333 Non-readonly enabled daemons fetching records from Readonly enabled daemons:
 334 Non-readonly enabled daemons do not know about, and never set, the WANT_READONLY flag, so these daemons will always request a full migration for a full fetch-lock for all records. Thus a request from a non-readonly enabled daemon will always cause any existing delegations to be immediately revoked. Access will work but performance may be harmed since there will be a lot of revoking of delegations.
 335
 336 Readonly enabled daemons fetching records with WANT_READONLY from non-readonly enabled daemons:
 337 Non-readonly enabled daemons ignore the WANT_READONLY flag and never return delegations. They always return a full record migration.
 338 Full record migration is allowed by the protocol, even if the originator only requests the 'hint' WANT_READONLY,
 339 so this access also interoperates between daemons with different capabilities.
 340
 341
 342
 343