doc/recovery-process.txt

   1 Valid as of 1.0.66, may/will change in the future
   2
   3
   4 RECMASTER
   5 =========
   6 Recovery Master, this is one of the nodes in the cluster that has been designated to
   7 be the "recovery master".
   8 The recovery master is responsible for performing full checks of cluster and cluster node consistency and is also responsible for performing the actual database recovery procedure.
   9
  10 Only one node at a time can be the recovery master.
  11 This is ensured by CTDB using a lock on a single file in the shared gpfs filesystem:
  12   /etc/sysconfig/ctdb :
  13   ...
  14   # Options to ctdbd. This is read by /etc/init.d/ctdb
  15   # you must specify the location of a shared lock file across all the
  16   # nodes. This must be on shared storage
  17   # there is no default here
  18   CTDB_RECOVERY_LOCK=/gpfs/.ctdb/shared
  19   ...
  20
  21 In order to prevent that two nodes become recovery master at the same time (==split brain)
  22 CTDB here relies on GPFS that GPFS will guarantee coherent locking across the cluster.
  23 Thus CTDB relies on that GPFS MUST only allow one ctdb process on one node to take out and
  24 hold this lock.
  25
  26 The recovery master is designated through an election process.
  27
  28
  29 VNNMAP
  30 ======
  31 The VNNMAP is a list of all nodes in the cluster that is currently part of the cluster
  32 and participates in hosting the cluster databases.
  33 All nodes that are CONNECTED but not BANNED be present in the VNNMAP.
  34
  35 The VNNMAP is the list of LMASTERS for the cluster as reported by 'ctdb status' "
  36   ...
  37   Size:3
  38   hash:0 lmaster:0
  39   hash:1 lmaster:1
  40   hash:2 lmaster:2
  41   ...
  42
  43
  44 CLUSTER MONITORING
  45 ==================
  46 All nodes in the cluster monitor its own health and its own consistency regards to the
  47 recovery master. How and what the nodes monitor for differs between the node which is
  48 the recovery master and normal nodes.
  49 This monitoring it to ensure that the cluster is healthy and consistent.
  50 This is not related to monitoring of inidividual node health, a.k.a. eventscript monitroing.
  51
  52 At the end of each step in the process are listed some of the most common and important
  53 error messages that can be generated during that step.
  54
  55
  56 NORMAL NODE CLUSTER MONITORING
  57 ------------------------------
  58 Monitoring is performed in the dedicated recovery daemon process.
  59 The implementation can be found in server/ctdb_recoverd.c:monitor_cluster()
  60 This is an overview of the more important tasks during monitoring.
  61 These tests are to verify that the local node is consistent with the recovery master.
  62
  63 Once every second the following monitoring loop is performed :
  64
  65 1, Verify that the parent ctdb daemon on the local node is still running.
  66    If it is not, the recovery daemon logs an error and terminates.
  67    "CTDB daemon is no longer available. Shutting down recovery daemon"
  68
  69 2, Check if any of the nodes has been recorded to have misbehaved too many times.
  70    If so we ban the node and log a message :
  71    "Node %u has caused %u failures in %.0f seconds - banning it for %u seconds"
  72
  73 3, Check that there is a recovery master.
  74    If not we initiate a clusterwide election process and log :
  75    "Initial recovery master set - forcing election"
  76    and we restart monitoring from 1.
  77
  78 4, Verify that recovery daemon and the local ctdb daemon agreed on all the
  79    node BANNING flags.
  80    If the recovery daemon and the local ctdb daemon disagrees on these flags we update
  81    the local ctdb daemon, logs one of two messages and restarts monitoring from 1 again.
  82    "Local ctdb daemon on recmaster does not think this node is BANNED but the recovery master disagrees. Unbanning the node"
  83    "Local ctdb daemon on non-recmaster does not think this node is BANNED but the recovery master disagrees. Re-banning the node"
  84
  85 5, Verify that the node designated to be recovery master exists in the local list of all nodes.
  86    If the recovery master is not in the list of all cluster nodes a new recovery master
  87    election is triggered and monitoring restarts from 1.
  88    "Recmaster node %u not in list. Force reelection"
  89
  90 6, Check if the recovery master has become disconnected.
  91    If is has, log an error message, force a new election and restart monitoring from 1.
  92    "Recmaster node %u is disconnected. Force reelection"
  93
  94 7, Read the node flags off the recovery master and verify that it has not become banned.
  95    If is has, log an error message, force a new election and restart monitoring from 1.
  96    "Recmaster node %u no longer available. Force reelection"
  97
  98 8, Verify that the recmaster and the local node agrees on the flags (BANNED/DISABLED/...)
  99    for the local node.
 100    If there is an inconsistency, push the flags for the local node out to all other nodes.
 101    "Recmaster disagrees on our flags flags:0x%x recmaster_flags:0x%x  Broadcasting out flags."
 102
 103 9, Verify that the local node hosts all public ip addresses it should host and that it does
 104    NOT host any public addresses it should not host.
 105    If there is an inconsistency we log an error, trigger a recovery to occur and restart
 106    monitoring from 1 again.
 107    "Public address '%s' is missing and we should serve this ip"
 108    "We are still serving a public address '%s' that we should not be serving."
 109
 110 These are all the checks we perform during monitoring for a normal node.
 111 These tests are performed on all nodes in the cluster which is why it is optimized to perform
 112 as few network calls to other nodes as possible.
 113 Each node only performs 1 call to the recovery master in each loop and to no other nodes.
 114
 115 NORMAL NODE CLUSTER MONITORING
 116 ------------------------------
 117 The recovery master performs a much more extensive test. In addition to tests 1-9 above
 118 the recovery master also performs the following tests:
 119
 120 10, Read the list of nodes and flags from all other CONNECTED nodes in the cluster.
 121     If there is a failure to read this list from one of the nodes, then log an
 122     error, mark this node as a candidate to become BANNED and restart monitoring from 1.
 123     "Unable to get nodemap from remote node %u"
 124
 125 11, Verify that the local recovery master and the remote node agrees on the flags
 126     for the remote node. If there is a inconsistency for the BANNING flag,
 127     log an error, trigger a new recmaster election and restart monitoring from 1.
 128     "Remote node %u had different BANNED flags 0x%x, local had 0x%x - trigger a re-election"
 129     "Remote node %u had flags 0x%x, local had 0x%x - updating local"
 130
 131 12, Verify that the local recovery master and the remote node agrees on the flags
 132     for the remote node. If one of the flags other than the BANNING flag was inconsistent,
 133     just update the set of flags for the local recovery daemon, log an information message
 134     and continue monitoring.
 135     "Remote node %u had flags 0x%x, local had 0x%x - updating local"
 136
 137 13, Read the list of public ip addresses from all of the CONNECTED nodes and merge into a
 138     single clusterwide list.
 139     If we fail to read the list of ips from a node, log an error and restart monitoring from 1.
 140     "Failed to read public ips from node : %u"
 141
 142 14, Verify that all other nodes agree that this node is the recovery master.
 143     If one of the other nodes discgrees this is the recovery master, log an error,
 144     force a new election and restart monitoring from 1.
 145     "Node %d does not agree we are the recmaster. Need a new recmaster election"
 146
 147 15, Check if the previous attempt to run a recovery failed, and if it did, try a new recovery.
 148     After the recovery, restart monitoring from 1.
 149     "Starting do_recovery"
 150
 151 16, Verify that all CONNECTED nodes in the cluster are in recovery mode NORMAL.
 152     If one of the nodes were in recovery mode ACTIVE, force a new recovery and restart
 153     monitoring from 1.
 154     "Node:%u was in recovery mode. Restart recovery process"
 155
 156 17, Verify that the filehandle to the recovery lock file is valid.
 157     If it is not, this may mean a split brain and is a critical error.
 158     Try a new recovery and restart monitoring from 1.
 159     "recovery master doesn't have the recovery lock"
 160
 161 18, Verify that GPFS allows us to read from the recovery lock file.
 162     If not there is a critical GPFS issue and we may have a split brain.
 163     Try forcing a new recovery and restart monitoring from 1.
 164     "failed read from recovery_lock_fd - %s"
 165
 166 19, Read the list of all nodes and flags from all CONNECTED nodes in the cluster.
 167     If fail to read the nodemap from one of the remote nodes, log an error and restart
 168     monitoring from 1.
 169     "Unable to get nodemap from remote node %u"
 170
 171 20, If the nodemap differs between the local node and the remote node, log an error
 172     and force a recovery.
 173     This would happen if the /etc/ctdb/nodes file differs across nodes in the cluster.
 174     It is unlikely that the recovery will rectify the situation.
 175     This is a critical error, it is most likely the entire cluster will be unavailable
 176     until the files are fixed or have became banned.
 177     "Remote node:%u has different node count. %u vs %u of the local node"
 178
 179 21, If a remote node disagrees on the content of the nodes list, try a recovery and restart
 180     monitoring from 1.
 181     It is unlikely that the recovery will rectify the situation.
 182     This is a critical error, it is most likely the entire cluster will be unavailable
 183     until the files are fixed or have became banned.
 184     "Remote node:%u has different nodemap pnn for %d (%u vs %u)."
 185
 186 22, If a remote node disgrees on the node flags in the list, try a recovery to re-sync
 187     the flags and restart monitoring from 1.
 188     "Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)"
 189
 190 23, Verify that all active nodes are part of the VNNMAP.
 191     If not, this would be a new node that has become CONNECTED but does not yet participate
 192     in the cluster.
 193     Perform a recovery to merge the new node to the cluster and restart monitoring from 1.
 194     "The vnnmap count is different from the number of active nodes. %u vs %u"
 195     or
 196     "Node %u is active in the nodemap but did not exist in the vnnmap"
 197
 198 24, Read the VNNMAP from all CONNECTED nodes.
 199     Verify that all nodes have the same VNNMAP content and that all nodes are in the same
 200     generation instance of the databases.
 201     If not, force a recovery to re-synchronize the vnnmap and the databases across the cluster
 202     and restart monitoring from 1.
 203     "Remote node %u has different generation of vnnmap. %u vs %u (ours)"
 204     "Remote node %u has different size of vnnmap. %u vs %u (ours)"
 205     "Remote node %u has different vnnmap."
 206
 207 25, If there has been changes to the cluster that requires a reallocation of public ip
 208     addresses. On all nodes run the "startrecovery" event. Run "releaseip" and "takeip"
 209     events to reassign the ips across the cluster and finally run the "recovered" event.
 210
 211 Finished monitoring, continue monitoring from 1.
 212
 213
 214 CLUSTER RECOVERY
 215 ================
 216 Recoveries are driven by the recovery daemon on the node that is currently the recovery
 217 master.
 218 Most of the logging that is performed during recovery is only logged on the node that
 219 is the recovery master.
 220 Make sure to find which node is the recovery master and check the log for that node.
 221
 222 Example log entries that start in column 1 are expected to be present in the
 223 log. Example log entries that are indented 3 columns are optional and might
 224 only be present if an error occured.
 225
 226
 227 1, Log that recovery has been initiated.
 228 "Starting do_recovery"
 229
 230    It might log an informational message :
 231 "New recovery culprit %u".
 232    This is only semi-accurate and might may not mean that there is any problem
 233    at all with the node indicated.
 234
 235
 236 2, Check if a node has caused too many failed recoveries and if so ban it from
 237    the cluster, giving the other nodes in the cluster a chance to recovery
 238    operation.
 239    "Node %u has caused %u recoveries in %.0f seconds - banning it for %u seconds"
 240
 241
 242 3, Verify that the recovery daemon can lock the recovery lock file.
 243    At this stage this should be recovery master.
 244    If this operation fails it means we have a split brain and have to abort recovery.
 245    "("ctdb_recovery_lock: Unable to open %s - (%s)"
 246    "ctdb_recovery_lock: Failed to get recovery lock on '%s'"
 247    "Unable to get recovery lock - aborting recovery"
 248 "ctdb_recovery_lock: Got recovery lock on '%s'"
 249
 250
 251 4, Log which node caused the recovery to be initiated.
 252    This is a semi-accurate information message only.
 253    This line does NOT mean that there has to be something wrong with the node listed.
 254 "Recovery initiated due to problem with node %u"
 255
 256
 257 5, Pull the names of all databases from all nodes and verify that these databases also
 258    exists locally.
 259    If a database is missing locally, just create it.
 260    It is not an error if a database is missing locally. Databases are created on demand and
 261    this could happen if it was one database which samba has never tried to access on the
 262    local node.
 263
 264
 265 6, Check the list of databases on each remote node and create any databases that may be missing
 266    on the remote node.
 267 "Recovery - created remote databases"
 268
 269
 270 7, Set recovery mode to ACTIVE on all remote nodes.
 271
 272
 273 8, Run the "startrecovery" eventscript on all nodes.
 274
 275    At this stage you will also get a few additional log entries, these are not
 276    from the recovery daemon but from the main ctdb daemon due to running
 277    the eventscript :
 278 "startrecovery eventscript has been invoked"
 279 "Monitoring has been disabled"
 280 "Executing event script ...
 281 ...
 282
 283
 284 9, Create a new generation id and update the generation id and the VNNMAP on the local node
 285    only.
 286    This guarantees that the generation id will now be inconsistent across the cluster and
 287    that if recovery fails a new recovery is attempted in the next iteration of the monitoring
 288    loop.
 289
 290
 291 10, Start a TDB TRANSACTION on all nodes for all databases.
 292    This is to ensure that if recovery is aborted or fails that we do not
 293    modify any databases on only some of the nodes.
 294 "started transactions on all nodes"
 295
 296
 297 11, For each database, pull the content from all CONNECTED nodes and merge it into
 298     the TDB databases on the local node.
 299     This merges the records from the remote nodes based on their serial numbers so we
 300     only keep the most recent record found.
 301 "Recovery - pulled remote database 0x%x"
 302
 303
 304 12, For each database, perform a fast TDB WIPE operation to delete the entire TDB under the
 305     transaction started above.
 306
 307
 308 13, For each database, drop all empty records.
 309     Force the DMASTER field of all records to point to the recovery master.
 310     Push the database out to all other nodes.
 311
 312     The PUSH process lists some additional log entries for each database of the
 313     form :
 314 "Recovery - pulled remote database 0x..."
 315 "Recovery - pushed remote database 0x... of size ..."
 316
 317
 318 14, Commit all changes to all TDB databases.
 319 "Recovery - starting database commits"
 320 "Recovery - committed databases"
 321
 322
 323 15, Create a new VNNMAP of all CONNECTED nodes, create a new generation number
 324     and piush this new VNNMAP out to all nodes.
 325 "Recovery - updated vnnmap"
 326
 327
 328 16, Update all nodes that the local node is the recovery master.
 329 "Recovery - updated recmaster"
 330
 331
 332 17, synchronize node flags across the cluster.
 333 "Recovery - updated flags"
 334
 335 18, Change recovery mode back to NORMAL.
 336 "Recovery - disabled recovery mode"
 337
 338
 339 19, Re-allocate all public ip addresses across the cluster.
 340 "Deterministic IPs enabled. Resetting all ip allocations"
 341
 342     If the IP address allocation on the local node changes you might get
 343     "Takeover of IP 10.0.0.201/24 on interface eth0"
 344     "Release of IP 10.0.0.204/24 on interface eth0"
 345
 346 "Recovery - takeip finished"
 347
 348
 349 20, Run the "recovered" eventscript on all nodes.
 350 "Recovery - finished the recovered event"
 351
 352     You will also get an entry from the local ctdb daemon itself that it has
 353     switched back to recovery mode NORMAL.
 354 "Recovery has finished"
 355
 356
 357 21, Broadcast a message to all samba daemons in the cluster that the databases have been
 358     recovered. Samba will now do some additional checking/cleanup of the content in the stored
 359     records.
 360
 361 "Recovery complete"
 362
 363
 364 22. Finished. At this stage a 10 second timeout (ctdb listvars : rerecoverytimeout) is
 365     initiated. The cluster will not allow a new recovery to be performed until this timeout
 366     has expired.
 367
 368 "New recoveries supressed for the rerecovery timeout"
 369 "Rerecovery timeout elapsed. Recovery reactivated."
 370
 371
 372
 373
 374
 375
 376
 377 Example : RECOVERY LOG ON RECMASTER
 378 ====================================
 379 2008/12/01 09:57:28.110732 [ 4933]: 10.0.0.21:4379: node 10.0.0.24:4379 is dead:
 380  2 connected
 381 2008/12/01 09:57:28.110838 [ 4933]: Tearing down connection to dead node :3
 382 2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:2682 The vnnmap count
 383  is different from the number of active nodes. 4 vs 3
 384 2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1327 Starting do_reco
 385 very
 386 2008/12/01 09:57:28.967297 [ 4935]: ctdb_recovery_lock: Got recovery lock on '/g
 387 pfs/.ctdb/shared'
 388 2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1355 Recovery initiat
 389 ed due to problem with node 0
 390 2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1381 Recovery - creat
 391 ed remote databases
 392 2008/12/01 09:57:28.973543 [ 4933]: server/ctdb_recover.c:589 Recovery mode set
 393 to ACTIVE
 394 2008/12/01 09:57:28.974823 [ 4933]: server/ctdb_recover.c:904 startrecovery even
 395 tscript has been invoked
 396 2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1431 started transact
 397 ions on all nodes
 398 2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 399 d remote database 0x42fe72c5
 400 2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 401 d remote database 0x42fe72c5 of size 0
 402 2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 403 d remote database 0x1421fb78
 404 2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 405 d remote database 0x1421fb78 of size 0
 406 2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 407 d remote database 0xc0bdde6a
 408 2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 409 d remote database 0xc0bdde6a of size 0
 410 2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 411 d remote database 0x17055d90
 412 2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 413 d remote database 0x17055d90 of size 8
 414 2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 415 d remote database 0x7bbbd26c
 416 2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 417 d remote database 0x7bbbd26c of size 1
 418 2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 419 d remote database 0xf2a58948
 420 2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 421 d remote database 0xf2a58948 of size 51
 422 2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 423 d remote database 0x92380e87
 424 2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 425 d remote database 0x92380e87 of size 17
 426 2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 427 d remote database 0x63501287
 428 2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 429 d remote database 0x63501287 of size 1
 430 2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 431 d remote database 0xe98e08b6
 432 2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 433 d remote database 0xe98e08b6 of size 4
 434 2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 435 d remote database 0x2672a57f
 436 2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 437 d remote database 0x2672a57f of size 28
 438 2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulle
 439 d remote database 0xb775fff6
 440 2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushe
 441 d remote database 0xb775fff6 of size 6
 442 2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1440 Recovery - start
 443 ing database commits
 444 2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1452 Recovery - commi
 445 tted databases
 446 2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1502 Recovery - updat
 447 ed vnnmap
 448 2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1511 Recovery - updat
 449 ed recmaster
 450 2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1522 Recovery - updat
 451 ed flags
 452 2008/12/01 09:57:29.305235 [ 4933]: server/ctdb_recover.c:589 Recovery mode set
 453 to NORMAL
 454 2008/12/01 09:57:29.307245 [ 4935]: server/ctdb_recoverd.c:1531 Recovery - disab
 455 led recovery mode
 456 2008/12/01 09:57:29.307245 [ 4935]: Deterministic IPs enabled. Resetting all ip
 457 allocations
 458 2008/12/01 09:57:29.311071 [ 4933]: takeoverip called for an ip '10.0.0.201' tha
 459 t is not a public address
 460 2008/12/01 09:57:29.311186 [ 4933]: takeoverip called for an ip '10.0.0.202' tha
 461 t is not a public address
 462 2008/12/01 09:57:29.311204 [ 4933]: takeoverip called for an ip '10.0.0.203' tha
 463 t is not a public address
 464 2008/12/01 09:57:29.311299 [ 4933]: takeoverip called for an ip '10.0.0.204' tha
 465 t is not a public address
 466 2008/12/01 09:57:29.537210 [ 4935]: server/ctdb_recoverd.c:1542 Recovery - takei
 467 p finished
 468 2008/12/01 09:57:29.545404 [ 4933]: Recovery has finished
 469 2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1551 Recovery - finis
 470 hed the recovered event
 471 2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1557 Recovery complet
 472 e
 473 2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1565 New recoveries s
 474 upressed for the rerecovery timeout
 475 2008/12/01 09:57:39.815648 [ 4935]: server/ctdb_recoverd.c:1567 Rerecovery timeo
 476 ut elapsed. Recovery reactivated.
 477
 478
 479
 480
 481
 482
 483
 484