ctdb-recoverd: Call an election when the recovery lock is lost
author Martin Schwenke <martin@meltin.net>
Thu, 8 Nov 2018 04:49:30 +0000 (15:49 +1100)
committer Amitay Isaacs <amitay@gmail.com>
Tue, 18 Dec 2018 02:36:47 +0000 (13:36 +1100)
The lock may have been lost due to a failure in the underlying locking
mechanism.  This could be due to quorum loss or similar.  It is best
to call an election to confirm that this node should still be master.
At worst, the node will reelect itself, fail to take the lock and then
ban itself.  This is a suitable outcome for a node that has been
partitioned from others in the cluster.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
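
The outcomes described in the commit message can be sketched as a small
decision function. This is an illustrative model only, not CTDB code:
the names sim_node, sim_lost_reclock and the OUTCOME_* values are
hypothetical, and the election and locking mechanism are reduced to two
boolean inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes for a master node whose recovery lock was lost. */
enum node_outcome {
	OUTCOME_STILL_MASTER, /* re-elected and the lock was retaken */
	OUTCOME_NOT_MASTER,   /* another node won the election */
	OUTCOME_BANNED        /* re-elected, but could not take the lock */
};

/* Hypothetical inputs: these stand in for the real election and lock. */
struct sim_node {
	bool wins_election;  /* does this node win the forced election? */
	bool lock_available; /* can the locking mechanism grant the lock? */
};

/*
 * Model of the behaviour after this change: losing the lock triggers an
 * election instead of an immediate attempt to retake the lock.
 */
enum node_outcome sim_lost_reclock(const struct sim_node *n)
{
	assert(n != NULL);

	/* Step 1: the election decides whether this node is still master. */
	if (!n->wins_election) {
		return OUTCOME_NOT_MASTER;
	}

	/* Step 2: an elected master must hold the recovery lock. */
	if (n->lock_available) {
		return OUTCOME_STILL_MASTER;
	}

	/* Worst case from the commit message: re-elected, no lock, banned. */
	return OUTCOME_BANNED;
}
```

In this model, a partitioned node that elects itself but cannot reach the
locking mechanism (wins_election = true, lock_available = false) ends up
in OUTCOME_BANNED, which matches the "at worst" case above.
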
ctdb/server/ctdb_recoverd.c

index f000538..578127a 100644
@@ -915,20 +915,19 @@ static void take_reclock_handler(char status,
        s->locked = (status == '0') ;
 }
 
-static bool ctdb_recovery_lock(struct ctdb_recoverd *rec);
+static void force_election(struct ctdb_recoverd *rec,
+                          uint32_t pnn,
+                          struct ctdb_node_map_old *nodemap);
 
 static void lost_reclock_handler(void *private_data)
 {
        struct ctdb_recoverd *rec = talloc_get_type_abort(
                private_data, struct ctdb_recoverd);
 
-       DEBUG(DEBUG_ERR,
-             ("Recovery lock helper terminated unexpectedly - "
-              "trying to retake recovery lock\n"));
+       D_ERR("Recovery lock helper terminated, triggering an election\n");
        TALLOC_FREE(rec->recovery_lock_handle);
-       if (! ctdb_recovery_lock(rec)) {
-               DEBUG(DEBUG_ERR, ("Failed to take recovery lock\n"));
-       }
+
+       force_election(rec, ctdb_get_pnn(rec->ctdb), rec->nodemap);
 }
 
 static bool ctdb_recovery_lock(struct ctdb_recoverd *rec)