Writing CTDB cluster mutex helpers
==================================

CTDB uses cluster-wide mutexes to protect against a "split brain",
which could occur if the cluster becomes partitioned due to network
failure or similar.

CTDB uses a cluster-wide mutex for its "recovery lock", which is used
to ensure that only one database recovery can happen at a time. For
an overview of recovery lock configuration see the RECOVERY LOCK
section in ctdb(7). CTDB tries to ensure correct operation of the
recovery lock by attempting to take the recovery lock when CTDB knows
that it should already be held.

By default, CTDB uses a supplied mutex helper that uses a fcntl(2)
lock on a specified file in the cluster filesystem.

However, a user supplied mutex helper can be used as an alternative.
The rest of this document describes the API for mutex helpers.

A mutex helper is an external executable
----------------------------------------

A mutex helper is an external executable that can be run by CTDB.
There are no CTDB-specific compilation dependencies. This means that
a helper could easily be scripted around existing commands.

Mutex helpers are run relatively rarely and are not time critical.
Therefore, reliability is preferred over high performance.

Taking a mutex with a helper
----------------------------

1. Helper is executed with helper-specific arguments

2. Helper attempts to take mutex

3. On success, the helper writes ASCII 0 to standard output

4. Helper stays running, holding mutex, awaiting termination by CTDB

5. When a helper receives SIGTERM it must release any mutex it is
   holding and then exit.

Status codes
------------

CTDB ignores the exit code of a helper. Instead, CTDB reacts to a
single ASCII character that is sent to it via a helper's standard
output.

Valid status codes are:

0 - The helper took the mutex and is holding it, awaiting termination.

1 - The helper was unable to take the mutex due to contention.

2 - The helper took too long to take the mutex.
    Helpers do not need to implement this status code. CTDB
    already implements any required timeout handling.

3 - An unexpected error occurred.

If a 0 status code is sent then the helper should periodically check
whether the (original) parent process still exists while awaiting
termination. If the parent process disappears then the helper should
release the mutex and exit. This avoids stale mutexes. Note that a
helper should never wait for parent process ID 1!

If a non-0 status code is sent then the helper can exit immediately.
However, if the helper does not exit then it must terminate if it
receives SIGTERM.

Logging
-------

Anything written to standard error by a helper is incorporated into
CTDB's logs. A helper should generally only output to stderr for
unexpected errors and avoid output to stderr on success or on mutex
contention.
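
Example helper sketch
---------------------

The protocol described above can be sketched as a small script. This
is a minimal illustration only, not the helper supplied with CTDB. It
assumes the helper takes a single argument (the path of a lock file
in the cluster filesystem) and polls for its parent every 5 seconds.
The names try_lock, parent_gone and main, the argument convention and
the poll interval are all hypothetical. Reporting status 3 when the
helper starts with parent process ID 1 is also an assumption, chosen
so that the helper never waits on PID 1.

```python
#!/usr/bin/env python3
"""Sketch of a mutex helper: status on stdout, errors on stderr."""
import errno
import fcntl
import os
import signal
import sys
import time

HOLDING, CONTENTION, ERROR = "0", "1", "3"


def try_lock(path):
    """Try a non-blocking fcntl lock; return (status, fd or None)."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    except OSError as e:
        print(f"open {path}: {e}", file=sys.stderr)
        return ERROR, None
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        if e.errno in (errno.EACCES, errno.EAGAIN):
            # Contention is expected, so nothing is written to stderr.
            return CONTENTION, None
        print(f"lock {path}: {e}", file=sys.stderr)
        return ERROR, None
    return HOLDING, fd


def parent_gone(original_ppid):
    """True once the original parent has exited (we were re-parented)."""
    return os.getppid() != original_ppid


def main(path, poll_seconds=5):
    ppid = os.getppid()
    # On SIGTERM, exit; closing the fd at exit releases the fcntl lock.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    if ppid == 1:
        # Assumption: the parent vanished before we started, so do not
        # take the lock and never wait on PID 1.
        print("parent is gone", file=sys.stderr)
        status, fd = ERROR, None
    else:
        status, fd = try_lock(path)
    sys.stdout.write(status)
    sys.stdout.flush()
    if status != HOLDING:
        return
    # Hold the mutex until terminated or until the parent disappears.
    while not parent_gone(ppid):
        time.sleep(poll_seconds)
    os.close(fd)


# Only act as a helper when given exactly one argument (the lock file).
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Because fcntl(2) locks are released when the owning process exits, the
SIGTERM handler in this sketch only needs to exit. A helper built on a
different locking mechanism would need to release its mutex explicitly
before exiting.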