Martin Schwenke [Tue, 16 Jul 2013 09:57:18 +0000 (19:57 +1000)]
tests/eventscripts: Add tests for monitoring of missing interfaces
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 12 Jul 2013 02:48:34 +0000 (12:48 +1000)]
eventscripts: A missing interface should cause monitoring to fail
A missing interface is at least as bad as an interface with a link
that is down so should have a similar effect.
This couldn't be done previously because orphaned interfaces used to
be listed for monitoring. This was worked around in 10.interface in
commit 49b2d1bd9554461ed8edbfc21e777c0eca9e1443 and fixed in ctdbd in
commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.
If $CTDB_PARTIALLY_ONLINE_INTERFACES="yes" then monitoring won't
actually fail but the interface is still marked as down.
While we're touching this code, use "ip link" instead of "ip addr".
It is marginally cheaper but not enough for a separate patch. ;-)
This effectively reverts
d67955b42f7627be9dae995230c8fcbb8a948ec2.
Signed-off-by: Martin Schwenke <martin@meltin.net>
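The behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual 10.interface code: the function name monitor_interfaces and its error message are hypothetical, and only "ip link" and $CTDB_PARTIALLY_ONLINE_INTERFACES come from the commit message. A missing interface is reported, and monitoring fails unless partially online interfaces are allowed.

```sh
# Hypothetical sketch: fail monitoring when an interface is missing.
# Usage: monitor_interfaces IFACE...
monitor_interfaces ()
{
    _fail=false
    for _iface in "$@" ; do
        # "ip link show" exits non-zero if the interface does not exist
        if ! ip link show "$_iface" >/dev/null 2>&1 ; then
            echo "ERROR: Interface $_iface does not exist"
            _fail=true
        fi
    done

    # With CTDB_PARTIALLY_ONLINE_INTERFACES=yes the problem is still
    # reported above, but monitoring itself does not fail.
    if $_fail && [ "${CTDB_PARTIALLY_ONLINE_INTERFACES:-no}" != "yes" ] ; then
        return 1
    fi
    return 0
}
```

The real eventscript also has to mark the interface as down and check link state; that part is deliberately left out here.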
Martin Schwenke [Fri, 12 Jul 2013 02:33:36 +0000 (12:33 +1000)]
eventscripts: Get list of configured interfaces using "ctdb ifaces"
This was previously changed because ctdbd didn't garbage collect
orphaned interfaces. This was fixed in commit
cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 24 Jun 2013 05:49:48 +0000 (15:49 +1000)]
ctdbd: Allow extra recovery to repair persistent DBs during first recovery
Commit 8076773a9924dcf8aff16f7d96b2b9ac383ecc28 introduced a potential
regression because a node may not have completed the "recovered" event
(so might still be in CTDB_RUNSTATE_FIRST_RECOVERY) when another node
becomes healthy.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Tue, 16 Jul 2013 02:53:16 +0000 (12:53 +1000)]
packaging: Bundle debug_locks.sh script in RPM
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Tue, 16 Jul 2013 02:52:00 +0000 (12:52 +1000)]
packaging: No need to check for existence of scripts, they always do
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Thu, 11 Jul 2013 04:26:38 +0000 (14:26 +1000)]
scripts: ctdbd_wrapper logs a message to syslog if syslog is not being used
It can be very disconcerting when logging to syslog is expected but
nothing is being logged there.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Mathieu Parent [Fri, 7 Jun 2013 17:01:06 +0000 (19:01 +0200)]
Update Nagios check to work with ctdb versions past 30 Aug 2011
Because of commit a779d83a6213e2ba
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Jul 2013 03:01:13 +0000 (13:01 +1000)]
recoverd: Really fix bogus info in message about changed flags
Commit 9119a568c2b4601318f7751f537dca2f92a7230b attempted to fix this.
However, this was wrong because old_flags and new_flags were confused.
The latter has since been fixed in commit
7eb2f89979360b6cc98ca9b17c48310277fa89fc so this can now be fixed
properly.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 10 Jul 2013 04:44:56 +0000 (14:44 +1000)]
doc: Update NEWS
Signed-off-by: Martin Schwenke <martin@meltin.net>
Sumit Bose [Mon, 19 Nov 2012 17:45:37 +0000 (18:45 +0100)]
Print deleted nodes as well
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Sumit Bose [Thu, 1 Sep 2011 13:18:46 +0000 (15:18 +0200)]
IPv6 neighbor solicit cleanup
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Sumit Bose [Mon, 19 Nov 2012 10:13:03 +0000 (11:13 +0100)]
Fix memory leak in ctdb_send_message()
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Sumit Bose [Wed, 10 Aug 2011 15:53:56 +0000 (17:53 +0200)]
Fixes for various issues found by Coverity
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Sumit Bose [Mon, 19 Nov 2012 10:20:31 +0000 (11:20 +0100)]
Check return value of tdb_delete()
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 11 Jul 2013 03:46:18 +0000 (13:46 +1000)]
web: Update webpages
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 11 Jul 2013 01:34:46 +0000 (11:34 +1000)]
Tests: Correct the arguments to memset
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 10 Jul 2013 04:44:56 +0000 (14:44 +1000)]
doc: Update NEWS
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-programmed-with: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 10 Jul 2013 07:19:55 +0000 (17:19 +1000)]
packaging: Add systemd support
Based on an original patch by Sumit Bose <sbose@redhat.com>.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 10 Jul 2013 06:35:53 +0000 (16:35 +1000)]
build: Turn off all deprecation warnings
The "‘tevent_loop_allow_nesting’ is deprecated" warnings will be
around for a while and are annoying.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 10 Jul 2013 06:30:29 +0000 (16:30 +1000)]
build: Remove -DTEVENT_DEPRECATED_QUIET=1 from CFLAGS
This reverts the last part of
788cdbddbc902a5b076d23473450065b551d274d
- the rest of this has been implicitly reverted via tevent syncs.
This is just leftover noise.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 9 Jul 2013 05:22:07 +0000 (15:22 +1000)]
initscript: Simplify initscript and control CTDB via new ctdbd_wrapper
Currently the initscript is very complex. This makes it hard to read
and hard to add support for new init systems, such as systemd.
Create a wrapper called ctdbd_wrapper to be installed alongside ctdbd.
This is called by the initscript to start and stop ctdbd. It does the
ctdbd option construction and waits until ctdbd is properly initialised
before it exits.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
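The "waits until ctdbd is properly initialised" step amounts to a bounded polling loop. A minimal sketch, assuming a one-second retry interval; the helper name wait_until is hypothetical and is not taken from ctdbd_wrapper itself:

```sh
# Hypothetical helper: retry a command once a second until it succeeds
# or the timeout (in seconds) expires.
# Usage: wait_until TIMEOUT COMMAND [ARG...]
wait_until ()
{
    _timeout=$1 ; shift
    _count=0
    while [ "$_count" -lt "$_timeout" ] ; do
        if "$@" ; then
            return 0
        fi
        _count=$((_count + 1))
        sleep 1
    done
    return 1
}
```

A wrapper could call this as "wait_until 10 some_readiness_check", where some_readiness_check is a placeholder for whatever test indicates that the daemon has finished initialising.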
Martin Schwenke [Mon, 8 Jul 2013 02:45:31 +0000 (12:45 +1000)]
recoverd: Recovery daemon should use ctdb_get_pnn, which can't fail
Signed-off-by: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Wed, 10 Jul 2013 02:23:30 +0000 (12:23 +1000)]
ctdbd: Print tdb flags when logging attached to database message
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Tue, 9 Jul 2013 02:32:53 +0000 (12:32 +1000)]
ctdbd: Set process names for child processes
This helps distinguish processes in process list in top, perf, etc.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Tue, 9 Jul 2013 02:24:59 +0000 (12:24 +1000)]
common/system: Add ctdb_set_process_name() function
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 6 Jun 2013 06:29:04 +0000 (16:29 +1000)]
traverse: Remove unused start_time field
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 6 Jun 2013 06:26:25 +0000 (16:26 +1000)]
traverse: Send records directly from traverse child to srcnode
Currently CTDB daemon reads records from a child process and then sends them to
srcnode via TRAVERSE_DATA control. This ties up main CTDB daemon and also
requires an extra copy of the record in the CTDB daemon. Instead send records
directly from traverse child process.
The control from child process still goes via local CTDB daemon as there
is no infrastructure currently to open a TCP socket to the srcnode.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 6 Jun 2013 06:12:07 +0000 (16:12 +1000)]
traverse: Pass reqid and srcnode information to local database traverse
So that traverse child process can directly send the TRAVERSE_DATA control to
the srcnode without first sending it to local node.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 8 Jul 2013 06:14:59 +0000 (16:14 +1000)]
packaging: When building with system libraries, add dependency for them
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 8 Jul 2013 05:49:58 +0000 (15:49 +1000)]
ctdbd: No need for DeadlockTimeout tunable
The code for deadlock detection and killing smbd process causing deadlock
has been removed and replaced with external debug script.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 8 Jul 2013 05:57:22 +0000 (15:57 +1000)]
initscript: Export CTDB_DEBUG_LOCKS variable
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 8 Jul 2013 05:56:30 +0000 (15:56 +1000)]
scripts: Add an example debug_locks.sh script to debug locking issue
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 8 Jul 2013 05:46:53 +0000 (15:46 +1000)]
locking: Use external script to debug locking issues
Use an external script to parse /proc/locks and log useful debugging
information about locks rather than doing that in C code.
To use this feature, add the following configuration variable to /etc/sysconfig/ctdb:
CTDB_DEBUG_LOCKS=/etc/ctdb/debug_locks.sh
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
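To give an idea of what such an external script might do, here is a minimal sketch that extracts the PIDs of blocked waiters from /proc/locks. It is not the debug_locks.sh shipped with CTDB; the function name blocked_lock_pids is hypothetical, and the sketch assumes the /proc/locks layout in which a blocked waiter has "->" as its second field.

```sh
# Hypothetical sketch: print the PIDs of processes waiting on a lock.
# Reads /proc/locks-style input on stdin, e.g.:
#   1: POSIX  ADVISORY  WRITE 1234 08:01:5678 0 EOF
#   1: -> POSIX  ADVISORY  WRITE 4321 08:01:5678 0 EOF
blocked_lock_pids ()
{
    # For blocked entries the "->" marker shifts the PID to field 6
    awk '$2 == "->" { print $6 }' | sort -u
}
```

It would be run as "blocked_lock_pids </proc/locks"; a real debug script would go on to log more detail about each PID.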
Amitay Isaacs [Wed, 3 Jul 2013 01:01:21 +0000 (11:01 +1000)]
locking: Update locking bucket intervals
0 < 1 ms
1 < 10 ms
2 < 100 ms
3 < 1 s
4 < 2 s
5 < 4 s
6 < 8 s
7 < 16 s
8 < 32 s
9 < 64 s
10 >= 64 s
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
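The bucket table above maps a lock latency to an index: three decimal steps up to 100 ms, then doubling from 1 s to 64 s. A minimal sketch of that mapping, assuming the latency is given in whole milliseconds; the function name lock_bucket is hypothetical and the daemon's own C code may use different units:

```sh
# Hypothetical sketch: map a latency in whole milliseconds to a bucket.
# Usage: lock_bucket MILLISECONDS
lock_bucket ()
{
    _ms=$1
    if [ "$_ms" -lt 1 ] ; then
        echo 0
    elif [ "$_ms" -lt 10 ] ; then
        echo 1
    elif [ "$_ms" -lt 100 ] ; then
        echo 2
    else
        # Bucket 3 is < 1 s; each later bucket doubles the limit,
        # and bucket 10 catches everything >= 64 s.
        _bucket=3
        _limit=1000
        while [ "$_bucket" -lt 10 ] && [ "$_ms" -ge "$_limit" ] ; do
            _bucket=$((_bucket + 1))
            _limit=$((_limit * 2))
        done
        echo "$_bucket"
    fi
}
```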
Amitay Isaacs [Wed, 3 Jul 2013 01:46:53 +0000 (11:46 +1000)]
locking: Update locks latency in CTDB statistics only for RECORD or DB locks
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Tue, 25 Jun 2013 05:36:13 +0000 (15:36 +1000)]
tools/ctdb: Fix the format of DB statistics output
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Tue, 25 Jun 2013 05:25:16 +0000 (15:25 +1000)]
ctdbd: Remove incomplete ctdb_db_statistics_wire structure
Send the ctdb_db_statistics directly instead of first copying it to
duplicate ctdb_db_statistics_wire structure. This simplifies the
implementation of the control to get database statistics.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 3 Jul 2013 23:04:49 +0000 (09:04 +1000)]
ctdbd: Update debug messages for setting readonly property on database
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Fri, 5 Jul 2013 04:04:20 +0000 (14:04 +1000)]
recoverd: Fix buffer overflow error in reloadips
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 4 Jul 2013 10:02:29 +0000 (20:02 +1000)]
tests/eventscripts: Add some rudimentary tests for 60.ganesha
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 4 Jul 2013 06:05:01 +0000 (16:05 +1000)]
eventscripts: New configuration variable $CTDB_SKIP_GANESHA_NFSD_CHECK
This allows 60.ganesha to be unit tested, except for the core Ganesha
monitoring code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 4 Jul 2013 06:00:33 +0000 (16:00 +1000)]
eventscript: Move Ganesha nfsd monitoring to a function
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 4 Jul 2013 05:11:54 +0000 (15:11 +1000)]
eventscripts: Drop RPC service version from nfs_check_rpc_service() calls
Support for this was removed in commit
77302dbfd85754e02559eccb2dd6c090db0b6b9f and I overlooked its use in
60.ganesha.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Tue, 2 Jul 2013 04:43:17 +0000 (14:43 +1000)]
ctdbd: Log something when releasing all IPs
At the moment this is silent and it can be confusing to see IPs just
disappear.
Also, this message:
Been in recovery mode for too long. Dropping all IPS
can cause anxiety when all IPs should already have been dropped.
Adding a comforting message saying that 0 IPs were dropped relieves
such anxiety. :-)
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 09:00:36 +0000 (19:00 +1000)]
recoverd: Minor style improvements for ctdb_reload_remote_public_ips()
* Add a variable to the loop to make the code more readable and have
it generally fit into 80 columns.
* Improve comments.
* Improve log messages.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 08:45:46 +0000 (18:45 +1000)]
recoverd: Clean up log messages in remote IP verification
The log messages in verify_remote_ip_allocation() are confusing
because they don't include the PNN of the problem node, because it is
not known in this function.
Add the PNN of the node being verified as a function argument and then
shuffle the log messages around to make them clearer.
Also fold 3 nested if statements into just one.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 07:57:33 +0000 (17:57 +1000)]
recoverd: Fix an unclear log message - "Restart recovery process"
When the recovery master notices a node in recovery mode it starts the
recovery process, it doesn't restart it.
Update documentation to match.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 07:53:37 +0000 (17:53 +1000)]
recoverd: Fix an incorrect comment
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 07:48:01 +0000 (17:48 +1000)]
ctdbd: Use ctdb_die() on "setup" event failure
This is slightly easier to read because it all fits on 1 line.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 07:43:52 +0000 (17:43 +1000)]
ctdbd: Avoid a core dump when "init" event fails
The "init" event only really fails in the scripts, which should log
something useful on failure. Therefore, a core dump isn't terribly
useful and sometimes attracts unwanted attention.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 07:42:11 +0000 (17:42 +1000)]
util: New function ctdb_die()
This is like ctdb_fatal() but exits cleanly without dumping core or
generating a backtrace.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 24 Jun 2013 09:03:26 +0000 (19:03 +1000)]
eventscripts: When replaying monitor status, don't log empty output
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 24 Jun 2013 06:05:03 +0000 (16:05 +1000)]
ctdbd: Release IP callback should fail if the IP is still hosted
At the moment there are (at least) 2 bugs that cause rogue IPs:
* A race where release_ip_callback() runs after a "subsequent" take IP
has completed. The IP is back on an interface but we unset
vnn->iface in the callback.
* A "releaseip" eventscript times out. We ignore the timeout and call
it success, deleting the VNN even if the IP is still hosted.
We could decide not to ignore the timeout and ban the node, but
killing TCP connections can take a long time and that might result
in a lot of banning. We probably won't reinstate banning on
"releaseip" until killing TCP connections has been optimised.
In both cases, a rogue IP can be avoided by leaving vnn->iface set and
simply failing the control.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Mon, 24 Jun 2013 05:49:48 +0000 (15:49 +1000)]
ctdbd: Log warnings in release IP when unexpected interface is encountered
Previous code changes work around potential problems but do not
provide useful information when a problem occurs.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 07:37:05 +0000 (17:37 +1000)]
ping_pong: Validate num_locks argument > 0
This fixes the floating point error if num_locks = 0.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
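A zero lock count presumably ends up as the divisor in a division or modulo, which is what raises the floating point exception. ping_pong itself is a C program; the check is sketched here in shell only to illustrate the validation, and the function name valid_num_locks is hypothetical:

```sh
# Hypothetical sketch: accept only a positive integer lock count.
# Usage: valid_num_locks VALUE
valid_num_locks ()
{
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain decimal number
    esac
    [ "$1" -gt 0 ]
}
```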
Amitay Isaacs [Thu, 4 Jul 2013 07:27:00 +0000 (17:27 +1000)]
tests: If connection to ctdb daemon fails, exit
This fixes the segmentation error if any of the test code fails to
connect to CTDB daemon.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 07:00:23 +0000 (17:00 +1000)]
build: Fix compiler warnings for uninitialized variables
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 05:36:29 +0000 (15:36 +1000)]
recoverd: Send the result from child process only once
The result has already been sent before the child starts waiting for
the parent ctdbd process.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 05:31:52 +0000 (15:31 +1000)]
packaging: Enable compiler optimizations
This reverts
d09570c70551aa40390ce9ceffe7bc234e1afafe.
... hoping the segv has been found in last 6 years. :-)
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 05:14:10 +0000 (15:14 +1000)]
packaging: Allow building RPMs with system tdb/talloc/tevent
To build CTDB RPMs with system installed libraries, use the following command:
./packaging/RPM/makerpms.sh \
--with system_talloc \
--with system_tdb \
--with system_tevent
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 04:29:09 +0000 (14:29 +1000)]
packaging: Do not mark /etc/ctdb/functions as configuration file
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 4 Jul 2013 03:19:56 +0000 (13:19 +1000)]
packaging: Install README.notify.d using %doc directive
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Thu, 4 Jul 2013 02:45:32 +0000 (12:45 +1000)]
packaging: Install docs using %doc directive
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Thu, 4 Jul 2013 01:33:38 +0000 (11:33 +1000)]
packaging: Remove ctdb_transaction from docdir
It's bundled in ctdb-tests package.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Sun, 30 Jun 2013 07:23:08 +0000 (17:23 +1000)]
doc: Add a disclaimer for the EnableBans tunable
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Sun, 30 Jun 2013 07:22:06 +0000 (17:22 +1000)]
doc: Add banning bug fixes to NEWS
Signed-off-by: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Tue, 2 Jul 2013 02:40:37 +0000 (12:40 +1000)]
ctdbd: Don't ban self if init or shutdown event fails
There is no point in banning the node if init or shutdown event times
out since it's going to quit anyway.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 27 Jun 2013 07:46:43 +0000 (17:46 +1000)]
doc: The second half of monitoring is only for recovery master
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Michael Adam [Wed, 26 Jun 2013 07:23:22 +0000 (09:23 +0200)]
recoverd: when the recmaster is banned, use that information when forcing an election
When we trigger an election because the recmaster considers itself inactive,
update our local nodemap with the recmaster's flags before calling
force_election(). This way, we don't send the inactive node freeze commands
(e.g.) that may fail and then lead to ourselves getting banned.
The theory is that this should help avoiding banning loops.
Signed-off-by: Michael Adam <obnox@samba.org>
Michael Adam [Wed, 26 Jun 2013 05:11:51 +0000 (07:11 +0200)]
recoverd: fix a comment typo
Signed-off-by: Michael Adam <obnox@samba.org>
Michael Adam [Fri, 21 Jun 2013 15:57:37 +0000 (17:57 +0200)]
recoverd: fix a comment in main_loop
Signed-off-by: Michael Adam <obnox@samba.org>
Michael Adam [Fri, 21 Jun 2013 12:06:22 +0000 (14:06 +0200)]
recoverd: eliminate some trailing spaces from ctdb_election_win()
Signed-off-by: Michael Adam <obnox@samba.org>
Martin Schwenke [Fri, 28 Jun 2013 06:31:07 +0000 (16:31 +1000)]
recoverd: Don't continue if the current node gets banned
A banned node cannot continue with recovery or with monitoring the cluster.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Fri, 28 Jun 2013 04:31:02 +0000 (14:31 +1000)]
recoverd: Refactor code to ban misbehaving nodes
Since we have nodemap information, there is no need to hardcode the
limit of 20.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Thu, 27 Jun 2013 06:01:16 +0000 (16:01 +1000)]
recoverd: Move code to ban other nodes after we get local node flags
If a node gets banned first, then it should not ban other nodes.
This code was moved up in main_loop to avoid waiting for nodemap
from other nodes (commit
83b0261f2cb453195b86f547d360400103a8b795).
To prevent a banned node from banning other nodes, we need to first get
nodemap information from local node, so trying to ban other nodes can
fail if we are already banned.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 27 Jun 2013 05:44:27 +0000 (15:44 +1000)]
recoverd: Delay the initial election if node is started in stopped state
Since there is an early exit if a node is stopped or banned, we can wait till
the node becomes active to start initial election.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 27 Jun 2013 05:33:49 +0000 (15:33 +1000)]
recoverd: Update capabilities only if the current node is active
Since we do an early return if a node is stopped or banned, move update
capabilities code below the early return and just before we check the
capabilities of current recovery master.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 27 Jun 2013 05:46:04 +0000 (15:46 +1000)]
recoverd: No need to check if node is recovery master when inactive
If a node is stopped or banned, it will cause early return from the
main_loop, so this check is redundant. The election will be called by an
active node.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 27 Jun 2013 05:39:15 +0000 (15:39 +1000)]
recoverd: Always do an early exit from main_loop if node is stopped or banned
A stopped or banned node cannot do anything useful. So do not participate
in any cluster activity and do not cause any unnecessary network traffic.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Fri, 28 Jun 2013 04:10:47 +0000 (14:10 +1000)]
recoverd: Do not set banning credits on a node if current node is inactive
If the current node is banned or stopped, then it should not assign banning
credits to other nodes since the current node will not have up-to-date flags
of other nodes.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 1 Jul 2013 07:40:36 +0000 (17:40 +1000)]
banning: Do not come out of ban if databases are not frozen
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 24 Jun 2013 04:33:32 +0000 (14:33 +1000)]
banning: No need to check if banned pnn is for local node
If the banned pnn is not the local node, the function returns early.
So no need for additional check.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Fri, 28 Jun 2013 04:04:18 +0000 (14:04 +1000)]
banning: Make ctdb_local_node_got_banned() a void function
When this function is called, we are already committed to banning
and there is no point in failing this function. If freezing of
databases fails, it will be fixed by the recovery daemon.
Amitay Isaacs [Fri, 28 Jun 2013 04:02:44 +0000 (14:02 +1000)]
recoverd: Also check if current node is in recovery when it is banned
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Fri, 28 Jun 2013 04:09:35 +0000 (14:09 +1000)]
recoverd: Set node_flags information as soon as we get nodemap
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 26 Jun 2013 06:02:23 +0000 (16:02 +1000)]
recoverd: Remove old comment as the code corresponding to that has gone away
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 24 Jun 2013 04:31:50 +0000 (14:31 +1000)]
banning: Log ban state changes for other nodes at higher debug level
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 1 Jul 2013 06:28:04 +0000 (16:28 +1000)]
freeze: Make ctdb_start_freeze() a void function
If this function fails due to memory errors, there is no way to recover.
The best course of action is to abort.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 1 Jul 2013 06:21:00 +0000 (16:21 +1000)]
freeze: If priority is invalid here, it's time to abort
ctdb_start_freeze() is called from ctdb_control_freeze() which fixes the
priority if it's 0 and returns an error if it's invalid. Other callers of
ctdb_start_freeze() are internal to CTDB. So if priority is invalid in
ctdb_start_freeze(), definitely something is seriously wrong.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 1 Jul 2013 03:26:33 +0000 (13:26 +1000)]
freeze: Log message from ctdb_start_freeze() and ctdb_control_freeze()
This ensures that whenever databases are frozen either via sending
control or by calling ctdb_start_freeze(), the action is logged.
Since ctdb_control_freeze() calls ctdb_start_freeze(), move logging of
message in early return condition if databases are already frozen.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 24 Jun 2013 04:18:58 +0000 (14:18 +1000)]
recoverd: Print banning message only after verifying pnn
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 26 Jun 2013 05:22:46 +0000 (15:22 +1000)]
recoverd: When updating flags on nodes, send updated flags and not old flags
This was broken by commit
a9a1156ea4e10483a4bf4265b8e9203f0af033aa.
Instead of a SRVID_SET_NODE_FLAGS message to recovery daemon, a control
was sent to the local daemon which in turn informed the recovery daemon.
And while doing this change old flags were sent via CONTROL_MODIFY_FLAGS.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Wed, 26 Jun 2013 04:34:47 +0000 (14:34 +1000)]
tools/ctdb: Add "force" option to "recover" command
At the moment there is no easy way to force a recovery when attempting
to reproduce certain classes of bugs. This option is added without
documentation because it is dangerous until the bugs are fixed! :-)
Signed-off-by: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Mon, 24 Jun 2013 07:37:15 +0000 (17:37 +1000)]
client: Exit with non-zero status when unix socket is closed
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Fri, 21 Jun 2013 04:49:20 +0000 (14:49 +1000)]
doc: Fix ctdb ping entry in manpage
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 21 Jun 2013 04:47:20 +0000 (14:47 +1000)]
doc: Fix documentation for NoIPTakeover in ctdbd manpage
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 21 Jun 2013 04:33:12 +0000 (14:33 +1000)]
doc: Update notification script section in ctdbd manpage
The example notification script is now much more useful.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 21 Jun 2013 04:32:50 +0000 (14:32 +1000)]
doc: Add nodestatus command to the ctdb manpage
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 21 Jun 2013 00:52:05 +0000 (10:52 +1000)]
doc: Update NEWS
Signed-off-by: Martin Schwenke <martin@meltin.net>