amitay/ctdb.git
10 years agotests/eventscripts: Add tests for monitoring of missing interfaces
Martin Schwenke [Tue, 16 Jul 2013 09:57:18 +0000 (19:57 +1000)]
tests/eventscripts: Add tests for monitoring of missing interfaces

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: A missing interface should cause monitoring to fail
Martin Schwenke [Fri, 12 Jul 2013 02:48:34 +0000 (12:48 +1000)]
eventscripts: A missing interface should cause monitoring to fail

A missing interface is at least as bad as an interface with a link
that is down so should have a similar effect.

This couldn't be done previously because orphaned interfaces used to
be listed for monitoring.  This was worked around in 10.interface in
commit 49b2d1bd9554461ed8edbfc21e777c0eca9e1443 and fixed in ctdbd in
commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.

If $CTDB_PARTIALLY_ONLINE_INTERFACES="yes" then monitoring won't
actually fail but the interface is still marked as down.

While we're touching this code, use "ip link" instead of "ip addr".
It is marginally cheaper but not enough for a separate patch.  ;-)

This effectively reverts d67955b42f7627be9dae995230c8fcbb8a948ec2.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Get list of configured interfaces using "ctdb ifaces"
Martin Schwenke [Fri, 12 Jul 2013 02:33:36 +0000 (12:33 +1000)]
eventscripts: Get list of configured interfaces using "ctdb ifaces"

This was previously changed because ctdbd didn't garbage collect
orphaned interfaces.  This was fixed in commit
cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Allow extra recovery to repair persistent DBs during first recovery
Martin Schwenke [Mon, 24 Jun 2013 05:49:48 +0000 (15:49 +1000)]
ctdbd: Allow extra recovery to repair persistent DBs during first recovery

Commit 8076773a9924dcf8aff16f7d96b2b9ac383ecc28 introduced a potential
regression because a node may not have completed the "recovered" event
(so might still be in CTDB_RUNSTATE_FIRST_RECOVERY) when another node
becomes healthy.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agopackaging: Bundle debug_locks.sh script in RPM
Amitay Isaacs [Tue, 16 Jul 2013 02:53:16 +0000 (12:53 +1000)]
packaging: Bundle debug_locks.sh script in RPM

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: No need to check for existence of scripts, they always exist
Amitay Isaacs [Tue, 16 Jul 2013 02:52:00 +0000 (12:52 +1000)]
packaging: No need to check for existence of scripts, they always exist

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoscripts: ctdbd_wrapper logs a message to syslog if syslog is not being used
Martin Schwenke [Thu, 11 Jul 2013 04:26:38 +0000 (14:26 +1000)]
scripts: ctdbd_wrapper logs a message to syslog if syslog is not being used

It can be very disconcerting when logging to syslog is expected but
nothing is being logged there.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoUpdate Nagios check to work with ctdb versions past 30 Aug 2011
Mathieu Parent [Fri, 7 Jun 2013 17:01:06 +0000 (19:01 +0200)]
Update Nagios check to work with ctdb versions past 30 Aug 2011

Because of commit a779d83a6213e2ba

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Really fix bogus info in message about changed flags
Martin Schwenke [Thu, 11 Jul 2013 03:01:13 +0000 (13:01 +1000)]
recoverd: Really fix bogus info in message about changed flags

Commit 9119a568c2b4601318f7751f537dca2f92a7230b attempted to fix this.
However, this was wrong because old_flags and new_flags were confused.
The latter has since been fixed in commit
7eb2f89979360b6cc98ca9b17c48310277fa89fc so this can now be fixed
properly.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Update NEWS
Martin Schwenke [Wed, 10 Jul 2013 04:44:56 +0000 (14:44 +1000)]
doc: Update NEWS

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoPrint deleted nodes as well
Sumit Bose [Mon, 19 Nov 2012 17:45:37 +0000 (18:45 +0100)]
Print deleted nodes as well

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoIPv6 neighbor solicit cleanup
Sumit Bose [Thu, 1 Sep 2011 13:18:46 +0000 (15:18 +0200)]
IPv6 neighbor solicit cleanup

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoFix memory leak in ctdb_send_message()
Sumit Bose [Mon, 19 Nov 2012 10:13:03 +0000 (11:13 +0100)]
Fix memory leak in ctdb_send_message()

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoFixes for various issues found by Coverity
Sumit Bose [Wed, 10 Aug 2011 15:53:56 +0000 (17:53 +0200)]
Fixes for various issues found by Coverity

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoCheck return value of tdb_delete()
Sumit Bose [Mon, 19 Nov 2012 10:20:31 +0000 (11:20 +0100)]
Check return value of tdb_delete()

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoweb: Update webpages
Amitay Isaacs [Thu, 11 Jul 2013 03:46:18 +0000 (13:46 +1000)]
web: Update webpages

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoTests: Correct the arguments to memset
Amitay Isaacs [Thu, 11 Jul 2013 01:34:46 +0000 (11:34 +1000)]
Tests: Correct the arguments to memset

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agodoc: Update NEWS
Amitay Isaacs [Wed, 10 Jul 2013 04:44:56 +0000 (14:44 +1000)]
doc: Update NEWS

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-programmed-with: Martin Schwenke <martin@meltin.net>

10 years agopackaging: Add systemd support
Martin Schwenke [Wed, 10 Jul 2013 07:19:55 +0000 (17:19 +1000)]
packaging: Add systemd support

Based on an original patch by Sumit Bose <sbose@redhat.com>.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agobuild: Turn off all deprecation warnings
Martin Schwenke [Wed, 10 Jul 2013 06:35:53 +0000 (16:35 +1000)]
build: Turn off all deprecation warnings

The "‘tevent_loop_allow_nesting’ is deprecated" warnings will be
around for a while and are annoying.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agobuild: Remove -DTEVENT_DEPRECATED_QUIET=1 from CFLAGS
Martin Schwenke [Wed, 10 Jul 2013 06:30:29 +0000 (16:30 +1000)]
build: Remove -DTEVENT_DEPRECATED_QUIET=1 from CFLAGS

This reverts the last part of 788cdbddbc902a5b076d23473450065b551d274d
- the rest of this has been implicitly reverted via tevent syncs.
This is just leftover noise.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoinitscript: Simplify initscript and control CTDB via new ctdbd_wrapper
Martin Schwenke [Tue, 9 Jul 2013 05:22:07 +0000 (15:22 +1000)]
initscript: Simplify initscript and control CTDB via new ctdbd_wrapper

Currently the initscript is very complex.  This makes it hard to read
and hard to add support for new init systems, such as systemd.

Create a wrapper called ctdbd_wrapper to be installed alongside ctdbd.
This is called by the initscript to start and stop ctdbd.  It
constructs the ctdbd options and waits until ctdbd is properly
initialised before it exits.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agorecoverd: Recovery daemon should use ctdb_get_pnn, which can't fail
Martin Schwenke [Mon, 8 Jul 2013 02:45:31 +0000 (12:45 +1000)]
recoverd: Recovery daemon should use ctdb_get_pnn, which can't fail

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Print tdb flags when logging attached to database message
Amitay Isaacs [Wed, 10 Jul 2013 02:23:30 +0000 (12:23 +1000)]
ctdbd: Print tdb flags when logging attached to database message

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Set process names for child processes
Amitay Isaacs [Tue, 9 Jul 2013 02:32:53 +0000 (12:32 +1000)]
ctdbd: Set process names for child processes

This helps distinguish processes in process list in top, perf, etc.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agocommon/system: Add ctdb_set_process_name() function
Amitay Isaacs [Tue, 9 Jul 2013 02:24:59 +0000 (12:24 +1000)]
common/system: Add ctdb_set_process_name() function

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
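The log does not show how ctdb_set_process_name() is implemented.  As a
rough illustration only, assuming Linux and its prctl(2) interface, a
helper of this kind could look like the sketch below.  The names
sketch_set_process_name() and sketch_get_process_name() are hypothetical
and are not CTDB functions.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical helper (not CTDB code): set the kernel-visible name of the
 * calling thread.  Linux keeps at most 15 characters plus the NUL, so
 * longer names are truncated here before being handed to prctl(). */
int sketch_set_process_name(const char *name)
{
    char buf[16];

    assert(name != NULL);
    strncpy(buf, name, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return prctl(PR_SET_NAME, (unsigned long)buf, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: read the name back into a caller-supplied buffer
 * that must hold at least 16 bytes. */
int sketch_get_process_name(char *buf16)
{
    assert(buf16 != NULL);
    return prctl(PR_GET_NAME, (unsigned long)buf16, 0UL, 0UL, 0UL);
}
```

A child process would call the setter once, right after fork(), so that
tools reading the kernel's short name for the process can tell the
children apart.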
10 years agotraverse: Remove unused start_time field
Amitay Isaacs [Thu, 6 Jun 2013 06:29:04 +0000 (16:29 +1000)]
traverse: Remove unused start_time field

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotraverse: Send records directly from traverse child to srcnode
Amitay Isaacs [Thu, 6 Jun 2013 06:26:25 +0000 (16:26 +1000)]
traverse: Send records directly from traverse child to srcnode

Currently CTDB daemon reads records from a child process and then sends them to
srcnode via TRAVERSE_DATA control.  This ties up main CTDB daemon and also
requires an extra copy of the record in the CTDB daemon.  Instead send records
directly from traverse child process.

The control from child process still goes via local CTDB daemon as there
is no infrastructure currently to open a TCP socket to the srcnode.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotraverse: Pass reqid and srcnode information to local database traverse
Amitay Isaacs [Thu, 6 Jun 2013 06:12:07 +0000 (16:12 +1000)]
traverse: Pass reqid and srcnode information to local database traverse

So that traverse child process can directly send the TRAVERSE_DATA control to
the srcnode without first sending it to local node.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: When building with system libraries, add dependency for them
Amitay Isaacs [Mon, 8 Jul 2013 06:14:59 +0000 (16:14 +1000)]
packaging: When building with system libraries, add dependency for them

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: No need for DeadlockTimeout tunable
Amitay Isaacs [Mon, 8 Jul 2013 05:49:58 +0000 (15:49 +1000)]
ctdbd: No need for DeadlockTimeout tunable

The code for deadlock detection and for killing the smbd process causing
the deadlock has been removed and replaced with an external debug script.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoinitscript: Export CTDB_DEBUG_LOCKS variable
Amitay Isaacs [Mon, 8 Jul 2013 05:57:22 +0000 (15:57 +1000)]
initscript: Export CTDB_DEBUG_LOCKS variable

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoscripts: Add an example debug_locks.sh script to debug locking issue
Amitay Isaacs [Mon, 8 Jul 2013 05:56:30 +0000 (15:56 +1000)]
scripts: Add an example debug_locks.sh script to debug locking issue

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agolocking: Use external script to debug locking issues
Amitay Isaacs [Mon, 8 Jul 2013 05:46:53 +0000 (15:46 +1000)]
locking: Use external script to debug locking issues

Use an external script to parse /proc/locks and log useful debugging
information about locks rather than doing that in C code.

To use this feature, add the following configuration variable to /etc/sysconfig/ctdb:

  CTDB_DEBUG_LOCKS=/etc/ctdb/debug_locks.sh

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agolocking: Update locking bucket intervals
Amitay Isaacs [Wed, 3 Jul 2013 01:01:21 +0000 (11:01 +1000)]
locking: Update locking bucket intervals

 0   < 1 ms
 1   < 10 ms
 2   < 100 ms
 3   < 1 s
 4   < 2 s
 5   < 4 s
 6   < 8 s
 7   < 16 s
 8   < 32 s
 9   < 64 s
10   >= 64 s

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
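The bucket table above can be read as a mapping from a lock latency to a
bucket index.  A minimal sketch of that mapping follows, assuming the
latency is held in seconds as a double.  latency_to_bucket() is a
hypothetical name and this is not CTDB's own code.

```c
#include <assert.h>

/* Hypothetical helper: map a lock latency (in seconds) to bucket 0..10,
 * following the intervals listed in the commit message. */
int latency_to_bucket(double latency)
{
    /* Exclusive upper bounds, in seconds, for buckets 0 to 9. */
    static const double bounds[] = {
        0.001, 0.01, 0.1, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0
    };
    int i;

    assert(latency >= 0.0);
    for (i = 0; i < 10; i++) {
        if (latency < bounds[i]) {
            return i;
        }
    }
    /* Anything of 64 seconds or more lands in the last bucket. */
    return 10;
}
```

The first four buckets grow by a factor of ten and the rest double, so
the sub-second range gets fine resolution while very long waits are
still counted separately.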
10 years agolocking: Update locks latency in CTDB statistics only for RECORD or DB locks
Amitay Isaacs [Wed, 3 Jul 2013 01:46:53 +0000 (11:46 +1000)]
locking: Update locks latency in CTDB statistics only for RECORD or DB locks

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotools/ctdb: Fix the format of DB statistics output
Amitay Isaacs [Tue, 25 Jun 2013 05:36:13 +0000 (15:36 +1000)]
tools/ctdb: Fix the format of DB statistics output

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Remove incomplete ctdb_db_statistics_wire structure
Amitay Isaacs [Tue, 25 Jun 2013 05:25:16 +0000 (15:25 +1000)]
ctdbd: Remove incomplete ctdb_db_statistics_wire structure

Send the ctdb_db_statistics directly instead of first copying it to
duplicate ctdb_db_statistics_wire structure.  This simplifies the
implementation of the control to get database statistics.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Update debug messages for setting readonly property on database
Amitay Isaacs [Wed, 3 Jul 2013 23:04:49 +0000 (09:04 +1000)]
ctdbd: Update debug messages for setting readonly property on database

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Fix buffer overflow error in reloadips
Amitay Isaacs [Fri, 5 Jul 2013 04:04:20 +0000 (14:04 +1000)]
recoverd: Fix buffer overflow error in reloadips

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

10 years agotests/eventscripts: Add some rudimentary tests for 60.ganesha
Martin Schwenke [Thu, 4 Jul 2013 10:02:29 +0000 (20:02 +1000)]
tests/eventscripts: Add some rudimentary tests for 60.ganesha

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: New configuration variable $CTDB_SKIP_GANESHA_NFSD_CHECK
Martin Schwenke [Thu, 4 Jul 2013 06:05:01 +0000 (16:05 +1000)]
eventscripts: New configuration variable $CTDB_SKIP_GANESHA_NFSD_CHECK

This allows 60.ganesha to be unit tested, except for the core Ganesha
monitoring code.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscript: Move Ganesha nfsd monitoring to a function
Martin Schwenke [Thu, 4 Jul 2013 06:00:33 +0000 (16:00 +1000)]
eventscript: Move Ganesha nfsd monitoring to a function

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Drop RPC service version from nfs_check_rpc_service() calls
Martin Schwenke [Thu, 4 Jul 2013 05:11:54 +0000 (15:11 +1000)]
eventscripts: Drop RPC service version from nfs_check_rpc_service() calls

Support for this was removed in commit
77302dbfd85754e02559eccb2dd6c090db0b6b9f and I overlooked its use in
60.ganesha.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoctdbd: Log something when releasing all IPs
Martin Schwenke [Tue, 2 Jul 2013 04:43:17 +0000 (14:43 +1000)]
ctdbd: Log something when releasing all IPs

At the moment this is silent and it can be confusing to see IPs just
disappear.

Also, this message:

  Been in recovery mode for too long. Dropping all IPS

can cause anxiety when all IPs should already have been dropped.
Adding a comforting message saying that 0 IPs were dropped relieves
such anxiety.  :-)

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Minor style improvements for ctdb_reload_remote_public_ips()
Martin Schwenke [Sun, 30 Jun 2013 09:00:36 +0000 (19:00 +1000)]
recoverd: Minor style improvements for ctdb_reload_remote_public_ips()

* Add a variable to the loop to make the code more readable and have
  it generally fit into 80 columns.

* Improve comments.

* Improve log messages.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Clean up log messages in remote IP verification
Martin Schwenke [Sun, 30 Jun 2013 08:45:46 +0000 (18:45 +1000)]
recoverd: Clean up log messages in remote IP verification

The log messages in verify_remote_ip_allocation() are confusing
because they don't include the PNN of the problem node, because it is
not known in this function.

Add the PNN of the node being verified as a function argument and then
shuffle the log messages around to make them clearer.

Also fold 3 nested if statements into just one.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Fix an unclear log message - "Restart recovery process"
Martin Schwenke [Sun, 30 Jun 2013 07:57:33 +0000 (17:57 +1000)]
recoverd: Fix an unclear log message - "Restart recovery process"

When the recovery master notices a node in recovery mode it starts the
recovery process, it doesn't restart it.

Update documentation to match.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Fix an incorrect comment
Martin Schwenke [Sun, 30 Jun 2013 07:53:37 +0000 (17:53 +1000)]
recoverd: Fix an incorrect comment

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Use ctdb_die() on "setup" event failure
Martin Schwenke [Sun, 30 Jun 2013 07:48:01 +0000 (17:48 +1000)]
ctdbd: Use ctdb_die() on "setup" event failure

This is slightly easier to read because it all fits on 1 line.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Avoid a core dump when "init" event fails
Martin Schwenke [Sun, 30 Jun 2013 07:43:52 +0000 (17:43 +1000)]
ctdbd: Avoid a core dump when "init" event fails

The "init" event only really fails in the scripts, which should log
something useful on failure.  Therefore, a core dump isn't terribly
useful and sometimes attracts unwanted attention.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoutil: New function ctdb_die()
Martin Schwenke [Sun, 30 Jun 2013 07:42:11 +0000 (17:42 +1000)]
util: New function ctdb_die()

This is like ctdb_fatal() but exits cleanly without dumping core or
generating a backtrace.

Signed-off-by: Martin Schwenke <martin@meltin.net>
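The difference described above, a clean exit versus a core-dumping
abort, can be observed from a parent process.  The sketch below is
illustrative only: sketch_fatal(), sketch_die() and run_in_child() are
hypothetical stand-ins, not the CTDB functions, and the exit code 1 is
an assumption.

```c
#include <assert.h>
#include <signal.h>       /* SIGABRT, for callers checking the result */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a fatal path: abort() raises SIGABRT, which normally
 * terminates the process with a core dump. */
void sketch_fatal(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "fatal: %s\n", msg);
    abort();
}

/* Stand-in for a clean-exit path: log the message, then exit(). */
void sketch_die(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "die: %s\n", msg);
    exit(1);
}

/* Run fn(msg) in a child process.  Returns the exit code if the child
 * exited normally, minus the signal number if it was killed by a
 * signal, or -1000 on error. */
int run_in_child(void (*fn)(const char *), const char *msg)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1000;
    }
    if (pid == 0) {
        /* Keep the demonstration tidy: no core file from the child. */
        struct rlimit rl = { 0, 0 };
        setrlimit(RLIMIT_CORE, &rl);
        fn(msg);
        _exit(0);   /* not reached: both stand-ins terminate */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1000;
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status);
    }
    if (WIFSIGNALED(status)) {
        return -WTERMSIG(status);
    }
    return -1000;
}
```

With these stand-ins, run_in_child(sketch_die, ...) reports a normal
exit while run_in_child(sketch_fatal, ...) reports death by SIGABRT,
which is the distinction a supervising process or an administrator sees.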
10 years agoeventscripts: When replaying monitor status, don't log empty output
Martin Schwenke [Mon, 24 Jun 2013 09:03:26 +0000 (19:03 +1000)]
eventscripts: When replaying monitor status, don't log empty output

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Release IP callback should fail if the IP is still hosted
Martin Schwenke [Mon, 24 Jun 2013 06:05:03 +0000 (16:05 +1000)]
ctdbd: Release IP callback should fail if the IP is still hosted

At the moment there are (at least) 2 bugs that cause rogue IPs:

* A race where release_ip_callback() runs after a "subsequent" take IP
  has completed.  The IP is back on an interface but we unset
  vnn->iface in the callback.

* A "releaseip" eventscript times out.  We ignore the timeout and call
  it success, deleting the VNN even if the IP is still hosted.

  We could decide not to ignore the timeout and ban the node, but
  killing TCP connections can take a long time and that might result
  in a lot of banning.  We probably won't reinstate banning on
  "releaseip" until killing TCP connections has been optimised.

In both cases, a rogue IP can be avoided by leaving vnn->iface set and
simply failing the control.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoctdbd: Log warnings in release IP when unexpected interface is encountered
Martin Schwenke [Mon, 24 Jun 2013 05:49:48 +0000 (15:49 +1000)]
ctdbd: Log warnings in release IP when unexpected interface is encountered

Previous code changes work around potential problems but do not
provide useful information when a problem occurs.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoping_pong: Validate num_locks argument > 0
Amitay Isaacs [Thu, 4 Jul 2013 07:37:05 +0000 (17:37 +1000)]
ping_pong: Validate num_locks argument > 0

This fixes the floating point error if num_locks = 0.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
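The commit message attributes the floating point error to num_locks = 0.
A plausible reading, stated here as an assumption, is an integer modulo
or division by the lock count, which raises SIGFPE on most platforms
when the count is zero.  A minimal sketch of validating the argument
before it is used as a modulus follows.  parse_num_locks() and
next_slot() are hypothetical names, not taken from ping_pong.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a strictly positive lock count.
 * Returns 0 on success and stores the value, -1 on invalid input. */
int parse_num_locks(const char *arg, int *num_locks)
{
    char *end;
    long v;

    assert(arg != NULL && num_locks != NULL);
    errno = 0;
    v = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0') {
        return -1;              /* not a clean decimal number */
    }
    if (v <= 0 || v > INT_MAX) {
        return -1;              /* zero and negative counts are rejected */
    }
    *num_locks = (int)v;
    return 0;
}

/* Advance to the next lock slot, wrapping around.  The modulo is only
 * safe because num_locks has been validated as > 0. */
int next_slot(int i, int num_locks)
{
    assert(num_locks > 0);
    return (i + 1) % num_locks;
}
```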
10 years agotests: If connection to ctdb daemon fails, exit
Amitay Isaacs [Thu, 4 Jul 2013 07:27:00 +0000 (17:27 +1000)]
tests: If connection to ctdb daemon fails, exit

This fixes the segmentation error if any of the test code fails to
connect to CTDB daemon.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agobuild: Fix compiler warnings for uninitialized variables
Amitay Isaacs [Thu, 4 Jul 2013 07:00:23 +0000 (17:00 +1000)]
build: Fix compiler warnings for uninitialized variables

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Send the result from child process only once
Amitay Isaacs [Thu, 4 Jul 2013 05:36:29 +0000 (15:36 +1000)]
recoverd: Send the result from child process only once

The result has already been sent before the child goes on to wait for
the parent ctdbd process.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: Enable compiler optimizations
Amitay Isaacs [Thu, 4 Jul 2013 05:31:52 +0000 (15:31 +1000)]
packaging: Enable compiler optimizations

This reverts d09570c70551aa40390ce9ceffe7bc234e1afafe.

... hoping the segv has been found in the last 6 years. :-)

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: Allow building RPMs with system tdb/talloc/tevent
Amitay Isaacs [Thu, 4 Jul 2013 05:14:10 +0000 (15:14 +1000)]
packaging: Allow building RPMs with system tdb/talloc/tevent

To build CTDB RPMs with system installed libraries, use the following command:

  ./packaging/RPM/makerpms.sh \
    --with system_talloc \
    --with system_tdb \
    --with system_tevent

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: Do not mark /etc/ctdb/functions as configuration file
Amitay Isaacs [Thu, 4 Jul 2013 04:29:09 +0000 (14:29 +1000)]
packaging: Do not mark /etc/ctdb/functions as configuration file

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: Install README.notify.d using %doc directive
Amitay Isaacs [Thu, 4 Jul 2013 03:19:56 +0000 (13:19 +1000)]
packaging: Install README.notify.d using %doc directive

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

10 years agopackaging: Install docs using %doc directive
Amitay Isaacs [Thu, 4 Jul 2013 02:45:32 +0000 (12:45 +1000)]
packaging: Install docs using %doc directive

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

10 years agopackaging: Remove ctdb_transaction from docdir
Amitay Isaacs [Thu, 4 Jul 2013 01:33:38 +0000 (11:33 +1000)]
packaging: Remove ctdb_transaction from docdir

It's bundled in ctdb-tests package.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agodoc: Add a disclaimer for the EnableBans tunable
Martin Schwenke [Sun, 30 Jun 2013 07:23:08 +0000 (17:23 +1000)]
doc: Add a disclaimer for the EnableBans tunable

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Add banning bug fixes to NEWS
Martin Schwenke [Sun, 30 Jun 2013 07:22:06 +0000 (17:22 +1000)]
doc: Add banning bug fixes to NEWS

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Don't ban self if init or shutdown event fails
Amitay Isaacs [Tue, 2 Jul 2013 02:40:37 +0000 (12:40 +1000)]
ctdbd: Don't ban self if init or shutdown event fails

There is no point in banning the node if init or shutdown event times
out since it's going to quit anyway.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agodoc: The second half of monitoring is only for recovery master
Amitay Isaacs [Thu, 27 Jun 2013 07:46:43 +0000 (17:46 +1000)]
doc: The second half of monitoring is only for recovery master

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: when the recmaster is banned, use that information when forcing an election
Michael Adam [Wed, 26 Jun 2013 07:23:22 +0000 (09:23 +0200)]
recoverd: when the recmaster is banned, use that information when forcing an election

When we trigger an election because the recmaster considers itself inactive,
update our local nodemap with the recmaster's flags before calling
force_election(). This way, we don't send the inactive node freeze commands
(e.g.) that may fail and then lead to ourselves getting banned.

The theory is that this should help avoid banning loops.

Signed-off-by: Michael Adam <obnox@samba.org>
10 years agorecoverd: fix a comment typo
Michael Adam [Wed, 26 Jun 2013 05:11:51 +0000 (07:11 +0200)]
recoverd: fix a comment typo

Signed-off-by: Michael Adam <obnox@samba.org>
10 years agorecoverd: fix a comment in main_loop
Michael Adam [Fri, 21 Jun 2013 15:57:37 +0000 (17:57 +0200)]
recoverd: fix a comment in main_loop

Signed-off-by: Michael Adam <obnox@samba.org>
10 years agorecoverd: eliminate some trailing spaces from ctdb_election_win()
Michael Adam [Fri, 21 Jun 2013 12:06:22 +0000 (14:06 +0200)]
recoverd: eliminate some trailing spaces from ctdb_election_win()

Signed-off-by: Michael Adam <obnox@samba.org>
10 years agorecoverd: Don't continue if the current node gets banned
Martin Schwenke [Fri, 28 Jun 2013 06:31:07 +0000 (16:31 +1000)]
recoverd: Don't continue if the current node gets banned

A banned node cannot continue with recovery or cluster monitoring.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agorecoverd: Refactor code to ban misbehaving nodes
Amitay Isaacs [Fri, 28 Jun 2013 04:31:02 +0000 (14:31 +1000)]
recoverd: Refactor code to ban misbehaving nodes

Since we have nodemap information, there is no need to hardcode the
limit of 20.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

10 years agorecoverd: Move code to ban other nodes after we get local node flags
Amitay Isaacs [Thu, 27 Jun 2013 06:01:16 +0000 (16:01 +1000)]
recoverd: Move code to ban other nodes after we get local node flags

If a node gets banned first, then it should not ban other nodes.

This code was moved up in main_loop to avoid waiting for nodemap
from other nodes (commit 83b0261f2cb453195b86f547d360400103a8b795).

To prevent a banned node from banning other nodes, we need to first get
nodemap information from local node, so trying to ban other nodes can
fail if we are already banned.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Delay the initial election if node is started in stopped state
Amitay Isaacs [Thu, 27 Jun 2013 05:44:27 +0000 (15:44 +1000)]
recoverd: Delay the initial election if node is started in stopped state

Since there is an early exit if a node is stopped or banned, we can wait till
the node becomes active to start initial election.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Update capabilities only if the current node is active
Amitay Isaacs [Thu, 27 Jun 2013 05:33:49 +0000 (15:33 +1000)]
recoverd: Update capabilities only if the current node is active

Since we do an early return if a node is stopped or banned, move update
capabilities code below the early return and just before we check the
capabilities of current recovery master.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: No need to check if node is recovery master when inactive
Amitay Isaacs [Thu, 27 Jun 2013 05:46:04 +0000 (15:46 +1000)]
recoverd: No need to check if node is recovery master when inactive

If a node is stopped or banned, it will cause early return from the
main_loop, so this check is redundant.  The election will be called by an
active node.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Always do an early exit from main_loop if node is stopped or banned
Amitay Isaacs [Thu, 27 Jun 2013 05:39:15 +0000 (15:39 +1000)]
recoverd: Always do an early exit from main_loop if node is stopped or banned

A stopped or banned node cannot do anything useful.  So do not participate
in any cluster activity and do not cause any unnecessary network traffic.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Do not set banning credits on a node if current node is inactive
Amitay Isaacs [Fri, 28 Jun 2013 04:10:47 +0000 (14:10 +1000)]
recoverd: Do not set banning credits on a node if current node is inactive

If the current node is banned or stopped, then it should not assign banning
credits to other nodes since the current node will not have up-to-date flags
of other nodes.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agobanning: Do not come out of ban if databases are not frozen
Amitay Isaacs [Mon, 1 Jul 2013 07:40:36 +0000 (17:40 +1000)]
banning: Do not come out of ban if databases are not frozen

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agobanning: No need to check if banned pnn is for local node
Amitay Isaacs [Mon, 24 Jun 2013 04:33:32 +0000 (14:33 +1000)]
banning: No need to check if banned pnn is for local node

If the banned pnn is not the local node, the function returns early.
So no need for additional check.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agobanning: Make ctdb_local_node_got_banned() a void function
Amitay Isaacs [Fri, 28 Jun 2013 04:04:18 +0000 (14:04 +1000)]
banning: Make ctdb_local_node_got_banned() a void function

When this function is called, we are already committed to banning
and there is no point in failing this function.  If freezing of
databases fails, it will be fixed by the recovery daemon.

10 years agorecoverd: Also check if current node is in recovery when it is banned
Amitay Isaacs [Fri, 28 Jun 2013 04:02:44 +0000 (14:02 +1000)]
recoverd: Also check if current node is in recovery when it is banned

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Set node_flags information as soon as we get nodemap
Amitay Isaacs [Fri, 28 Jun 2013 04:09:35 +0000 (14:09 +1000)]
recoverd: Set node_flags information as soon as we get nodemap

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Remove old comment as the code corresponding to that has gone away
Amitay Isaacs [Wed, 26 Jun 2013 06:02:23 +0000 (16:02 +1000)]
recoverd: Remove old comment as the code corresponding to that has gone away

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agobanning: Log ban state changes for other nodes at higher debug level
Amitay Isaacs [Mon, 24 Jun 2013 04:31:50 +0000 (14:31 +1000)]
banning: Log ban state changes for other nodes at higher debug level

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agofreeze: Make ctdb_start_freeze() a void function
Amitay Isaacs [Mon, 1 Jul 2013 06:28:04 +0000 (16:28 +1000)]
freeze: Make ctdb_start_freeze() a void function

If this function fails due to memory errors, there is no way to recover.
The best course of action is to abort.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agofreeze: If priority is invalid here, it's time to abort
Amitay Isaacs [Mon, 1 Jul 2013 06:21:00 +0000 (16:21 +1000)]
freeze: If priority is invalid here, it's time to abort

ctdb_start_freeze() is called from ctdb_control_freeze() which fixes the
priority if it's 0 and returns an error if it's invalid.  Other callers of
ctdb_start_freeze() are internal to CTDB.  So if priority is invalid in
ctdb_start_freeze(), definitely something is seriously wrong.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
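The reasoning above, validate at the control boundary and abort on an
internal invariant violation, can be sketched as follows.  This is not
CTDB's code: the sketch_ names are hypothetical, and both the number of
priorities (3) and the default used when the priority is 0 (1) are
assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_NUM_PRIORITIES 3u   /* assumed number of priorities */

/* Index 0 is unused; valid priorities are 1..SKETCH_NUM_PRIORITIES. */
static bool sketch_frozen[SKETCH_NUM_PRIORITIES + 1];

/* Internal entry point: callers are trusted, so an invalid priority
 * means a programming error and the process aborts. */
void sketch_start_freeze(unsigned int priority)
{
    if (priority < 1 || priority > SKETCH_NUM_PRIORITIES) {
        abort();
    }
    sketch_frozen[priority] = true;
}

/* External entry point: input comes from a control message, so it is
 * normalised and validated, and errors are reported to the sender. */
int sketch_control_freeze(unsigned int priority)
{
    if (priority == 0) {
        priority = 1;              /* assumed default */
    }
    if (priority > SKETCH_NUM_PRIORITIES) {
        return -1;
    }
    sketch_start_freeze(priority);
    return 0;
}

/* Query helper for the sketch. */
bool sketch_is_frozen(unsigned int priority)
{
    assert(priority >= 1 && priority <= SKETCH_NUM_PRIORITIES);
    return sketch_frozen[priority];
}
```

Untrusted input gets an error return, while a bad value arriving from
trusted internal callers stops the process at the point where the bug
is visible.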
10 years agofreeze: Log message from ctdb_start_freeze() and ctdb_control_freeze()
Amitay Isaacs [Mon, 1 Jul 2013 03:26:33 +0000 (13:26 +1000)]
freeze: Log message from ctdb_start_freeze() and ctdb_control_freeze()

This ensures that whenever databases are frozen either via sending
control or by calling ctdb_start_freeze(), the action is logged.
Since ctdb_control_freeze() calls ctdb_start_freeze(), move logging of
message in early return condition if databases are already frozen.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Print banning message only after verifying pnn
Amitay Isaacs [Mon, 24 Jun 2013 04:18:58 +0000 (14:18 +1000)]
recoverd: Print banning message only after verifying pnn

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: When updating flags on nodes, send updated flags and not old flags
Amitay Isaacs [Wed, 26 Jun 2013 05:22:46 +0000 (15:22 +1000)]
recoverd: When updating flags on nodes, send updated flags and not old flags

This was broken by commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa.
Instead of a SRVID_SET_NODE_FLAGS message to recovery daemon, a control
was sent to the local daemon which in turn informed the recovery daemon.
And while doing this change old flags were sent via CONTROL_MODIFY_FLAGS.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotools/ctdb: Add "force" option to "recover" command
Martin Schwenke [Wed, 26 Jun 2013 04:34:47 +0000 (14:34 +1000)]
tools/ctdb: Add "force" option to "recover" command

At the moment there is no easy way to force a recovery when attempting
to reproduce certain classes of bugs.  This option is added without
documentation because it is dangerous until the bugs are fixed!  :-)

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoclient: Exit with non-zero status when unix socket is closed
Amitay Isaacs [Mon, 24 Jun 2013 07:37:15 +0000 (17:37 +1000)]
client: Exit with non-zero status when unix socket is closed

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agodoc: Fix ctdb ping entry in manpage
Martin Schwenke [Fri, 21 Jun 2013 04:49:20 +0000 (14:49 +1000)]
doc: Fix ctdb ping entry in manpage

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Fix documentation for NoIPTakeover in ctdbd manpage
Martin Schwenke [Fri, 21 Jun 2013 04:47:20 +0000 (14:47 +1000)]
doc: Fix documentation for NoIPTakeover in ctdbd manpage

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Update notification script section in ctdbd manpage
Martin Schwenke [Fri, 21 Jun 2013 04:33:12 +0000 (14:33 +1000)]
doc: Update notification script section in ctdbd manpage

The example notification script is now much more useful.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Add nodestatus command to the ctdb manpage
Martin Schwenke [Fri, 21 Jun 2013 04:32:50 +0000 (14:32 +1000)]
doc: Add nodestatus command to the ctdb manpage

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Update NEWS
Martin Schwenke [Fri, 21 Jun 2013 00:52:05 +0000 (10:52 +1000)]
doc: Update NEWS

Signed-off-by: Martin Schwenke <martin@meltin.net>