ctdb.git
11 years agodoc: getlog and clearlog changes for recovery daemon logs
Martin Schwenke [Mon, 22 Oct 2012 01:19:07 +0000 (12:19 +1100)]
doc: getlog and clearlog changes for recovery daemon logs

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: Local daemons should use the logging ringbuffer
Martin Schwenke [Thu, 18 Oct 2012 03:15:09 +0000 (14:15 +1100)]
tests: Local daemons should use the logging ringbuffer

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Merge recoverd log handling into getlog/clearlog
Martin Schwenke [Thu, 18 Oct 2012 03:13:30 +0000 (14:13 +1100)]
tools/ctdb: Merge recoverd log handling into getlog/clearlog

We don't need extra commands for these.

Also, allow a default value of NOTICE for the getlog level.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Add log ringbuffer handling for recoverd
Martin Schwenke [Tue, 16 Oct 2012 09:57:31 +0000 (20:57 +1100)]
tools/ctdb: Add log ringbuffer handling for recoverd

This adds commands rdgetlog and rdclearlog

These are analogous to getlog and clearlog but operate on the logs for
the recovery daemon.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: Add CTDB_SRVID_GETLOG and CTDB_SRVID_CLEARLOG
Martin Schwenke [Tue, 16 Oct 2012 09:54:39 +0000 (20:54 +1100)]
recoverd: Add CTDB_SRVID_GETLOG and CTDB_SRVID_CLEARLOG

These support getting and clearing logs from the ring-buffer in the
recovery daemon.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agobuild: Set CTDB_PATH to /tmp/ctdb.socket if SOCKPATH is not defined
Amitay Isaacs [Sun, 21 Oct 2012 22:01:27 +0000 (09:01 +1100)]
build: Set CTDB_PATH to /tmp/ctdb.socket if SOCKPATH is not defined

When building samba with CTDB, if samba configure/waf does not support
setting of SOCKPATH, fall back to /tmp/ctdb.socket.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoBuild: Set the default ctdb socket path at configure time
David Disseldorp [Thu, 18 Oct 2012 14:55:19 +0000 (16:55 +0200)]
Build: Set the default ctdb socket path at configure time

The ctdb socket path currently defaults to /tmp/ctdb.socket and can be
modified at runtime using the --socket=filename option, common to both
ctdb and ctdbd binaries.

This change allows the default path to be set at configure time using
the --with-socketpath=FILE argument. When not specified, the default
path remains /tmp/ctdb.socket, so the documentation remains unchanged.

Signed-off-by: David Disseldorp <ddiss@samba.org>
11 years agolocking: Do not use ctdb_kill() to kill smbd processes
Amitay Isaacs [Tue, 25 Sep 2012 07:29:50 +0000 (17:29 +1000)]
locking: Do not use ctdb_kill() to kill smbd processes

ctdb_kill() is used to terminate processes spawned by CTDB.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agolocking: Add database priority handling for older versions of samba
Amitay Isaacs [Wed, 11 Jul 2012 05:15:41 +0000 (15:15 +1000)]
locking: Add database priority handling for older versions of samba

In samba versions 3.6.x and older, database priorities are not set.
later_db() function implements higher database priority (locking order)
for these databases -
   brlock, g_lock, notify_onelevel, serverid, xattr_tdb

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
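
The ordering described in the commit above can be sketched as follows. This is an illustrative model only, not the CTDB source: the database names come from the commit message, while the substring match and the reading of "later" as "locked after the other databases" are assumptions.

```python
# Illustrative sketch only; not the CTDB implementation.
# The names are those listed in the commit message.
LATER_DB_NAMES = ("brlock", "g_lock", "notify_onelevel", "serverid", "xattr_tdb")


def later_db(name):
    """Return True if the database should come later in the locking order.

    Matching by substring is an assumption made for this sketch.
    """
    return any(marker in name for marker in LATER_DB_NAMES)


def locking_order(db_names):
    """Sort databases so that 'later' databases come last.

    sorted() is stable, so the relative order within each group is kept.
    """
    return sorted(db_names, key=later_db)
```

For example, locking_order(["brlock.tdb", "locking.tdb"]) places locking.tdb before brlock.tdb.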
11 years agolocking: Schedule a new lock request every time a lock is released
Amitay Isaacs [Mon, 9 Jul 2012 07:37:35 +0000 (17:37 +1000)]
locking: Schedule a new lock request every time a lock is released

Since the number of active lock requests is limited to
MAX_LOCK_PROCESSES_PER_DB (= 100), new requests might not get scheduled
when they are created. So schedule a pending request once the current
active request is done.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
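
A minimal sketch of the scheduling rule in the commit above, assuming a simple FIFO of pending requests. The class and method names are hypothetical; only the limit MAX_LOCK_PROCESSES_PER_DB and its value come from the commit message.

```python
from collections import deque

MAX_LOCK_PROCESSES_PER_DB = 100  # value quoted in the commit message


class LockScheduler:
    """Hypothetical model of a per-database lock request queue."""

    def __init__(self, limit=MAX_LOCK_PROCESSES_PER_DB):
        self.limit = limit
        self.active = set()     # requests currently being served
        self.pending = deque()  # requests waiting for a free slot

    def request(self, req):
        """Queue a new request and try to schedule it immediately."""
        self.pending.append(req)
        self._schedule()

    def release(self, req):
        """Finish a request, then schedule a pending one.

        Scheduling here is the step the commit describes: without it, a
        request queued while the limit was reached would never start.
        """
        self.active.discard(req)
        self._schedule()

    def _schedule(self):
        # Start pending requests while there is room under the limit.
        while self.pending and len(self.active) < self.limit:
            self.active.add(self.pending.popleft())
```

With limit=2, a third request stays pending until one of the first two is released.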
11 years agoctdbd: Replace lockwait with locking API and remove ctdb_lockwait.c
Amitay Isaacs [Thu, 14 Jun 2012 06:12:48 +0000 (16:12 +1000)]
ctdbd: Replace lockwait with locking API and remove ctdb_lockwait.c

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoctdb_recover: Replace static locking functions with locking API
Amitay Isaacs [Wed, 9 May 2012 05:17:21 +0000 (15:17 +1000)]
ctdb_recover: Replace static locking functions with locking API

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoctdb_freeze: Replace locking functions with locking API
Amitay Isaacs [Wed, 9 May 2012 05:09:51 +0000 (15:09 +1000)]
ctdb_freeze: Replace locking functions with locking API

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoctdbd_test: Include ctdb_lock.c code for test stubs
Amitay Isaacs [Wed, 9 May 2012 05:10:20 +0000 (15:10 +1000)]
ctdbd_test: Include ctdb_lock.c code for test stubs

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agotests: Fix statistics test for new output lines from locking API
Amitay Isaacs [Thu, 17 May 2012 05:25:46 +0000 (15:25 +1000)]
tests: Fix statistics test for new output lines from locking API

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agotools/ctdb: Display the locking statistics
Amitay Isaacs [Wed, 9 May 2012 02:58:19 +0000 (12:58 +1000)]
tools/ctdb: Display the locking statistics

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoctdbd: locking: Provide non-blocking API for locking of TDB record/db/alldb
Amitay Isaacs [Thu, 11 Oct 2012 00:29:29 +0000 (11:29 +1100)]
ctdbd: locking: Provide non-blocking API for locking of TDB record/db/alldb

This introduces a consistent API for handling locks on a single record, a
complete db or all dbs. The locks are taken out in a child process. In case of
a timeout, find the processes that currently hold the lock and log them.

Callback functions for locking requests take a locked boolean to indicate
whether the lock was successfully obtained or not.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
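
The callback contract described in the commit above can be illustrated with a small sketch. It is not the CTDB API: a thread stands in for the child process, and the function name lock_async and its parameters are assumptions. The only part taken from the commit is that the callback receives a boolean saying whether the lock was obtained.

```python
import threading


def lock_async(lock, callback, timeout):
    """Try to take `lock` without blocking the caller.

    A worker thread (standing in for CTDB's child process) waits up to
    `timeout` seconds, then calls callback(locked) with True or False.
    On success the lock stays held; releasing it is the caller's job.
    """
    def worker():
        locked = lock.acquire(timeout=timeout)
        callback(locked)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread
```

The caller returns immediately and learns the outcome only through the callback, which is why a failed attempt has to be reported as locked=False rather than raised as an error.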
11 years agocommon: Add routines to get process and lock information
Amitay Isaacs [Wed, 6 Jun 2012 01:50:25 +0000 (11:50 +1000)]
common: Add routines to get process and lock information

Currently these functions are implemented only for Linux.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoheader: Added DB statistics update macros
Amitay Isaacs [Wed, 9 May 2012 02:56:53 +0000 (12:56 +1000)]
header: Added DB statistics update macros

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoscripts: Refactor logging code in initscript and functions file
Martin Schwenke [Tue, 16 Oct 2012 06:04:48 +0000 (17:04 +1100)]
scripts: Refactor logging code in initscript and functions file

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb_diagnostics: Add "ctdb listvars" output
Martin Schwenke [Thu, 11 Oct 2012 05:21:02 +0000 (16:21 +1100)]
tools/ctdb_diagnostics: Add "ctdb listvars" output

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoinitscript: Check that rc.ctdb is executable before running it
Martin Schwenke [Thu, 11 Oct 2012 05:18:26 +0000 (16:18 +1100)]
initscript: Check that rc.ctdb is executable before running it

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoctdbd: Remove references to forcing running of eventscripts from log messages
Martin Schwenke [Thu, 11 Oct 2012 05:10:19 +0000 (16:10 +1100)]
ctdbd: Remove references to forcing running of eventscripts from log messages

Running of eventscripts can be initiated from many places, including
the recovery daemon.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: Clarify some misleading log messages
Martin Schwenke [Thu, 11 Oct 2012 04:59:00 +0000 (15:59 +1100)]
recoverd: Clarify some misleading log messages

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Remove extra header from natgwlist -Y output
Martin Schwenke [Thu, 11 Oct 2012 04:49:13 +0000 (15:49 +1100)]
tools/ctdb: Remove extra header from natgwlist -Y output

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: Verifying local IPs should only check for unhosted available IPs
Martin Schwenke [Thu, 11 Oct 2012 04:17:54 +0000 (15:17 +1100)]
recoverd: Verifying local IPs should only check for unhosted available IPs

Currently it checks for unhosted IPs among the known IPs rather than
available IPs.  This means that a takeover run can be flagged even
when that takeover run will be unable to assign a known, unhosted IP.

Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
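
The distinction in the commit above can be sketched as a set check. This is an illustrative model, not the recovery daemon code: the function name and the representation of IPs as plain strings are assumptions.

```python
def needs_takeover_run(available_ips, hosted_ips):
    """Return True if some IP this node could host is not hosted anywhere.

    Only *available* IPs are examined. A known IP that is unhosted but not
    available cannot be assigned by a takeover run, so flagging it would
    only trigger a run that cannot fix anything.
    """
    hosted = set(hosted_ips)
    return any(ip not in hosted for ip in available_ips)
```

Under this model, a known IP that is unhosted but absent from available_ips no longer causes a takeover run to be flagged.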
11 years agoRevert "Eventscripts - add facility to 10.interface to delete unmanaged IPs"
Martin Schwenke [Thu, 11 Oct 2012 03:34:37 +0000 (14:34 +1100)]
Revert "Eventscripts - add facility to 10.interface to delete unmanaged IPs"

This reverts commit 88f88d86b0d08240f749fb721b8c401c2eeb1099.

This is dangerous and, on reflection, I can't see it being useful.
There are often permanent IPs on interfaces that CTDB shares with its
public IPs.

11 years agoEventscripts: "recovered" event should not fail on NATGW failure
Martin Schwenke [Wed, 26 Sep 2012 04:37:49 +0000 (14:37 +1000)]
Eventscripts: "recovered" event should not fail on NATGW failure

The recovery process has no protection against the "recovered" event
failing, so this can cause a recovery loop.

Instead of failing the "recovered" event, add a "monitor" event and
fail that instead.  In this case the failure semantics are well
defined.

A separate patch should ban nodes if the "recovered" event fails for
an unknown reason.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoLogging: Map TEVENT_DEBUG_FATAL to DEBUG_CRIT
Martin Schwenke [Thu, 27 Sep 2012 23:39:12 +0000 (09:39 +1000)]
Logging: Map TEVENT_DEBUG_FATAL to DEBUG_CRIT

This is currently mapped to DEBUG_EMERG.  CTDB really has no business
logging anything at EMERG level since the whole system is not about to
abort or catch fire.  EMERG causes the message to appear on the
console and on every terminal.  That's a bit overzealous!

There would be very few situations where logs are being filtered at a
level below ERROR, so CRIT should certainly suffice.

The trigger for this was curious messages saying "No event for <n>
seconds!" logged in a user's terminal.

Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
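
As a rough illustration of the mapping change above, the sketch below models the level translation as a lookup table. Only the TEVENT_DEBUG_FATAL to DEBUG_CRIT entry comes from the commit message; the other three entries and the fallback value are assumptions included to make the table complete.

```python
# Illustrative sketch only; levels are represented by their names.
TEVENT_TO_CTDB_LEVEL = {
    "TEVENT_DEBUG_FATAL": "DEBUG_CRIT",       # from the commit (was DEBUG_EMERG)
    "TEVENT_DEBUG_ERROR": "DEBUG_ERR",        # assumption
    "TEVENT_DEBUG_WARNING": "DEBUG_WARNING",  # assumption
    "TEVENT_DEBUG_TRACE": "DEBUG_DEBUG",      # assumption
}


def map_tevent_level(level, default="DEBUG_ERR"):
    """Translate a tevent debug level name into a CTDB debug level name."""
    return TEVENT_TO_CTDB_LEVEL.get(level, default)
```

Because CRIT is still more severe than ERROR, a filter set at ERROR or below continues to pass these messages, while they no longer reach the console and every terminal.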
11 years agocommon: Debug ctdb_addr_to_str() using new function ctdb_external_trace()
Martin Schwenke [Thu, 6 Sep 2012 10:22:38 +0000 (20:22 +1000)]
common: Debug ctdb_addr_to_str() using new function ctdb_external_trace()

We've seen this function report "Unknown family, 0" and then CTDB
disappeared without a trace.  If we can reproduce it then this might
help us to debug it.

The idea is that you do something like the following in /etc/sysconfig/ctdb:

  export CTDB_EXTERNAL_TRACE="/etc/ctdb/config/gcore_trace.sh"

When we hit this error then we call out to gcore to get a core file so
we can do forensics.  This might block CTDB for a few seconds.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoconfig/functions: fix a comment
Michael Adam [Wed, 17 Oct 2012 12:21:33 +0000 (14:21 +0200)]
config/functions: fix a comment

ctdb_check_counter_limits does not fail but succeeds if count >= limit

Signed-off-by: Michael Adam <obnox@samba.org>
11 years agodoc: Add info about execute permissions on event scripts
Amitay Isaacs [Wed, 17 Oct 2012 00:38:37 +0000 (11:38 +1100)]
doc: Add info about execute permissions on event scripts

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agodoc: Fix documentation for setup event
Amitay Isaacs [Wed, 17 Oct 2012 00:38:59 +0000 (11:38 +1100)]
doc: Fix documentation for setup event

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoscripts: Remove duplicate code from init script to set tunables
Amitay Isaacs [Mon, 3 Sep 2012 02:39:36 +0000 (12:39 +1000)]
scripts: Remove duplicate code from init script to set tunables

The tunable variables defined in the CTDB configuration file are currently
set up from the init script as well as in the "setup" event in the 00.ctdb
eventscript.  Remove the duplication of this code and set tunable
variables only from the setup event.  During the "setup" event, it's possible
that ctdb tool commands can time out if the CTDB daemon is not ready.  To
guard against this, wait until the "ctdb ping" command succeeds before
executing any other ctdb tool commands.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
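
The "wait for ping" guard described above might look like the following sketch. The real logic lives in a shell eventscript; this Python version is only a model, and the retry count, delay and function names are assumptions. The "ctdb ping" command itself is the one named in the commit message.

```python
import subprocess
import time


def ctdb_ping():
    """Return True if `ctdb ping` exits successfully."""
    try:
        result = subprocess.run(
            ["ctdb", "ping"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # the ctdb tool is not installed or not runnable
    return result.returncode == 0


def wait_until_ready(ping=ctdb_ping, attempts=60, delay=1.0, sleep=time.sleep):
    """Poll until the daemon answers a ping; give up after `attempts` tries."""
    for _ in range(attempts):
        if ping():
            return True
        sleep(delay)
    return False
```

Tunables would then be set only after wait_until_ready() returns True, so the tool commands that follow are not issued against a daemon that cannot answer yet.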
11 years agodoc: Fix the hyperlink for "Testing CTDB" page
Amitay Isaacs [Wed, 17 Oct 2012 00:24:57 +0000 (11:24 +1100)]
doc: Fix the hyperlink for "Testing CTDB" page

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agotests/eventscripts: add unit tests for policy routing reconfigure
Martin Schwenke [Wed, 10 Oct 2012 04:03:06 +0000 (15:03 +1100)]
tests/eventscripts: add unit tests for policy routing reconfigure

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscripts: add extra infrastructure for policy routing tests
Martin Schwenke [Wed, 10 Oct 2012 03:48:59 +0000 (14:48 +1100)]
tests/eventscripts: add extra infrastructure for policy routing tests

Less copying and pasting is a good thing...

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoEventscripts: Add support for "reconfigure" pseudo-event for policy routing
Martin Schwenke [Fri, 3 Aug 2012 00:54:30 +0000 (10:54 +1000)]
Eventscripts: Add support for "reconfigure" pseudo-event for policy routing

This rebuilds all policy routes and can be used if the configuration
changes.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: Track failure of "recovered" event, banning culprits
Martin Schwenke [Mon, 24 Sep 2012 04:32:04 +0000 (14:32 +1000)]
recoverd: Track failure of "recovered" event, banning culprits

Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: When starting a takeover run disable IP verification
Martin Schwenke [Thu, 30 Aug 2012 23:34:17 +0000 (09:34 +1000)]
recoverd: When starting a takeover run disable IP verification

Disable for TakeoverTimeout seconds.

Otherwise the recovery daemon can get overzealous and start trying
to add/delete addresses that it thinks are missing but where the
eventscript just hasn't finished.  This didn't use to matter so much
but it is more important now that concurrent takeip/releaseip/updateip
events generate errors - we want to avoid spamming the log.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoctdbd: Stop takeovers and releases from colliding in mid-air
Martin Schwenke [Wed, 11 Jul 2012 04:46:07 +0000 (14:46 +1000)]
ctdbd: Stop takeovers and releases from colliding in mid-air

There's a race here where release and takeover events for an IP can
run at the same time.  For example, a "ctdb deleteip" and a takeover
initiated by the recovery daemon.  The timeline is as follows:

1. The release code registers a callback to update the VNN.  The
   callback is executed *after* the eventscripts run the releaseip
   event.

2. The release code calls the eventscripts for the releaseip event,
   removing IP from its interface.

   The takeover code "updates" the VNN saying that IP is on some
   iface.... even if/though the address is already there.

3. The release callback runs, removing the iface associated with IP in
   the VNN.

   The takeover code calls the eventscripts for the takeip event,
   adding IP to an interface.

As a result, CTDB doesn't think it should be hosting IP but IP is on
an interface.  The recovery daemon fixes this later... but it
shouldn't happen.

This patch can cause some additional noise in the logs:

  Release of IP 10.0.2.133/24 on interface eth2  node:2
  recoverd:We are still serving a public address '10.0.2.133' that we should not be serving. Removing it.
  Release of IP 10.0.2.133/24 rejected update for this IP already in flight
  recoverd:client/ctdb_client.c:2455 ctdb_control for release_ip failed
  recoverd:Failed to release local ip address

In this case the node has started releasing an IP when the recovery
daemon notices the address is still hosted and initiates another
release.  This noise is harmless but annoying.

Signed-off-by: Martin Schwenke <martin@meltin.net>
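
The guard implied by the "already in flight" log message above can be sketched as a per-IP flag. This is a simplified model, not the ctdbd code: the class, attribute and function names are hypothetical.

```python
class PublicIP:
    """Hypothetical stand-in for the per-address state (the VNN)."""

    def __init__(self, addr):
        self.addr = addr
        self.iface = None             # interface CTDB believes hosts the IP
        self.update_in_flight = False


def start_update(ip):
    """Begin a takeover or release; refuse if one is already running."""
    if ip.update_in_flight:
        return False  # rejected: update for this IP already in flight
    ip.update_in_flight = True
    return True


def finish_update(ip, iface):
    """Record the result once the eventscript has completed."""
    ip.iface = iface
    ip.update_in_flight = False
```

Because the second caller is rejected rather than interleaved, the recorded interface and the state left by the eventscripts cannot diverge in the way the timeline describes.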
11 years agoctdbd: New tunable NoIPTakeoverOnDisabled
Martin Schwenke [Tue, 28 Aug 2012 05:17:29 +0000 (15:17 +1000)]
ctdbd: New tunable NoIPTakeoverOnDisabled

Stops the behaviour where unhealthy nodes can host IPs when there are
no healthy nodes.  Set this to 1 when an immediate complete outage is
preferred when all nodes are unhealthy.  The alternative
(i.e. default) can lead to undefined behaviour when the shared
filesystem is unavailable.

Signed-off-by: Martin Schwenke <martin@meltin.net>
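
A sketch of the behaviour the tunable above selects, assuming nodes are represented as a mapping from node name to a healthy flag. The function name and data layout are illustrative, not taken from ctdbd.

```python
def candidate_hosts(nodes, no_ip_takeover_on_disabled=0):
    """Return the nodes that may host public IPs.

    nodes maps a node name to True (healthy) or False (unhealthy).
    """
    healthy = [name for name, ok in nodes.items() if ok]
    if healthy:
        return healthy
    if no_ip_takeover_on_disabled:
        return []        # complete outage preferred: nobody hosts IPs
    return list(nodes)   # default: unhealthy nodes may still host IPs
```

With the tunable set to 1 and every node unhealthy, the list is empty and no IPs are hosted.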
11 years agoEventscripts: Add service-start and service-stop pseudo-events
Martin Schwenke [Tue, 21 Aug 2012 05:52:03 +0000 (15:52 +1000)]
Eventscripts: Add service-start and service-stop pseudo-events

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoctdbd: Avoid unnecessary updateip event
Martin Schwenke [Wed, 15 Aug 2012 05:28:14 +0000 (15:28 +1000)]
ctdbd: Avoid unnecessary updateip event

The existing code makes one fatally bad assumption:
vnn->iface->references can never be -1 (or max uint32_t in this case).
Right now the reference counting is broken so a reference count of -1
is possible and causes a spurious updateip when vnn->iface is the same
as best_iface.  This can occur frequently because we get a lot of
redundant takeovers, especially when each IP can only be hosted on one
interface.

This makes the code much more defensive by noting that when best_iface
is the same as vnn->iface there is never a need for an updateip event.
This effectively neuters the updateip code path when IPs can only be
hosted by a single interface.

This should obsolete 6a74515f0a1e24d97cee3ba05d89133aac7ad2b7.

Signed-off-by: Martin Schwenke <martin@meltin.net>
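
The defensive check described above reduces to a comparison of the two interfaces. The sketch below is a model only: the function name is hypothetical, and the choice of a takeip event when no interface is currently assigned is an assumption.

```python
def event_for_takeover(current_iface, best_iface):
    """Decide which event a takeover needs, if any.

    current_iface models vnn->iface and best_iface models best_iface.
    """
    if current_iface is None:
        return "takeip"    # assumption: IP not yet on any interface
    if best_iface == current_iface:
        return None        # same interface: never an updateip
    return "updateip"      # IP must move to a different interface
```

The decision no longer depends on the reference count at all, which is what makes it robust against a count of -1.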
11 years agoCorrect include for ctdb_protocol.h
Volker Lendecke [Tue, 9 Oct 2012 09:39:58 +0000 (11:39 +0200)]
Correct include for ctdb_protocol.h

With an old ctdb_protocol.h installed under /usr/local, ctdb will
not compile because the <> form of include will find the header
under /usr/local

11 years agoRevert "when creating/adding a public ip, set the initial interface to be the first...
Amitay Isaacs [Thu, 20 Sep 2012 07:10:34 +0000 (17:10 +1000)]
Revert "when creating/adding a public ip, set the initial interface to be the first interface specified"

This reverts commit 4308935ba48ac7a29e7523315acf580019715f0f.

This fixes 16_ctdb_config_add_ip.sh test when run against local daemons. When
running against local daemons, if the interface is assigned as soon as an IP is
added, then takeover would never assign this IP address.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agoutil: ctdb_fork() closes all sockets opened by the main daemon
Martin Schwenke [Tue, 2 Oct 2012 01:51:24 +0000 (11:51 +1000)]
util: ctdb_fork() closes all sockets opened by the main daemon

Do some other housekeeping including stopping tevent.

Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoeventscripts: Auto-start/stop services in background
Martin Schwenke [Mon, 3 Sep 2012 05:37:01 +0000 (15:37 +1000)]
eventscripts: Auto-start/stop services in background

If $CTDB_SERVICE_AUTOSTARTSTOP="yes" then service start/stop is done
in the background with logging.

Fix some unit tests for samba and winbind.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoEventscripts: split 50.samba into 49.winbind and 50.samba
Martin Schwenke [Thu, 16 Aug 2012 04:41:11 +0000 (14:41 +1000)]
Eventscripts: split 50.samba into 49.winbind and 50.samba

winbind and samba can be separately managed.  This makes the service
starting and stopping code way too complicated, and even adds a small
amount of complexity to the monitoring code.  The sensible option is
to split this eventscript in two.

There are two potentially backward incompatible changes here:

* Functionality has been removed that allowed 50.samba to manage
  winbind when CTDB_MANAGES_WINBIND was unset but the smb.conf
  "security" parameter was set to "ADS" or "DOMAIN".

  Maintaining this functionality would have required moving the
  testparm-related code to the functions file, deciding where the
  cache file should go, and then calling it from both 49.winbind and
  50.samba.  This feature wasn't of great value and asking
  administrators to set an extra variable in exchange for code
  simplicity seems like a reasonable deal.

* External code will need to be changed if it calls 50.samba directly
  with winbind-related expectations.  This is fairly obvious!

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoInitscript: Kill any existing ctdbd processes if the ping succeeds
Martin Schwenke [Tue, 21 Aug 2012 04:28:37 +0000 (14:28 +1000)]
Initscript: Kill any existing ctdbd processes if the ping succeeds

Initialising a new ctdbd will destroy the Unix domain socket so
existing processes will be useless anyway.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Free the event context
Martin Schwenke [Mon, 20 Aug 2012 05:02:24 +0000 (15:02 +1000)]
tools/ctdb: Free the event context

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agolibctdb: Add comments to effect that some controls return result in status
Martin Schwenke [Mon, 20 Aug 2012 04:30:35 +0000 (14:30 +1000)]
libctdb: Add comments to effect that some controls return result in status

These controls include:

  CTDB_CONTROL_GET_RECMODE
  CTDB_CONTROL_GET_RECMASTER
  CTDB_CONTROL_GET_PID
  CTDB_CONTROL_GET_PNN
  CTDB_CONTROL_PING
  CTDB_CONTROL_GET_DB_PRIORITY

In these cases the data field is empty.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/tool: New tests for natgwlist, getcapabilities, lvs, lvsmaster
Martin Schwenke [Wed, 18 Jul 2012 07:05:03 +0000 (17:05 +1000)]
tests/tool: New tests for natgwlist, getcapabilities, lvs, lvsmaster

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/tool: New function setup_natgw() to setup $CTDB_NATGW_NODES
Martin Schwenke [Wed, 18 Jul 2012 07:02:38 +0000 (17:02 +1000)]
tests/tool: New function setup_natgw() to setup $CTDB_NATGW_NODES

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Clean up control_natgw()
Martin Schwenke [Wed, 18 Jul 2012 06:59:19 +0000 (16:59 +1000)]
tools/ctdb: Clean up control_natgw()

* Factor out repeated code into new function find_natgw()
* Support both machine and human readable output
* Use libctdb

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Convert some commands over to libctdb
Martin Schwenke [Wed, 18 Jul 2012 06:57:01 +0000 (16:57 +1000)]
tools/ctdb: Convert some commands over to libctdb

control_getcapabilities(), control_lvs(), control_lvsmaster() updated
to use ctdb_getcapabilities(), ctdb_getnodemap() as appropriate.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: libctdb stubs initial ctdb_getcapabilities() implementation
Martin Schwenke [Wed, 18 Jul 2012 05:57:13 +0000 (15:57 +1000)]
tests: libctdb stubs initial ctdb_getcapabilities() implementation

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: libctdb stubs must copy pointers rather than just returning them
Martin Schwenke [Wed, 18 Jul 2012 05:53:39 +0000 (15:53 +1000)]
tests: libctdb stubs must copy pointers rather than just returning them

Some code (e.g. NAT gateway code) modifies the returned result so was
modifying the original.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agolibctdb: add ctdb_getcapabilities()
Martin Schwenke [Wed, 18 Jul 2012 04:24:08 +0000 (14:24 +1000)]
libctdb: add ctdb_getcapabilities()

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Remove redundant filtering loop in control_natgwlist()
Martin Schwenke [Tue, 17 Jul 2012 11:25:27 +0000 (21:25 +1000)]
tools/ctdb: Remove redundant filtering loop in control_natgwlist()

This used to catch trailing blank lines.  However, these are caught
just as effectively by the whitespace filtering in the loop below.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: natgwlist output is either human readable or machine readable
Martin Schwenke [Tue, 17 Jul 2012 11:15:57 +0000 (21:15 +1000)]
tools/ctdb: natgwlist output is either human readable or machine readable

The first line is currently human readable and the rest is machine
readable.  This doesn't make sense.  Do one or the other...

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: Factor out printing of the machine readable status header
Martin Schwenke [Tue, 17 Jul 2012 11:09:46 +0000 (21:09 +1000)]
tools/ctdb: Factor out printing of the machine readable status header

It is already in 2 places and we might use it in another.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/ctdb: NAT gateway code should use CTDB_NATGW_NODES
Martin Schwenke [Mon, 16 Jul 2012 04:24:39 +0000 (14:24 +1000)]
tools/ctdb: NAT gateway code should use CTDB_NATGW_NODES

... not NATGW_NODES.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscripts: New policy routing test with invalid table ID
Martin Schwenke [Tue, 17 Jul 2012 10:46:58 +0000 (20:46 +1000)]
tests/eventscripts: New policy routing test with invalid table ID

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscripts: Modify ip stub to simulate invalid table ID
Martin Schwenke [Tue, 17 Jul 2012 10:45:23 +0000 (20:45 +1000)]
tests/eventscripts: Modify ip stub to simulate invalid table ID

This involves refactoring ip_route_check_table() into a new function
ip_check_table() which takes the operation type (i.e. rule/route) as
an argument.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoEventscripts: Indent error when a route delete fails in 11.per_ip_routing
Martin Schwenke [Tue, 17 Jul 2012 10:19:37 +0000 (20:19 +1000)]
Eventscripts: Indent error when a route delete fails in 11.per_ip_routing

This puts it under the umbrella of the previous warning that should
also have been printed.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscript: unit test for 13.per_ip_routing bogus route removal
Martin Schwenke [Tue, 19 Jun 2012 07:20:18 +0000 (17:20 +1000)]
tests/eventscript: unit test for 13.per_ip_routing bogus route removal

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoeventscripts: 13.per_ip_routing should remove bogus routes on ipreallocated
Martin Schwenke [Fri, 15 Jun 2012 07:22:02 +0000 (17:22 +1000)]
eventscripts: 13.per_ip_routing should remove bogus routes on ipreallocated

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscripts: Add a policy routing unit test for "ip rule del" failure
Martin Schwenke [Wed, 13 Jun 2012 03:53:18 +0000 (13:53 +1000)]
tests/eventscripts: Add a policy routing unit test for "ip rule del" failure

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoeventscripts: Print a warning on failure to delete a routing rule
Martin Schwenke [Wed, 13 Jun 2012 03:49:49 +0000 (13:49 +1000)]
eventscripts: Print a warning on failure to delete a routing rule

del_routing_for_ip() currently fails silently, which could hide real
errors.

In add_routing_for_ip() we don't want to see any error when calling
del_routing_for_ip(), since we don't expect the rule to be there.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agodoc: Fix path string of /etc/sysconfig/ctdb file
Amitay Isaacs [Fri, 17 Aug 2012 03:06:12 +0000 (13:06 +1000)]
doc: Fix path string of /etc/sysconfig/ctdb file

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agorecoverd: All inactive nodes should yield recovery master role
Martin Schwenke [Fri, 6 Jul 2012 10:43:46 +0000 (20:43 +1000)]
recoverd: All inactive nodes should yield recovery master role

Not just stopped nodes.  In reality, this means that banned nodes will
also yield, since nodes in the other inactive states won't be running
a daemon.

This seems sensible since if another node notices that an inactive
node is the recovery master then it will force an election anyway.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: An inactive node should not force recovery master elections
Martin Schwenke [Fri, 6 Jul 2012 10:36:48 +0000 (20:36 +1000)]
recoverd: An inactive node should not force recovery master elections

An inactive node can't become the recovery master.  So if an inactive
node notices that the recovery master is inactive, it shouldn't force
an election for recovery master and nominate itself as a candidate.
This can cause the recovery master to flip-flop between nodes when all
nodes are inactive.

If there is actually an active node then it will trigger the election.

This is fairly cosmetic but is a step along the way towards ironing
out weirdness when all nodes are stopped.

Also, fix a related comment.

Signed-off-by: Martin Schwenke <martin@meltin.net>
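
The rule in the commit above amounts to one extra condition. The sketch below is a model with hypothetical names, not the recovery daemon code.

```python
def should_force_election(local_node_active, recmaster_active):
    """Return True if this node should force a recovery master election.

    An inactive node cannot become recovery master, so it stays quiet
    even when it notices that the current recovery master is inactive.
    """
    return local_node_active and not recmaster_active
```

When every node is inactive the function returns False everywhere, so the recovery master role stops flip-flopping between nodes.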
11 years agorecoverd: main_loop() should not verify local IPs if node is stopped
Martin Schwenke [Tue, 3 Jul 2012 00:30:29 +0000 (10:30 +1000)]
recoverd: main_loop() should not verify local IPs if node is stopped

Doing these checks is pointless and potentially causes unnecessary log
messages.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: verify_local_ip_allocation() should dup ifaces before early return
Martin Schwenke [Tue, 3 Jul 2012 00:15:25 +0000 (10:15 +1000)]
recoverd: verify_local_ip_allocation() should dup ifaces before early return

If CTDB starts in STOPPED state then it thinks it is in the middle of
a recovery.  rec->ifaces is also NULL and an early exit further down
(that checks to see if a recovery is in process) means that it stays
that way.

However, each time this function is entered the need for a takeover
run is re-flagged.  The takeover run never happens due to the
early exit, causing a couple of unneeded messages to be logged each
time.

This is avoided by moving the code that sets rec->ifaces so that it is
executed earlier and, in this case, in the middle of a recovery.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: Update a log message that has bit-rotted
Martin Schwenke [Mon, 2 Jul 2012 07:26:04 +0000 (17:26 +1000)]
recoverd: Update a log message that has bit-rotted

This message used to be correct because the ipreallocated event only
handled updating the NAT gateway.  However, that has changed so the
message needs to be updated.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agorecoverd: Fix bogus info in message about changed flags
Martin Schwenke [Fri, 22 Jun 2012 04:01:02 +0000 (14:01 +1000)]
recoverd: Fix bogus info in message about changed flags

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscripts: Extra cases for policy routing missing config test
Martin Schwenke [Mon, 30 Jul 2012 02:51:43 +0000 (12:51 +1000)]
tests/eventscripts: Extra cases for policy routing missing config test

Test the startup and monitor events too.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoEventscripts: 13.per_ip_routing should always fail if config is missing
Martin Schwenke [Mon, 30 Jul 2012 02:51:12 +0000 (12:51 +1000)]
Eventscripts: 13.per_ip_routing should always fail if config is missing

Currently, if the configuration file is specified by
$CTDB_PER_IP_ROUTING_CONF but is missing, takeip fails but (the
absent) monitor event "succeeds", so the state of a node will
flip-flop.

Instead of this, if the configuration file is missing then fail early
on for all events.

Signed-off-by: Martin Schwenke <martin@meltin.net>
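
The "fail early for all events" behaviour above can be modelled as a check that runs before any event-specific handling. The eventscript itself is shell; this Python sketch uses a hypothetical function name, and treating an unset configuration path as "feature disabled, succeed" is an assumption.

```python
import os


def per_ip_routing_event(event, conf_path):
    """Return an exit status (0 success, 1 failure) for an event.

    conf_path models the value of $CTDB_PER_IP_ROUTING_CONF.
    """
    if not conf_path:
        return 0  # assumption: policy routing is not configured
    if not os.path.isfile(conf_path):
        # Same answer for takeip, monitor and every other event, so the
        # node's health no longer depends on which event ran last.
        return 1
    return 0  # real event handling would go here
```

Since monitor now fails along with takeip, the node stays unhealthy until the file is restored instead of alternating between states.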
11 years agoRevert "Eventscripts - make 13.per_ip_routing fail gracefully if config is missing"
Martin Schwenke [Mon, 30 Jul 2012 01:50:53 +0000 (11:50 +1000)]
Revert "Eventscripts - make 13.per_ip_routing fail gracefully if config is missing"

When the configuration file is missing this causes the node to
flip-flop between unhealthy (when takeip fails) and healthy (no monitor
event here).

Will reimplement this properly.

This reverts commit 351ca413eec460330571ca8b01ad269728fe15df.

11 years agoctdb tool: recmaster command might as well be auto-all
Martin Schwenke [Fri, 6 Jul 2012 10:35:23 +0000 (20:35 +1000)]
ctdb tool: recmaster command might as well be auto-all

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agodoc: Document the new onnode -P option
Martin Schwenke [Tue, 17 Jul 2012 06:52:04 +0000 (16:52 +1000)]
doc: Document the new onnode -P option

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotools/onnode: Add -P option to push files to given nodes
Martin Schwenke [Tue, 17 Jul 2012 06:45:55 +0000 (16:45 +1000)]
tools/onnode: Add -P option to push files to given nodes

A list of files is given rather than a command.  These files are
pushed to the specified nodes.

Quoting is fragile/broken so filenames with spaces won't work - you
win some, you lose some.  :-)

All of the other onnode options should work together with this option.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoEventscripts: Clean up 11.routing
Martin Schwenke [Tue, 17 Jul 2012 10:13:45 +0000 (20:13 +1000)]
Eventscripts: Clean up 11.routing

The loops can all be done without cat or grep.

The pair of loops in updateip is combined into a single loop.
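
A sketch of a loop that needs neither cat nor grep, assuming a file of
"iface dest gw" lines.  The function name, the file format and the
output format are assumptions for illustration, not taken from
11.routing itself.

```shell
# Hypothetical: print "dest via gw" for each route on the given interface.
routes_for_iface ()
{
    _iface="$1"
    _file="$2"

    # Redirect the file into the loop instead of piping from cat
    while read _i _dest _gw ; do
        # Skip comments and blank lines instead of filtering with grep
        case "$_i" in
        "#"*|"") continue ;;
        esac

        [ "$_i" = "$_iface" ] || continue
        echo "$_dest via $_gw"
    done <"$_file"
}
```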

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoctdbd: Log a meaningful message if the nodes file/list is empty
Martin Schwenke [Tue, 3 Jul 2012 21:21:01 +0000 (07:21 +1000)]
ctdbd: Log a meaningful message if the nodes file/list is empty

Right now the message says it can't bind to any of the
addresses... even when there aren't any!

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoctdbd: Remove the word "Forced" from message about running eventscripts
Martin Schwenke [Mon, 2 Jul 2012 07:15:42 +0000 (17:15 +1000)]
ctdbd: Remove the word "Forced" from message about running eventscripts

The eventscripts are run after a takeover run and in this case they're
not forced.  The message seems to imply that someone has run "ctdb
eventscript" when that is not necessarily the case.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoctdbd: Fix ctdb_control_release_ip() on local daemons
Martin Schwenke [Mon, 2 Jul 2012 04:09:32 +0000 (14:09 +1000)]
ctdbd: Fix ctdb_control_release_ip() on local daemons

When running on local daemons no IPs are actually assigned to
interfaces.  Commit 9a806dec8687e2ec08a308853b61af6aed5e5d1e broke
ctdb_control_release_ip() for local daemons because it asks the system
which interface the given IP is on, instead of the old behaviour of
trusting CTDB's internal records.

For local daemons (i.e. !ctdb->do_checkpublicip) revert to the old
behaviour of looking up the interface internally.  This is good
enough, given that the tests don't tend to misconfigure the addresses.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoInitscript: clean up drop_all_public_ips()
Martin Schwenke [Tue, 17 Jul 2012 05:45:45 +0000 (15:45 +1000)]
Initscript: clean up drop_all_public_ips()

This makes the case implicit where $CTDB_PUBLIC_ADDRESSES is unset.
This is OK because that's not an interesting code path.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/tool: Run ctdb_tool_* under $VALGRIND
Martin Schwenke [Fri, 20 Jul 2012 07:00:12 +0000 (17:00 +1000)]
tests/tool: Run ctdb_tool_* under $VALGRIND

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/eventscripts: Rewrite the testparm stub
Martin Schwenke [Tue, 3 Jul 2012 21:29:18 +0000 (07:29 +1000)]
tests/eventscripts: Rewrite the testparm stub

It currently needs the real testparm command installed even though it
only uses limited features.  It is easy enough to fake up the
functionality that 50.samba uses.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/complex: Fix broken ctdb_test_check_real_cluster()
Martin Schwenke [Tue, 3 Jul 2012 03:05:58 +0000 (13:05 +1000)]
tests/complex: Fix broken ctdb_test_check_real_cluster()

It doesn't set $h at all...

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/simple: ctdb stop/continue tests weren't actually checking IPs
Martin Schwenke [Mon, 2 Jul 2012 04:18:51 +0000 (14:18 +1000)]
tests/simple: ctdb stop/continue tests weren't actually checking IPs

The correct variable is $test_node_ips, not $ips.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: select_test_node_and_ips() should try to avoid failing
Martin Schwenke [Mon, 2 Jul 2012 04:06:35 +0000 (14:06 +1000)]
tests: select_test_node_and_ips() should try to avoid failing

Sometimes "ctdb sync" doesn't do its job, so we end up with unassigned
IPs.

If $test_node isn't set then this is bad.  However, try a few times to
ensure it is set.
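
The "try a few times" approach can be sketched as a generic retry
wrapper.  The names retry and $RETRY_DELAY are hypothetical and are
not part of the test suite.

```shell
# Hypothetical: run a command up to <tries> times, stopping at the
# first success.  $RETRY_DELAY (seconds, default 1) is an assumption.
retry ()
{
    _tries="$1" ; shift
    _n=0

    while [ "$_n" -lt "$_tries" ] ; do
        "$@" && return 0
        _n=$(( _n + 1 ))
        sleep "${RETRY_DELAY:-1}"
    done

    return 1
}
```

The wrapped command would be whatever step sets $test_node, and the
caller still has to treat a final failure as an error.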

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: simple tests against local daemons should check $TEST_LOCAL_DEAMONS
Martin Schwenke [Mon, 2 Jul 2012 04:05:21 +0000 (14:05 +1000)]
tests: simple tests against local daemons should check $TEST_LOCAL_DEAMONS

Not the old $CTDB_TEST_REAL_CLUSTER - it doesn't exist anymore...

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: run_tests should exit with $status with -e option
Martin Schwenke [Wed, 20 Jun 2012 05:57:48 +0000 (15:57 +1000)]
tests: run_tests should exit with $status with -e option

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests/simple: ctdb reloadips test should use $test_ip
Martin Schwenke [Thu, 14 Jun 2012 09:37:39 +0000 (19:37 +1000)]
tests/simple: ctdb reloadips test should use $test_ip

There's no point recalculating this value.

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agotests: select_test_node_and_ips() should never select non-node -1
Martin Schwenke [Thu, 14 Jun 2012 09:36:04 +0000 (19:36 +1000)]
tests: select_test_node_and_ips() should never select non-node -1

Instead of selecting the 1st pnn found, select the 1st one that isn't -1.
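
A sketch of the selection, assuming input lines of the form
"<ip> <pnn>" on stdin.  Both the function name and the input format
are assumptions for illustration.

```shell
# Hypothetical: print the first pnn that is not -1, fail if there is none.
first_assigned_pnn ()
{
    while read _ip _pnn ; do
        # Skip unassigned IPs (pnn of -1) and malformed lines
        if [ -n "$_pnn" ] && [ "$_pnn" != "-1" ] ; then
            echo "$_pnn"
            return 0
        fi
    done

    return 1
}
```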

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoutil: Do not lock down memory when running with local daemons
Amitay Isaacs [Thu, 26 Jul 2012 12:01:50 +0000 (22:01 +1000)]
util: Do not lock down memory when running with local daemons

Thanks to Ronnie for highlighting the issue of memory lockdown on AIX.
Fix typo, use getuid and not getpid.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
11 years agostatd-callout: Fix a bug in the calculations of $STATE
Martin Schwenke [Thu, 5 Jul 2012 06:27:54 +0000 (16:27 +1000)]
statd-callout: Fix a bug in the calculations of $STATE

It is just meant to be even, so it needs to be divided *and* multiplied
by 2.  Use $(( )) to make it more readable.

While touching this code, make the related calculation a bit more
readable too.
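
The rounding can be sketched with shell arithmetic.  The function name
make_even is hypothetical; integer division discards the remainder, so
multiplying back by 2 always yields an even number.

```shell
# Hypothetical: round a non-negative integer down to the nearest even number.
make_even ()
{
    echo $(( $1 / 2 * 2 ))
}
```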

Signed-off-by: Martin Schwenke <martin@meltin.net>
11 years agoEventscripts: Default route on NAT gateway should have a metric of 10
Martin Schwenke [Tue, 24 Jul 2012 01:23:09 +0000 (11:23 +1000)]
Eventscripts: Default route on NAT gateway should have a metric of 10

At the moment routes from 11.routing can fail to be added because they
conflict with the default route added by 11.natgw.

NAT gateway is meant to be a last resort, so routes from 11.routing
should override it.
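
A sketch of the resulting route command, printed rather than executed
so that it can be tried without root.  The function name and its
arguments are assumptions; only the metric of 10 comes from this
change.

```shell
# Hypothetical: print the command for a NAT gateway default route.
# A route with a lower metric to the same destination is preferred,
# so routes added elsewhere without a metric take precedence.
natgw_default_route_cmd ()
{
    _gw="$1"
    _metric="${2:-10}"

    echo "ip route add 0.0.0.0/0 via $_gw metric $_metric"
}
```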

Signed-off-by: Martin Schwenke <martin@meltin.net>