Martin Schwenke [Mon, 22 Oct 2012 01:19:07 +0000 (12:19 +1100)]
doc: getlog and clearlog changes for recovery daemon logs
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 18 Oct 2012 03:15:09 +0000 (14:15 +1100)]
tests: Local daemons should use the logging ringbuffer
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 18 Oct 2012 03:13:30 +0000 (14:13 +1100)]
tools/ctdb: Merge recoverd log handling into getlog/clearlog
We don't need extra commands for these.
Also, allow a default value of NOTICE for the getlog level.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 16 Oct 2012 09:57:31 +0000 (20:57 +1100)]
tools/ctdb: Add log ringbuffer handling for recoverd
This adds the commands rdgetlog and rdclearlog.
These are analogous to getlog and clearlog but operate on the logs for
the recovery daemon.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 16 Oct 2012 09:54:39 +0000 (20:54 +1100)]
recoverd: Add CTDB_SRVID_GETLOG and CTDB_SRVID_CLEARLOG
These support getting and clearing logs from the ring-buffer in the
recovery daemon.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Sun, 21 Oct 2012 22:01:27 +0000 (09:01 +1100)]
build: Set CTDB_PATH to /tmp/ctdb.socket if SOCKPATH is not defined
When building samba with CTDB, if samba configure/waf does not support
setting of SOCKPATH, fall back to /tmp/ctdb.socket.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
David Disseldorp [Thu, 18 Oct 2012 14:55:19 +0000 (16:55 +0200)]
Build: Set the default ctdb socket path at configure time
The ctdb socket path currently defaults to /tmp/ctdb.socket and can be
modified at runtime using the --socket=filename option, common to both
ctdb and ctdbd binaries.
This change allows the default path to be set at configure time using
the --with-socketpath=FILE argument. When not specified, the default
path remains /tmp/ctdb.socket, so the documentation remains unchanged.
Signed-off-by: David Disseldorp <ddiss@samba.org>
Amitay Isaacs [Tue, 25 Sep 2012 07:29:50 +0000 (17:29 +1000)]
locking: Do not use ctdb_kill() to kill smbd processes
ctdb_kill() is used to terminate processes spawned by CTDB.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 11 Jul 2012 05:15:41 +0000 (15:15 +1000)]
locking: Add database priority handling for older versions of samba
In samba versions 3.6.x and older, database priorities are not set.
later_db() function implements higher database priority (locking order)
for these databases -
brlock, g_lock, notify_onelevel, serverid, xattr_tdb
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 9 Jul 2012 07:37:35 +0000 (17:37 +1000)]
locking: Schedule a new lock request every time a lock is released
Since the number of active lock requests is limited to
MAX_LOCK_PROCESSES_PER_DB (= 100), any new requests won't get scheduled
when they are created. So schedule a pending request once current active
request is done.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 14 Jun 2012 06:12:48 +0000 (16:12 +1000)]
ctdbd: Replace lockwait with locking API and remove ctdb_lockwait.c
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 9 May 2012 05:17:21 +0000 (15:17 +1000)]
ctdb_recover: Replace static locking functions with locking API
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 9 May 2012 05:09:51 +0000 (15:09 +1000)]
ctdb_freeze: Replace locking functions with locking API
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 9 May 2012 05:10:20 +0000 (15:10 +1000)]
ctdbd_test: Include ctdb_lock.c code for test stubs
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 17 May 2012 05:25:46 +0000 (15:25 +1000)]
tests: Fix statistics test for new output lines from locking API
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 9 May 2012 02:58:19 +0000 (12:58 +1000)]
tools/ctdb: Display the locking statistics
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Thu, 11 Oct 2012 00:29:29 +0000 (11:29 +1100)]
ctdbd: locking: Provide non-blocking API for locking of TDB record/db/alldb
This introduces a consistent API for handling locks on single record, complete
db or all dbs. The locks are taken out in a child process. In cases of timeout,
find the processes that currently hold the lock and log them.
Callback functions for locking requests take locked boolean to indicate
whether the lock was successfully obtained or not.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 6 Jun 2012 01:50:25 +0000 (11:50 +1000)]
common: Add routines to get process and lock information
Currently these functions are implemented only for Linux.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 9 May 2012 02:56:53 +0000 (12:56 +1000)]
header: Added DB statistics update macros
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Tue, 16 Oct 2012 06:04:48 +0000 (17:04 +1100)]
scripts: Refactor logging code in initscript and functions file
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 05:21:02 +0000 (16:21 +1100)]
tools/ctdb_diagnostics: Add "ctdb listvars" output
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 05:18:26 +0000 (16:18 +1100)]
initscript: Check that rc.ctdb is executable before running it
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 05:10:19 +0000 (16:10 +1100)]
ctdbd: Remove references to forcing running of eventscripts from log messages
Running of eventscripts can be initiated from many places, including
the recovery daemon.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 04:59:00 +0000 (15:59 +1100)]
recoverd: Clarify some misleading log messages
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 04:49:13 +0000 (15:49 +1100)]
tools/ctdb: Remove extra header from natgwlist -Y output
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 04:17:54 +0000 (15:17 +1100)]
recoverd: Verifying local IPs should only check for unhosted available IPs
Currently it checks for unhosted IPs among the known IPs rather than
available IPs. This means that a takeover run can be flagged even
when that takeover run will be unable to assign a known, unhosted IP.
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 11 Oct 2012 03:34:37 +0000 (14:34 +1100)]
Revert "Eventscripts - add facility to 10.interface to delete unmanaged IPs"
This reverts commit 88f88d86b0d08240f749fb721b8c401c2eeb1099.
This is dangerous and, on reflection, I can't see it being useful.
There are often permanent IPs on interfaces that CTDB shares with its
public IPs.
Martin Schwenke [Wed, 26 Sep 2012 04:37:49 +0000 (14:37 +1000)]
Eventscripts: "recovered" event should not fail on NATGW failure
The recovery process has no protection against the "recovered" event
failing, so this can cause a recovery loop.
Instead of failing the "recovered" event, add a "monitor" event and
fail that instead. In this case the failure semantics are well
defined.
A separate patch should ban nodes if the "recovered" event fails for
an unknown reason.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 27 Sep 2012 23:39:12 +0000 (09:39 +1000)]
Logging: Map TEVENT_DEBUG_FATAL to DEBUG_CRIT
This is currently mapped to DEBUG_EMERG. CTDB really has no business
logging anything at EMERG level since the whole system is not about to
abort or catch fire. EMERG causes the message to appear on the
console and on every terminal. That's a bit overzealous!
There would be very few situations where logs are being filtered at a
level below ERROR, so CRIT should certainly suffice.
The trigger for this was curious messages saying "No event for <n>
seconds!" logged in a user's terminal.
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 6 Sep 2012 10:22:38 +0000 (20:22 +1000)]
common: Debug ctdb_addr_to_str() using new function ctdb_external_trace()
We've seen this function report "Unknown family, 0" and then CTDB
disappeared without a trace. If we can reproduce it then this might
help us to debug it.
The idea is that you do something like the following in /etc/sysconfig/ctdb:
export CTDB_EXTERNAL_TRACE="/etc/ctdb/config/gcore_trace.sh"
When we hit this error then we call out to gcore to get a core file so
we can do forensics. This might block CTDB for a few seconds.
Signed-off-by: Martin Schwenke <martin@meltin.net>
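A rough sketch of what a trace script such as the gcore_trace.sh mentioned above might contain. This is not the shipped script: it assumes the PID to dump is passed as the first argument, and the names GCORE, CORE_PREFIX and trace_pid are hypothetical and used only for illustration.

```shell
# Hypothetical trace script body. Assumption: the PID of the process
# to dump arrives as the first argument.
GCORE="${GCORE:-gcore}"                      # overridable, e.g. for testing
CORE_PREFIX="${CORE_PREFIX:-/var/log/core}"  # assumed output prefix

trace_pid ()
{
    # gcore writes the core file to <prefix>.<pid>
    "$GCORE" -o "$CORE_PREFIX" "$1"
}

# Illustrative invocation from the script's main body:
#   trace_pid "$1"
```

Since gcore stops the target while it writes the core file, this matches the note above that CTDB might block for a few seconds.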
Michael Adam [Wed, 17 Oct 2012 12:21:33 +0000 (14:21 +0200)]
config/functions: fix a comment
ctdb_check_counter_limits does not fail but succeeds if count >= limit
Signed-off-by: Michael Adam <obnox@samba.org>
Amitay Isaacs [Wed, 17 Oct 2012 00:38:37 +0000 (11:38 +1100)]
doc: Add info about execute permissions on event scripts
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Wed, 17 Oct 2012 00:38:59 +0000 (11:38 +1100)]
doc: Fix documentation for setup event
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Amitay Isaacs [Mon, 3 Sep 2012 02:39:36 +0000 (12:39 +1000)]
scripts: Remove duplicate code from init script to set tunables
The tunable variables defined in CTDB configuration file are currently
set up from init script as well as part of "setup" event in 00.ctdb
eventscript. Remove the duplication of this code and set tunable
variables only from setup event. During the "setup" event, it's possible
that ctdb tool commands can time out if the CTDB daemon is not ready. To
guard against this, wait until the "ctdb ping" command succeeds before
executing any other ctdb tool commands.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
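The wait-for-ping guard described in the commit above could look roughly like the following. This is a minimal sketch, not the code from 00.ctdb: the function name wait_until_ready, the attempt limit and the one-second delay are all assumptions made for illustration.

```shell
# Hypothetical helper: retry a readiness probe until it succeeds or
# the attempts run out.
# $1 = maximum number of attempts, remaining arguments = probe command.
wait_until_ready ()
{
    _max="$1"
    shift
    _count=0
    while [ "$_count" -lt "$_max" ] ; do
        if "$@" >/dev/null 2>&1 ; then
            return 0
        fi
        _count=$((_count + 1))
        sleep 1
    done
    return 1
}

# Illustrative use at the start of a "setup" event:
#   wait_until_ready 30 ctdb ping || exit 1
```

Passing the probe as arguments keeps the loop independent of the ctdb tool, so it can be exercised with any command.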
Amitay Isaacs [Wed, 17 Oct 2012 00:24:57 +0000 (11:24 +1100)]
doc: Fix the hyperlink for "Testing CTDB" page
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Wed, 10 Oct 2012 04:03:06 +0000 (15:03 +1100)]
tests/eventscripts: add unit tests for policy routing reconfigure
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 10 Oct 2012 03:48:59 +0000 (14:48 +1100)]
tests/eventscripts: add extra infrastructure for policy routing tests
Less copying and pasting is a good thing...
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 3 Aug 2012 00:54:30 +0000 (10:54 +1000)]
Eventscripts: Add support for "reconfigure" pseudo-event for policy routing
This rebuilds all policy routes and can be used if the configuration
changes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 24 Sep 2012 04:32:04 +0000 (14:32 +1000)]
recoverd: Track failure of "recovered" event, banning culprits
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 30 Aug 2012 23:34:17 +0000 (09:34 +1000)]
recoverd: When starting a takeover run disable IP verification
Disable for TakeoverTimeout seconds.
Otherwise the recovery daemon can get overzealous and start trying
to add/delete addresses that it thinks are missing but where the
eventscript just hasn't finished. This didn't use to matter so much
but it is more important now that concurrent takeip/releaseip/updateip
events generate errors - we want to avoid spamming the log.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 11 Jul 2012 04:46:07 +0000 (14:46 +1000)]
ctdbd: Stop takeovers and releases from colliding in mid-air
There's a race here where release and takeover events for an IP can
run at the same time. For example, a "ctdb deleteip" and a takeover
initiated by the recovery daemon. The timeline is as follows:
1. The release code registers a callback to update the VNN. The
callback is executed *after* the eventscripts run the releaseip
event.
2. The release code calls the eventscripts for the releaseip event,
removing IP from its interface.
The takeover code "updates" the VNN saying that IP is on some
iface.... even if/though the address is already there.
3. The release callback runs, removing the iface associated with IP in
the VNN.
The takeover code calls the eventscripts for the takeip event,
adding IP to an interface.
As a result, CTDB doesn't think it should be hosting IP but IP is on
an interface. The recovery daemon fixes this later... but it
shouldn't happen.
This patch can cause some additional noise in the logs:
Release of IP 10.0.2.133/24 on interface eth2 node:2
recoverd:We are still serving a public address '10.0.2.133' that we should not be serving. Removing it.
Release of IP 10.0.2.133/24 rejected update for this IP already in flight
recoverd:client/ctdb_client.c:2455 ctdb_control for release_ip failed
recoverd:Failed to release local ip address
In this case the node has started releasing an IP when the recovery
daemon notices the address is still hosted and initiates another
release. This noise is harmless but annoying.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 28 Aug 2012 05:17:29 +0000 (15:17 +1000)]
ctdbd: New tunable NoIPTakeoverOnDisabled
Stops the behaviour where unhealthy nodes can host IPs when there are
no healthy nodes. Set this to 1 when an immediate complete outage is
preferred when all nodes are unhealthy. The alternative
(i.e. default) can lead to undefined behaviour when the shared
filesystem is unavailable.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 21 Aug 2012 05:52:03 +0000 (15:52 +1000)]
Eventscripts: Add service-start and service-stop pseudo-events
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 15 Aug 2012 05:28:14 +0000 (15:28 +1000)]
ctdbd: Avoid unnecessary updateip event
The existing code makes one fatally bad assumption:
vnn->iface->references can never be -1 (or max uint32_t in this case).
Right now the reference counting is broken so a reference count of -1
is possible and causes a spurious updateip when vnn->iface is the same
as best_iface. This can occur frequently because we get a lot of
redundant takeovers, especially when each IP can only be hosted on one
interface.
This makes the code much more defensive by noting that when best_iface
is the same as vnn->iface there is never a need for an updateip event.
This effectively neuters the updateip code path when IPs can only be
hosted by a single interface.
This should obsolete 6a74515f0a1e24d97cee3ba05d89133aac7ad2b7.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Volker Lendecke [Tue, 9 Oct 2012 09:39:58 +0000 (11:39 +0200)]
Correct include for ctdb_protocol.h
With an old ctdb_protocol.h installed under /usr/local, ctdb will
not compile because the <> form of include will find the header
under /usr/local
Amitay Isaacs [Thu, 20 Sep 2012 07:10:34 +0000 (17:10 +1000)]
Revert "when creating/adding a public ip, set the initial interface to be the first interface specified"
This reverts commit 4308935ba48ac7a29e7523315acf580019715f0f.
This fixes 16_ctdb_config_add_ip.sh test when run against local daemons. When
running against local daemons, if the interface is assigned as soon as an IP is
added, then takeover would never assign this IP address.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Tue, 2 Oct 2012 01:51:24 +0000 (11:51 +1000)]
util: ctdb_fork() closes all sockets opened by the main daemon
Do some other housekeeping including stopping tevent.
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 3 Sep 2012 05:37:01 +0000 (15:37 +1000)]
eventscripts: Auto-start/stop services in background
If $CTDB_SERVICE_AUTOSTARTSTOP="yes" then service start/stop is done
in the background with logging.
Fix some unit tests for samba and winbind.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 16 Aug 2012 04:41:11 +0000 (14:41 +1000)]
Eventscripts: split 50.samba into 49.winbind and 50.samba
winbind and samba can be separately managed. This makes the service
starting and stopping code way too complicated, and even adds a small
amount of complexity to the monitoring code. The sensible option is
to split this eventscript in two.
There are two potentially backward incompatible changes here:
* Functionality has been removed that allowed 50.samba to manage
winbind when CTDB_MANAGES_WINBIND was unset but the smb.conf
"security" parameter was set to "ADS" or "DOMAIN".
Maintaining this functionality would have required moving the
testparm-related code to the functions file, deciding where the
cache file should go, and then calling it from both 49.winbind and
50.samba. This feature wasn't of great value and asking
administrators to set an extra variable in exchange for code
simplicity seems like a reasonable deal.
* External code will need to be changed if it calls 50.samba directly
with winbind-related expectations. This is fairly obvious!
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 21 Aug 2012 04:28:37 +0000 (14:28 +1000)]
Initscript: Kill any existing ctdbd processes if the ping succeeds
Initialising a new ctdbd will destroy the Unix domain socket so
existing processes will be useless anyway.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 20 Aug 2012 05:02:24 +0000 (15:02 +1000)]
tools/ctdb: Free the event context
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 20 Aug 2012 04:30:35 +0000 (14:30 +1000)]
libctdb: Add comments to effect that some controls return result in status
These controls include:
CTDB_CONTROL_GET_RECMODE
CTDB_CONTROL_GET_RECMASTER
CTDB_CONTROL_GET_PID
CTDB_CONTROL_GET_PNN
CTDB_CONTROL_PING
CTDB_CONTROL_GET_DB_PRIORITY
In these cases the data field is empty.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 07:05:03 +0000 (17:05 +1000)]
tests/tool: New tests for natgwlist, getcapabilities, lvs, lvsmaster
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 07:02:38 +0000 (17:02 +1000)]
tests/tool: New function setup_natgw() to setup $CTDB_NATGW_NODES
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 06:59:19 +0000 (16:59 +1000)]
tools/ctdb: Clean up control_natgw()
* Factor out repeated code into new function find_natgw()
* Support both machine and human readable output
* Use libctdb
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 06:57:01 +0000 (16:57 +1000)]
tools/ctdb: Convert some commands over to libctdb
control_getcapabilities(), control_lvs(), control_lvsmaster() updated
to use ctdb_getcapabilities(), ctdb_getnodemap() as appropriate.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 05:57:13 +0000 (15:57 +1000)]
tests: libctdb stubs initial ctdb_getcapabilities() implementation
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 05:53:39 +0000 (15:53 +1000)]
tests: libctdb stubs must copy pointers rather than just returning them
Some code (e.g. NAT gateway code) modifies the returned result so was
modifying the original.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 18 Jul 2012 04:24:08 +0000 (14:24 +1000)]
libctdb: add ctdb_getcapabilities()
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 11:25:27 +0000 (21:25 +1000)]
tools/ctdb: Remove redundant filtering loop in control_natgwlist()
This used to catch trailing blank lines. However, these are caught
just as effectively by the whitespace filtering in the loop below.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 11:15:57 +0000 (21:15 +1000)]
tools/ctdb: natgwlist output is either human readable or machine readable
The first line is currently human readable and the rest is machine
readable. This doesn't make sense. Do one or the other...
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 11:09:46 +0000 (21:09 +1000)]
tools/ctdb: Factor out printing of the machine readable status header
It is already in 2 places and we might use it in another.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 16 Jul 2012 04:24:39 +0000 (14:24 +1000)]
tools/ctdb: NAT gateway code should use CTDB_NATGW_NODES
... not NATGW_NODES.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 10:46:58 +0000 (20:46 +1000)]
tests/eventscripts: New policy routing test with invalid table ID
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 10:45:23 +0000 (20:45 +1000)]
tests/eventscripts: Modify ip stub to simulate invalid table ID
This involves refactoring ip_route_check_table() into a new function
ip_check_table() which takes the operation type (i.e. rule/route) as
an argument.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 10:19:37 +0000 (20:19 +1000)]
Eventscripts: Indent error when a route delete fails in 11.per_ip_routing
This puts it under the umbrella of the previous warning that should
also have been printed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 19 Jun 2012 07:20:18 +0000 (17:20 +1000)]
tests/eventscript: unit test for 13.per_ip_routing bogus route removal
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 15 Jun 2012 07:22:02 +0000 (17:22 +1000)]
eventscripts: 13.per_ip_routing should remove bogus routes on ipreallocated
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 13 Jun 2012 03:53:18 +0000 (13:53 +1000)]
tests/eventscripts: Add a policy routing unit test for "ip rule del" failure
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 13 Jun 2012 03:49:49 +0000 (13:49 +1000)]
eventscripts: Print a warning on failure to delete a routing rule
del_routing_for_ip() currently fails silently, which could hide real
errors.
In add_routing_for_ip() we don't want to see any error when calling
del_routing_for_ip(), since we don't expect the rule to be there.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Amitay Isaacs [Fri, 17 Aug 2012 03:06:12 +0000 (13:06 +1000)]
doc: Fix path string of /etc/sysconfig/ctdb file
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Fri, 6 Jul 2012 10:43:46 +0000 (20:43 +1000)]
recoverd: All inactive nodes should yield recovery master role
Not just stopped nodes. In reality, this means that banned nodes will
also yield, since nodes in the other inactive states won't be running
a daemon.
This seems sensible since if another node notices that an inactive
node is the recovery master then it will force an election anyway.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 6 Jul 2012 10:36:48 +0000 (20:36 +1000)]
recoverd: An inactive node should not force recovery master elections
An inactive node can't become the recovery master. So if an inactive
node notices that the recovery master is inactive, it shouldn't force
an election for recovery master and nominate itself as a candidate.
This can cause the recovery master to flip-flop between nodes when all
nodes are inactive.
If there is actually an active node then it will trigger the election.
This is fairly cosmetic but is a step along the way towards ironing
out weirdness when all nodes are stopped.
Also, fix a related comment.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 3 Jul 2012 00:30:29 +0000 (10:30 +1000)]
recoverd: main_loop() should not verify local IPs if node is stopped
Doing these checks is pointless and potentially causes unnecessary log
messages.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 3 Jul 2012 00:15:25 +0000 (10:15 +1000)]
recoverd: verify_local_ip_allocation() should dup ifaces before early return
If CTDB starts in STOPPED state then it thinks it is in the middle of
a recovery. rec->ifaces is also NULL and an early exit further down
(that checks to see if a recovery is in process) means that it stays
that way.
However, each time this function is entered the need for a takeover
run is re-flagged. The takeover run never happens due to the
early exit, causing a couple of unneeded messages to be logged each
time.
This is avoided by moving the code that sets rec->ifaces so that it is
executed earlier and, in this case, in the middle of a recovery.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 2 Jul 2012 07:26:04 +0000 (17:26 +1000)]
recoverd: Update a log message that has bit-rotted
This message used to be correct because the ipreallocated event only
handled updating the NAT gateway. However, that has changed so the
message needs to be updated.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 22 Jun 2012 04:01:02 +0000 (14:01 +1000)]
recoverd: Fix bogus info in message about changed flags
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 30 Jul 2012 02:51:43 +0000 (12:51 +1000)]
tests/eventscripts: Extra cases for policy routing missing config test
Test the startup and monitor events too.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 30 Jul 2012 02:51:12 +0000 (12:51 +1000)]
Eventscripts: 13.per_ip_routing should always fail if config is missing
Currently, if the configuration file is specified by
$CTDB_PER_IP_ROUTING_CONF but is missing, takeip fails but (the
absent) monitor event "succeeds", so the state of a node will
flip-flop.
Instead of this, if the configuration file is missing then fail early
on for all events.
Signed-off-by: Martin Schwenke <martin@meltin.net>
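The "fail early for all events" idea in the commit above can be sketched as a check that runs before any event-specific handling. This is an illustration under stated assumptions, not the code in 13.per_ip_routing: the function name check_config and the message text are hypothetical.

```shell
# Hypothetical early check: fail if the named configuration file is
# missing. $1 = path to the configuration file.
check_config ()
{
    if [ ! -f "$1" ] ; then
        echo "error: configuration file \"$1\" not found" >&2
        return 1
    fi
    return 0
}

# Illustrative use near the top of an eventscript, before the
# per-event case statement, so that every event fails consistently:
#   check_config "$CTDB_PER_IP_ROUTING_CONF" || exit 1
```

Because the check sits ahead of the event dispatch, takeip and monitor report the same result and the node state no longer flip-flops.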
Martin Schwenke [Mon, 30 Jul 2012 01:50:53 +0000 (11:50 +1000)]
Revert "Eventscripts - make 13.per_ip_routing fail gracefully if config is missing"
When the configuration file is missing this causes the node to
flip-flop between unhealthy (when takeip fails) and healthy (no monitor
event here).
Will reimplement this properly.
This reverts commit 351ca413eec460330571ca8b01ad269728fe15df.
Martin Schwenke [Fri, 6 Jul 2012 10:35:23 +0000 (20:35 +1000)]
ctdb tool: recmaster command might as well be auto-all
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 06:52:04 +0000 (16:52 +1000)]
doc: Document the new onnode -P option
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 06:45:55 +0000 (16:45 +1000)]
tools/onnode: Add -P option to push files to given nodes
A list of files is given rather than a command. These files are
pushed to the specified nodes.
Quoting is fragile/broken so filenames with spaces won't work - you
win some, you lose some. :-)
All of the other onnode options should work together with this option.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 10:13:45 +0000 (20:13 +1000)]
Eventscripts: Clean up 11.routing
The loops can all be done without cat or grep.
The pair of loops in updateip is combined into a single loop.
Signed-off-by: Martin Schwenke <martin@meltin.net>
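As an illustration of looping over a file without cat or grep, as in the 11.routing cleanup above, the shell's own read and case can do the filtering. This is a sketch only: the function name list_routes and the three-field "interface destination gateway" line format are assumptions, not taken from 11.routing.

```shell
# Print the fields of each non-blank, non-comment line of a file using
# only shell builtins. Assumed line format: "IFACE DEST GW".
# $1 = path to the file.
list_routes ()
{
    while read -r _iface _dest _gw ; do
        case "$_iface" in
            ""|\#*) continue ;;   # skip blank lines and comments
        esac
        echo "$_iface $_dest $_gw"
    done < "$1"
}
```

Redirecting the file into the loop avoids the extra processes, and the case pattern replaces a separate grep for comment lines.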
Martin Schwenke [Tue, 3 Jul 2012 21:21:01 +0000 (07:21 +1000)]
ctdbd: Log a meaningful message if the nodes file/list is empty
Right now the message says it can't bind to any of the
addresses... even when there aren't any!
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 2 Jul 2012 07:15:42 +0000 (17:15 +1000)]
ctdbd: Remove the word "Forced" from message about running eventscripts
The eventscripts are run after a takeover run and in this case they're
not forced. The message seems to imply that someone has run "ctdb
eventscript" when that is not necessarily the case.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 2 Jul 2012 04:09:32 +0000 (14:09 +1000)]
ctdbd: Fix ctdb_control_release_ip() on local daemons
When running on local daemons no IPs are actually assigned to
interfaces. Commit 9a806dec8687e2ec08a308853b61af6aed5e5d1e broke
ctdb_control_release_ip() for local daemons because it asks the system
which interface the given IP is on, instead of the old behaviour of
trusting CTDB's internal records.
For local daemons (i.e. !ctdb->do_checkpublicip) revert to the old
behaviour of looking up the interface internally. This is good
enough, given that the tests don't tend to misconfigure the addresses.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 17 Jul 2012 05:45:45 +0000 (15:45 +1000)]
Initscript: clean up drop_all_public_ips()
This makes the case implicit where $CTDB_PUBLIC_ADDRESSES is unset.
This is OK because that's not an interesting code path.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 20 Jul 2012 07:00:12 +0000 (17:00 +1000)]
tests/tool: Run ctdb_tool_* under $VALGRIND
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 3 Jul 2012 21:29:18 +0000 (07:29 +1000)]
tests/eventscripts: Rewrite the testparm stub
It currently needs the real testparm command installed even though it
only uses limited features. It is easy enough to fake up the
functionality that 50.samba uses.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 3 Jul 2012 03:05:58 +0000 (13:05 +1000)]
tests/complex: Fix broken ctdb_test_check_real_cluster()
It doesn't set $h at all...
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 2 Jul 2012 04:18:51 +0000 (14:18 +1000)]
tests/simple: ctdb stop/continue tests weren't actually checking IPs
The correct variable is $test_node_ips, not $ips.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 2 Jul 2012 04:06:35 +0000 (14:06 +1000)]
tests: select_test_node_and_ips() should try to avoid failing
Sometimes "ctdb sync" doesn't do its job, so we end up with unassigned
IPs.
If $test_node isn't set then this is bad. However, try a few times to
ensure it is set.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 2 Jul 2012 04:05:21 +0000 (14:05 +1000)]
tests: simple tests against local daemons should check $TEST_LOCAL_DAEMONS
Not the old $CTDB_TEST_REAL_CLUSTER - it doesn't exist anymore...
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 20 Jun 2012 05:57:48 +0000 (15:57 +1000)]
tests: run_tests should exit with $status with -e option
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 14 Jun 2012 09:37:39 +0000 (19:37 +1000)]
tests/simple: ctdb reloadips test should use $test_ip
There's no point recalculating this value.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Thu, 14 Jun 2012 09:36:04 +0000 (19:36 +1000)]
tests: select_test_node_and_ips() should never select non-node -1
Instead of selecting the 1st pnn found, select the 1st one that isn't -1.
Signed-off-by: Martin Schwenke <martin@meltin.net>
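The selection rule in the commit above (take the first pnn that isn't -1) can be sketched as follows. This is not the code from select_test_node_and_ips(): the function name first_hosted_pnn and the "IP PNN" per-line input format are assumptions for illustration.

```shell
# Read "IP PNN" pairs from stdin (assumed format) and print the first
# PNN that is not -1. Returns non-zero if every IP is unassigned.
first_hosted_pnn ()
{
    while read -r _ip _pnn ; do
        if [ "$_pnn" != "-1" ] ; then
            echo "$_pnn"
            return 0
        fi
    done
    return 1
}
```

A non-zero return lets the caller distinguish "no IP is currently hosted" from a valid node number, which is the case the retry logic in the related commit has to handle.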
Amitay Isaacs [Thu, 26 Jul 2012 12:01:50 +0000 (22:01 +1000)]
util: Do not lock down memory when running with local daemons
Thanks to Ronnie for highlighting the issue of memory lockdown on AIX.
Fix typo, use getuid and not getpid.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Martin Schwenke [Thu, 5 Jul 2012 06:27:54 +0000 (16:27 +1000)]
statd-callout: Fix a bug in the calculations of $STATE
It is just meant to be even, so divided *and* multiplied by 2. Use
$(( )) to make it more readable.
While touching this code, make the related calculation a bit more
readable too.
Signed-off-by: Martin Schwenke <martin@meltin.net>
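The "divided and multiplied by 2" calculation from the commit above can be written with $(( )) along these lines. This is a minimal sketch: the function name round_down_to_even is hypothetical and the real statd-callout code may derive its input differently.

```shell
# Force a non-negative integer to be even by integer-dividing by 2 and
# multiplying by 2 again, which rounds odd values down.
round_down_to_even ()
{
    echo $(( ($1 / 2) * 2 ))
}

# Illustrative use:
#   STATE=$(round_down_to_even "$some_counter")
```

Integer division discards the remainder, so 7 becomes 6 while 10 stays 10.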
Martin Schwenke [Tue, 24 Jul 2012 01:23:09 +0000 (11:23 +1000)]
Eventscripts: Default route on NAT gateway should have a metric of 10
At the moment routes from 11.routing can fail to be added because they
conflict with the default route added by 11.natgw.
NAT gateway is meant to be a last resort, so routes from 11.routing
should override it.
Signed-off-by: Martin Schwenke <martin@meltin.net>