metze/ctdb/wip.git
8 years agoTest suite: Fix NFS tickle test.
Martin Schwenke [Fri, 27 Aug 2010 01:40:44 +0000 (11:40 +1000)]
Test suite: Fix NFS tickle test.

We now kill ctdbd on the test node instead of disabling it.  This
ensures that the only tickles we see will come from the takeover node.

We also sleep for TickleUpdateInterval before asking ctdb about the
tickles.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: Tweak NFS tickle test.
Martin Schwenke [Thu, 26 Aug 2010 07:56:50 +0000 (17:56 +1000)]
Test suite: Tweak NFS tickle test.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: Fix typos in NFS tickle test.
Martin Schwenke [Thu, 26 Aug 2010 05:50:35 +0000 (15:50 +1000)]
Test suite: Fix typos in NFS tickle test.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: NFS tickle test uses gettickles if events.d/61.nfstickle missing.
Martin Schwenke [Thu, 26 Aug 2010 05:28:19 +0000 (15:28 +1000)]
Test suite: NFS tickle test uses gettickles if events.d/61.nfstickle missing.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoNFS tickles: use addtickle/deltickle instead of shared tickle directory.
Martin Schwenke [Thu, 26 Aug 2010 04:59:59 +0000 (14:59 +1000)]
NFS tickles: use addtickle/deltickle instead of shared tickle directory.

This adds a new function update_tickles() that tracks tickles for a
given port using the new ctdb addtickle/deltickle commands.  This
function is used in events.d/60.nfs to handle NFS tickles.

events.d/61.nfstickle is removed.  The
/proc/sys/net/ipv4/tcp_tw_recycle setup is also moved to
events.d/60.nfs.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: in the test eventscript, run "ctdb" not "$CTDB".
Martin Schwenke [Thu, 26 Aug 2010 04:04:03 +0000 (14:04 +1000)]
Test suite: in the test eventscript, run "ctdb" not "$CTDB".

It is too hard to do anything else...

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoMerge branch 'master' of git://git.samba.org/sahlberg/ctdb
Martin Schwenke [Thu, 26 Aug 2010 01:06:57 +0000 (11:06 +1000)]
Merge branch 'master' of git://git.samba.org/sahlberg/ctdb

9 years agoAdd a configuration database, implemented as a persistent database.
Ronnie Sahlberg [Wed, 25 Aug 2010 01:37:32 +0000 (11:37 +1000)]
Add a configuration database, implemented as a persistent database.

This database can be used, as an option, to store
the public address assignment instead of editing the /etc/ctdb/public-addresses file manually.

This configuration is stored in one record per key, with a key-name of
public-addresses:node#<pnn>
where <pnn> is the node number.

The content of this record is the same syntax as the /etc/ctdb/public-addresses file.

When ctdbd starts, if this key exists and contains data, it is extracted from the database and compared with the normal file /etc/ctdb/public-addresses.

If the content differs, the config database "wins" and is used to overwrite/update the /etc/ctdb/public-addresses file, after which ctdbd is restarted.

The main benefit with this option is that it can be used to update the public address configuration for nodes that are offline/unreachable by updating their configuration in the persistent database.
Once the offline node is available again, it will resync its databases with the rest of the cluster, find out that the config has changed, apply the changes and restart ctdbd automatically.

The command to store the public address configuration for a node into the persistent database is :

ctdb pstore config.tdb public-addresses:node#<pnn> <filename>

where <pnn> is the node# we wish to update the config for, and <filename> is a file containing the new content for that node's public address configuration.
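
The key naming and the "database wins" rule described above can be sketched in Python. This is a hedged illustration, not ctdbd's implementation: the function names are hypothetical, and only the key format, the config.tdb database name and the "ctdb pstore" argument order come from the commit message.

```python
def public_addresses_key(pnn):
    """Key under which the public addresses of node <pnn> are stored."""
    return "public-addresses:node#%d" % pnn


def pstore_command(pnn, filename):
    """argv that stores <filename> as the new configuration of node <pnn>."""
    return ["ctdb", "pstore", "config.tdb", public_addresses_key(pnn), filename]


def reconcile(db_record, file_content):
    """Return (content to use, whether the file is rewritten and ctdbd restarted).

    Assumption: a missing or empty record means "keep using the file".
    """
    if not db_record or db_record == file_content:
        return file_content, False
    # The two differ, so the config database "wins".
    return db_record, True
```

Keeping the comparison separate from the restart makes it easy to see that an unchanged or absent record never triggers a restart.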

9 years agothe tfetch command can be used without the daemon running, so flag it as such.
Ronnie Sahlberg [Wed, 25 Aug 2010 01:10:08 +0000 (11:10 +1000)]
the tfetch command can be used without the daemon running, so flag it as such.

fix a couple of incorrect settings for "auto-all" for a few of the commands as well.

9 years agoadd a new command "ctdb tfetch" that can read a record straight out of the
Ronnie Sahlberg [Wed, 25 Aug 2010 00:53:54 +0000 (10:53 +1000)]
add a new command "ctdb tfetch" that can read a record straight out of the
tdb file.

the command automatically strips the initial ctdb header off the record so it can only be used on ctdb managed tdb files, not on normal tdb files.

9 years agoWhen "ctdb pfetch" creates a new file, make sure we set some initial sane mode bits
Ronnie Sahlberg [Tue, 24 Aug 2010 23:54:37 +0000 (09:54 +1000)]
When "ctdb pfetch" creates a new file, make sure we set some initial sane mode bits

9 years agorun the "init" event before we freeze the databases
Ronnie Sahlberg [Tue, 24 Aug 2010 22:34:35 +0000 (08:34 +1000)]
run the "init" event before we freeze the databases
so that we can read from databases during this event

9 years agochange "ctdb pfetch" to take an optional third argument
Ronnie Sahlberg [Tue, 24 Aug 2010 22:07:03 +0000 (08:07 +1000)]
change "ctdb pfetch" to take an optional third argument
as a file to store the record in.

9 years agoadd a command to write a record to a persistent database
Ronnie Sahlberg [Tue, 24 Aug 2010 03:55:38 +0000 (13:55 +1000)]
add a command to write a record to a persistent database
"ctdb pstore <db> <key> <file containing possibly binary data>"

9 years agoget rid of two compiler warnings
Ronnie Sahlberg [Tue, 24 Aug 2010 03:35:33 +0000 (13:35 +1000)]
get rid of two compiler warnings

9 years agoAdd a command "ctdb pfetch <db> <record>" to read a record from
Ronnie Sahlberg [Tue, 24 Aug 2010 03:34:09 +0000 (13:34 +1000)]
Add a command "ctdb pfetch <db> <record>" to read a record from
a persistent database.

9 years agoMerge branch 'master' of git://git.samba.org/sahlberg/ctdb
Martin Schwenke [Tue, 24 Aug 2010 01:53:29 +0000 (11:53 +1000)]
Merge branch 'master' of git://git.samba.org/sahlberg/ctdb

9 years agomove the directives to build the devel file to the end of the specfile
Ronnie Sahlberg [Mon, 23 Aug 2010 06:00:19 +0000 (16:00 +1000)]
move the directives to build the devel file to the end of the specfile
so that the dependencies are right
or else the dependencies all end up in the devel package and not the main
ctdb package

9 years agoDont set next_interval to 0.
Ronnie Sahlberg [Fri, 20 Aug 2010 04:54:03 +0000 (14:54 +1000)]
Dont set next_interval to 0.
This can cause ctdbd to spin at 100% in the eventsystem,
creating a timed event that will immediately trigger again
and again.

On uniprocessors this causes the eventscript we are actually waiting for to
basically become cpu starved and never complete.

9 years agoctdb ip is very busy.
Ronnie Sahlberg [Fri, 20 Aug 2010 01:38:34 +0000 (11:38 +1000)]
ctdb ip is very busy.

revert the default case back to only showing the ip and node
and only display the extra info if -v verbose output is requested

9 years agoadd a new commandline flag -v to enable verbose output
Ronnie Sahlberg [Fri, 20 Aug 2010 01:28:24 +0000 (11:28 +1000)]
add a new commandline flag -v to enable verbose output

9 years agomake it possible to "ctdb gettickle" to only list tickles for a certain
Ronnie Sahlberg [Fri, 20 Aug 2010 01:25:12 +0000 (11:25 +1000)]
make it possible to "ctdb gettickle" to only list tickles for a certain
port.

Default is to continue to show all tickles, but if a second argument
is given, only tickles for that port will be shown.

9 years agoDont use the deprecated talloc_append_string()
Ronnie Sahlberg [Fri, 20 Aug 2010 01:03:17 +0000 (11:03 +1000)]
Dont use the deprecated talloc_append_string()
Use talloc_strdup_append() instead

9 years agoWe need the deprecated talloc_append_string() for now
Ronnie Sahlberg [Thu, 19 Aug 2010 04:48:19 +0000 (14:48 +1000)]
We need the deprecated talloc_append_string() for now
so set the TALLOC_DEPRECATED symbol to allow use of this call
from ctdb_client.c

9 years agoMerge commit 'rusty/ports-from-1.0.112' into foo
Ronnie Sahlberg [Thu, 19 Aug 2010 03:17:56 +0000 (13:17 +1000)]
Merge commit 'rusty/ports-from-1.0.112' into foo

9 years agoMerge commit 'rusty/vacuum-fix-master'
Ronnie Sahlberg [Thu, 19 Aug 2010 03:16:35 +0000 (13:16 +1000)]
Merge commit 'rusty/vacuum-fix-master'

9 years ago On RHEL, "service nfs stop;service nfs start" and "service nfs restart"
Ronnie Sahlberg [Wed, 18 Aug 2010 21:18:22 +0000 (07:18 +1000)]
On RHEL, "service nfs stop;service nfs start" and "service nfs restart"
sometimes (very rarely) fail to restart the service.

Add a function to restart NFSd on SLES and RHEL-like systems.

If we detect the system is unhealthy due to kNFSd not running,
try to restart the service again with "service nfs restart" and
hope for the best.

CQ1019372

9 years agoAdd machine-readable output for the "ctdb gettickles <ip>" command
Ronnie Sahlberg [Wed, 18 Aug 2010 04:37:16 +0000 (14:37 +1000)]
Add machine-readable output for the "ctdb gettickles <ip>" command

9 years agoRemove the structure ctdb_control_tcp_vnn since this is identical to the structure...
Ronnie Sahlberg [Wed, 18 Aug 2010 02:36:03 +0000 (12:36 +1000)]
Remove the structure ctdb_control_tcp_vnn since this is identical to the structure ctdb_tcp_connection.

Add a new "ctdb deltickle" command to delete tickles from the database.
This can ONLY be used for tickles created by "ctdb addtickle".

Push any "addtickle/deltickle" updates to other nodes every TickleUpdateInterval seconds.

9 years agologging: give a unique logging name to each forked child.
Rusty Russell [Mon, 19 Jul 2010 09:59:09 +0000 (19:29 +0930)]
logging: give a unique logging name to each forked child.

This means we can distinguish which child is logging, esp. via syslog where we have no pid.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agotakeover: prevent crash by avoiding free in traverse on RST timeout
Rusty Russell [Mon, 26 Jul 2010 04:28:48 +0000 (13:58 +0930)]
takeover: prevent crash by avoiding free in traverse on RST timeout

After 5 attempts to send a RST to a client without any response, we free
"con"; this is done during a traverse.  This frees the node we are walking
through (the node is made a child of "con" down in rb_tree.c's
trbt_create_node()).  Valgrind would catch this, as Martin confirmed.

So, we create a temporary parent and reparent onto that; then we free
that parent after the traverse, thus deleting the unwanted nodes.

CQ:S1019041
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
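
The "defer the free until after the traverse" idea from the takeover commit above can be illustrated with a small Python analogue. This is only a sketch of the pattern, not the talloc/rb_tree code: the dictionary, the function name and the list standing in for the temporary parent are all hypothetical. Only the limit of 5 attempts comes from the commit message.

```python
MAX_RST_ATTEMPTS = 5  # from the commit message


def expire_unresponsive(connections):
    """Drop connections that ignored MAX_RST_ATTEMPTS resets.

    connections maps a connection id to its count of unanswered RSTs.
    """
    doomed = []  # plays the role of the temporary parent
    for conn_id, attempts in connections.items():
        if attempts >= MAX_RST_ATTEMPTS:
            doomed.append(conn_id)  # defer: never delete mid-traverse
    for conn_id in doomed:  # "free the parent" once the traverse is done
        del connections[conn_id]
    return doomed
```

Deleting from the dictionary inside the first loop would raise a RuntimeError in Python, which is the well-behaved cousin of the crash the C code hit.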
9 years agoMove NAT gateway firewall rules to recovered|updatenatgw events.
Martin Schwenke [Tue, 6 Jul 2010 07:54:43 +0000 (17:54 +1000)]
Move NAT gateway firewall rules to recovered|updatenatgw events.

The existing code wasn't working as designed in the start event.  It
should work here.

BZ: 62613
Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agovacuum: disabling vacuuming during a freeze
Rusty Russell [Wed, 21 Jul 2010 02:58:04 +0000 (12:28 +0930)]
vacuum: disabling vacuuming during a freeze

We shouldn't even think about vacuuming when we've frozen the database
(which is earlier than when we set CTDB_RECOVERY_ACTIVE)

CQ:S1018154 & S1018349
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agovacuum: fix crash on vacuum abort
Rusty Russell [Mon, 26 Jul 2010 06:38:07 +0000 (16:08 +0930)]
vacuum: fix crash on vacuum abort

Martin Schwenke discovered that 517f05e42f17766b1e8db8f1f4789cbad968e304
("freeze: abort vacuuming when we're going to freeze.") used ctdb_db for
a logging message which is in fact uninitialized, causing a crash (even
if it wasn't actually logged).

Initialize it properly.  Also fix incorrect format in another logging
message introduced in that same change.

CQ:S1019093
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agoTest suite: loosen the getmonmode test.
Martin Schwenke [Wed, 18 Aug 2010 01:25:44 +0000 (11:25 +1000)]
Test suite: loosen the getmonmode test.

Monitoring could be off at the beginning of the test.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agofreeze: abort vacuuming when we're going to freeze.
Rusty Russell [Wed, 21 Jul 2010 02:59:55 +0000 (12:29 +0930)]
freeze: abort vacuuming when we're going to freeze.

There are some reports of freeze timeouts, and it looks like vacuuming might
be the culprit.  So we add code to tell them to abort when a freeze is
going on.

(This is based on the 1.0.112 branch version 517f05e42f, but far
 simpler since tdb is now robust against processes being killed during
 transaction commit)

CQ:S1018154 & S1018349
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agoAdd a new "ctdb addtickle" command to manually add tickles to ctdbd
Ronnie Sahlberg [Wed, 18 Aug 2010 01:09:32 +0000 (11:09 +1000)]
Add a new "ctdb addtickle" command to manually add tickles to ctdbd

This can be used to set ctdbd up to generate a tickle for non-samba
services.
(samba contains code to set tickles up automatically)

9 years agoupdate the example for the new signature of
Ronnie Sahlberg [Wed, 18 Aug 2010 00:18:35 +0000 (10:18 +1000)]
update the example for the new signature of
ctdb_set_message_handler_send()

9 years agoWe use eventloop nesting in a couple of places, notably the sync
Ronnie Sahlberg [Wed, 18 Aug 2010 00:11:59 +0000 (10:11 +1000)]
We use eventloop nesting in a couple of places, notably the sync
parts of the recovery daemon.

Initialize all event contexts to allow nesting

9 years agoMerge commit 'rusty/libctdb-new' into foo
Ronnie Sahlberg [Tue, 17 Aug 2010 23:53:52 +0000 (09:53 +1000)]
Merge commit 'rusty/libctdb-new' into foo

9 years agoevent: Update events to latest Samba version 0.9.8
Rusty Russell [Tue, 17 Aug 2010 23:46:31 +0000 (09:16 +0930)]
event: Update events to latest Samba version 0.9.8

In Samba this is now called "tevent", and while we use the backwards
compatibility wrappers they don't offer EVENT_FD_AUTOCLOSE: that is now
a separate tevent_fd_set_auto_close() function.

This is based on Samba version 7f29f817fa939ef1bbb740584f09e76e2ecd5b06.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agotalloc: update to 2.0.3 version from SAMBA
Rusty Russell [Tue, 17 Aug 2010 23:41:58 +0000 (09:11 +0930)]
talloc: update to 2.0.3 version from SAMBA

This is based on SAMBA as at revision 2de63aa2801a907905b3e05557074af5b896d486.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agoTest suite: Add more timestamping of debugging information.
Martin Schwenke [Mon, 16 Aug 2010 23:55:48 +0000 (09:55 +1000)]
Test suite: Add more timestamping of debugging information.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: print date/time at test completion.
Martin Schwenke [Mon, 16 Aug 2010 23:52:15 +0000 (09:52 +1000)]
Test suite: print date/time at test completion.

This should help with log cross-checking.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoCorrectly set docdir
Volker Lendecke [Fri, 6 Aug 2010 08:12:13 +0000 (10:12 +0200)]
Correctly set docdir

9 years agotdb: workaround starvation problem in locking entire database.
Rusty Russell [Mon, 16 Aug 2010 00:52:21 +0000 (10:22 +0930)]
tdb: workaround starvation problem in locking entire database.

(Imported from SAMBA 11ab43084b10cf53b530cdc3a6036c898b79ca38)

We saw tdb_lockall() take 71 seconds under heavy load; this is because Linux
(at least) doesn't prevent new small locks being obtained while we're waiting
for a big lock.

The workaround is to do divide and conquer using non-blocking chainlocks: if
we get down to a single chain we block.  Using a simple test program where
children did "hold lock for 100ms, sleep for 1 second" the time to do
tdb_lockall() dropped significantly.  There are ln(hashsize) locks taken in
the contended case, but that's slow anyway.

More analysis is given in my blog at http://rusty.ozlabs.org/?p=120

This may also help transactions, though in that case it's the initial
read lock which uses this gradual locking routine; the update-to-write-lock
code is separate and still tries to update in one go.

Even though ABI doesn't change, minor version bumped so behavior change
can be easily detected.

CQ:S1018154
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
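
A rough Python model of the divide-and-conquer locking described in the tdb commit above. It is a sketch under stated assumptions, not the tdb implementation: try_lock and blocking_lock are hypothetical callables standing in for non-blocking and blocking chain-range locks.

```python
def lock_all(start, length, try_lock, blocking_lock):
    """Lock chains [start, start + length) without starving.

    try_lock(start, length) returns True if the whole range was locked
    without waiting; blocking_lock(start, length) waits until it succeeds.
    """
    if try_lock(start, length):
        return
    if length == 1:
        # Down to a single chain: now it is acceptable to block.
        blocking_lock(start, 1)
        return
    half = length // 2
    lock_all(start, half, try_lock, blocking_lock)
    lock_all(start + half, length - half, try_lock, blocking_lock)
```

With one contended chain, only that chain is ever waited on; every other range is taken with a non-blocking attempt.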
9 years agotdb: Fix tdb_check() to work with read-only tdb databases.
Rusty Russell [Mon, 16 Aug 2010 00:43:32 +0000 (10:13 +0930)]
tdb: Fix tdb_check() to work with read-only tdb databases.

(Import from SAMBA bc1c82ea137e1bf6cb55139a666c56ebb2226b23)
The function tdb_lockall() uses F_WRLCK internally, which doesn't work on
a fd opened with O_RDONLY. Use tdb_lockall_read() instead.

9 years agotdb: remove unused variable in tdb_new_database().
Rusty Russell [Mon, 16 Aug 2010 00:42:02 +0000 (10:12 +0930)]
tdb: remove unused variable in tdb_new_database().

(Imported from SAMBA 2eab1d7fdcb54f9ec27431ca4858eb64cb1bd835)

9 years agotdb: fix short write logic in tdb_new_database
Rusty Russell [Mon, 16 Aug 2010 00:50:19 +0000 (10:20 +0930)]
tdb: fix short write logic in tdb_new_database

Commit 207a213c/24fed55d purported to fix the problem of signals during
tdb_new_database (which could cause a spurious short write, hence a failure).
However, the code is wrong: newdb+written is not correct.

Fix this by introducing a general tdb_write_all() and using it here and in
the tracing code.

Cc: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
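
The commit above does not show tdb_write_all() itself. As a hedged illustration of what a "write everything despite short writes" helper generally does, here is a Python sketch; write_all is a hypothetical name and this is not the tdb C code.

```python
import os


def write_all(fd, data):
    """Write all of data to fd, resuming after any short write."""
    view = memoryview(data)
    written = 0
    while written < len(view):
        # os.write may accept fewer bytes than offered; continue from
        # the byte offset already written, not from the start.
        written += os.write(fd, view[written:])
    return written
```

The important detail is that the resume position is a byte offset into the buffer.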
9 years agoTest suite: strengthen function _cluster_is_healthy().
Martin Schwenke [Fri, 13 Aug 2010 07:01:54 +0000 (17:01 +1000)]
Test suite: strengthen function _cluster_is_healthy().

If there's a chance that "ctdb status -Y" can return 0 but print
garbage then this function might return a false positive.

So, we do 2 things:

* Redirect stderr to >/dev/null rather than looking at it.  This
  minimises the chance that we will see garbage.

* Since we need at least 1 good line to decide the cluster is healthy,
  we sanity check each line to ensure it starts with :[0-9].

Signed-off-by: Martin Schwenke <martin@meltin.net>
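
The per-line sanity check from the commit above can be sketched in Python. The real test is a shell function; the name looks_sane is hypothetical, and the assumption here is that the caller passes only the node lines of the "ctdb status -Y" output.

```python
import re

_NODE_LINE = re.compile(r"^:[0-9]")


def looks_sane(lines):
    """True if there is at least one line and every line starts with :[0-9]."""
    lines = [line for line in lines if line.strip()]
    return bool(lines) and all(_NODE_LINE.match(line) for line in lines)
```

Requiring at least one good line means empty output is never mistaken for a healthy cluster.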
9 years agoTest suite: use $CTDB rather than ctdb everywhere in ctdb_test_functions.sh.
Martin Schwenke [Thu, 12 Aug 2010 04:13:07 +0000 (14:13 +1000)]
Test suite: use $CTDB rather than ctdb everywhere in ctdb_test_functions.sh.

Also ensure that $CTDB is set by defaulting it to "ctdb".

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: improve wait_until_node_has_status()
Martin Schwenke [Thu, 12 Aug 2010 03:48:33 +0000 (13:48 +1000)]
Test suite: improve wait_until_node_has_status()

This currently does "onnode any ... wait_until ...".  If ctdbd is
being shut down on a node then that node might be chosen anyway, if it
is asked early enough.  Then we'll loop on that node but our ctdb
client command may always fail, causing a timeout rather than the
expected behaviour.

This puts the loop on the outside of the "onnode any" so that if the
"wrong" node is chosen initially then on the next iteration the choice
can be remade.

Signed-off-by: Martin Schwenke <martin@meltin.net>
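
The restructuring in the commit above, with the loop outside the node choice, looks roughly like this in Python. It is a sketch only: the test suite is shell, and check, pick_node and the one-second polling interval are assumptions made for illustration.

```python
import time


def wait_until_status(check, pick_node, timeout=10, sleep=time.sleep):
    """Poll until check(node) succeeds, re-choosing the node every time."""
    for _ in range(timeout):
        node = pick_node()  # choice is remade on every iteration
        if check(node):
            return True
        sleep(1)
    return False
```

If the first node picked happens to be the one shutting down, the next iteration simply asks a different node instead of looping on the dead one until timeout.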
9 years agoTest suite: make addip test use $CTDB rather than ctdb in debug code.
Martin Schwenke [Wed, 11 Aug 2010 06:55:33 +0000 (16:55 +1000)]
Test suite: make addip test use $CTDB rather than ctdb in debug code.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoCreate a new command "ctdb sync" that is just an alias for "ctdb ipreallocate"
Ronnie Sahlberg [Mon, 9 Aug 2010 23:43:17 +0000 (09:43 +1000)]
Create a new command "ctdb sync" that is just an alias for "ctdb ipreallocate"

9 years agoUpdate a log message to reflect that this does no longer only happen
Ronnie Sahlberg [Mon, 9 Aug 2010 23:41:41 +0000 (09:41 +1000)]
Update a log message to reflect that this does no longer only happen
when trying/failing to ban a node.

9 years agolibctdb: add synchronous message handling and unregister, with tests.
Rusty Russell [Mon, 9 Aug 2010 06:11:32 +0000 (15:41 +0930)]
libctdb: add synchronous message handling and unregister, with tests.

It turns out that we *do* want a separate private arg for the message
handler and the completion callback, so we change that.

We also fix the prototypes of the remove_message functions as we
implement them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
9 years agoMerge remote branch 'martins/master'
Ronnie Sahlberg [Mon, 9 Aug 2010 01:35:38 +0000 (11:35 +1000)]
Merge remote branch 'martins/master'

9 years agoAdd some command-line options to ctdb_diagnostics.
Martin Schwenke [Fri, 6 Aug 2010 01:10:56 +0000 (11:10 +1000)]
Add some command-line options to ctdb_diagnostics.

In some contexts ctdb_diagnostics generates too many errors when it is
run on heterogeneous and machine-configured clusters.  In some
clusters some nodes are expected to be differently configured and also
machine-generated configuration files can have comments containing
timestamps.

This adds some command-line options that can be used to reduce the
number of errors reported:

    -n <nodes>  Comma separated list of nodes to operate on
    -c          Ignore comment lines (starting with '#') in file comparisons
    -w          Ignore whitespace in file comparisons
    --no-ads    Do not use commands that assume an Active Directory Server

The -n option simply allows ctdb_diagnostics to operate on a subset of
nodes, avoiding file comparisons with and data collection on nodes
that are differently configured.  For file comparisons, instead of
showing each file on the current node and then comparing other nodes
to that file, the file from the first (available or requested) node
is shown and then other nodes are compared to that.  That has resulted
in changes in output - that is, ctdb_diagnostics no longer prints
messages referencing the current node.

-c and -w are used to weaken comparisons between configuration files.

--no-ads can be used to avoid running ADS-specific commands if a
cluster uses LDAP (or other non-ADS) configuration.

This also fixes a number of bugs in related code:

* A call to onnode was losing the >> NODE ...  << lines because they
  now go to stderr.  This was changed in onnode long ago but
  ctdb_diagnostics was never updated to match.

* ctdb_diagnostics was counting lines in /etc/ctdb/nodes to determine
  what nodes to operate on.  For some time the nodes file has
  supported syntax that makes this invalid.  "ctdb listnodes -Y" is
  now used to list available nodes.

Signed-off-by: Martin Schwenke <martin@meltin.net>
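
How the -c and -w options from the commit above weaken a file comparison can be sketched in Python. ctdb_diagnostics is a shell script, so this is not its code; the function names are hypothetical, and treating -w as "drop all whitespace" is an assumption.

```python
def normalise(text, ignore_comments=False, ignore_whitespace=False):
    """Return the lines of text that take part in the comparison."""
    lines = text.splitlines()
    if ignore_comments:  # -c: ignore comment lines starting with '#'
        lines = [line for line in lines if not line.startswith("#")]
    if ignore_whitespace:  # -w: ignore whitespace (assumed: all of it)
        lines = ["".join(line.split()) for line in lines]
    return lines


def files_match(a, b, **options):
    """Compare two file contents under the same normalisation options."""
    return normalise(a, **options) == normalise(b, **options)
```

Two machine-generated files that differ only in a timestamp comment and in spacing compare equal once both options are enabled.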
9 years agoupdate the docs that ctdb freeze is no more
Ronnie Sahlberg [Thu, 5 Aug 2010 06:35:37 +0000 (16:35 +1000)]
update the docs that ctdb freeze is no more

9 years ago remove the "ctdb freeze" debugging command
Ronnie Sahlberg [Thu, 5 Aug 2010 06:30:47 +0000 (16:30 +1000)]
 remove the "ctdb freeze" debugging command

9 years agoTest suite: remove unnecessary verbosity from enable/continue tests.
Martin Schwenke [Thu, 5 Aug 2010 06:03:21 +0000 (16:03 +1000)]
Test suite: remove unnecessary verbosity from enable/continue tests.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: Fix typo in continue test.
Martin Schwenke [Thu, 5 Aug 2010 06:01:23 +0000 (16:01 +1000)]
Test suite: Fix typo in continue test.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: weaken ctdb continue/enable tests for non-deterministic IPs.
Martin Schwenke [Thu, 5 Aug 2010 05:58:56 +0000 (15:58 +1000)]
Test suite: weaken ctdb continue/enable tests for non-deterministic IPs.

These tests currently wait for the old IPs to fail back to the test
node.  This isn't guaranteed with DeterministicIPs disabled.

This changes those tests to wait until the test node gets at least 1
IP assigned.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoinitscript: wait until we can ping ctdbd before setting tunables.
Martin Schwenke [Thu, 5 Aug 2010 05:29:40 +0000 (15:29 +1000)]
initscript: wait until we can ping ctdbd before setting tunables.

Currently we do a "sleep 1" after starting and before running
set_ctdb_variables to set the tunables.  This is too arbitrary and
might fail if the system is heavily loaded.  This, for example, could
result in some nodes running with DeterministicIPs and some without,
in which case a different IP allocation algorithm would run depending
on who is the recmaster!

This makes the start function wait until "ctdb ping" succeeds (with 10
second timeout) before trying to run set_ctdb_variables.  If a timeout
occurs then the start function attempts to kill ctdbd before exiting
with a failure.

It also cleans up the status reporting code for Red Hat and SUSE so
that the final status code is reported.  Currently there are cases
where a correct status is prematurely reported before a failure
occurs.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite - make the ctdb_fetch test cope with "Reqid wrap!" messages.
Martin Schwenke [Thu, 5 Aug 2010 03:43:50 +0000 (13:43 +1000)]
Test suite - make the ctdb_fetch test cope with "Reqid wrap!" messages.

Recent CTDB versions notice the wrap and print this message.  The test
needs to cope.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: remove thaw/freeze tests.
Martin Schwenke [Thu, 5 Aug 2010 01:40:05 +0000 (11:40 +1000)]
Test suite: remove thaw/freeze tests.

They test debugging commands that no longer operate as expected.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite - fix addip test.
Martin Schwenke [Wed, 4 Aug 2010 06:08:12 +0000 (16:08 +1000)]
Test suite - fix addip test.

The test currently checks that all existing IPs plus the newly added
IP are on the test node after "ctdb addip" is run.  With
DeterministicIPs enabled, if the new IP is "before" other IPs then the
other IPs may be shuffled by the deterministic IPs modulo algorithm.
This will happen on the 1st recovery after the move.  Sometimes this
recovery happens before we get the list of IPs to check and sometimes
after, so the test is racy.

The fix is to simply check for the presence of the new IP and not
worry about the others.  This reduces whatever value this test
had... but you can't have everything.

Signed-off-by: Martin Schwenke <martin@meltin.net>
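
Why adding an IP "before" the others shuffles them can be seen in a small Python model. The commit above only says "modulo algorithm"; the assumption in this sketch is that the i-th address in sorted order goes to node i mod N, with all nodes healthy. The function name is hypothetical.

```python
import ipaddress


def deterministic_allocation(ips, num_nodes):
    """Map each IP to a node number by its position in sorted order."""
    ordered = sorted(ips, key=ipaddress.ip_address)
    return {ip: i % num_nodes for i, ip in enumerate(ordered)}
```

With two nodes, adding 10.0.0.1 to the set 10.0.0.2, 10.0.0.3, 10.0.0.4 shifts every existing address by one position, so each of them lands on the other node. That is the shuffle the test was racing against.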
9 years agoMerge remote branch 'martins/master'
Martin Schwenke [Wed, 4 Aug 2010 06:05:39 +0000 (16:05 +1000)]
Merge remote branch 'martins/master'

9 years agoTest suite - try to make addip test more reliable and add some debugging.
Martin Schwenke [Wed, 4 Aug 2010 03:16:06 +0000 (13:16 +1000)]
Test suite - try to make addip test more reliable and add some debugging.

This test is failing in some situations.  The "ctdb addip" command
works but the IP never appears in the "ctdb ip" output.

Try restricting the last octet to be between 101-199.  At the moment
addresses like 10.0.2.1 are being chosen and these are often the
address of the host machine in autocluster configurations... so might
cause weirdness.

Also add some debugging if checking for the IP address times out.

Signed-off-by: Martin Schwenke <martin@meltin.net>
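
Restricting the last octet to 101-199, as the commit above describes, might look like this in Python. The test itself is shell; pick_test_ip and its arguments are hypothetical names used only for illustration.

```python
def pick_test_ip(existing_ip, used, low=101, high=199):
    """Return an unused address on the same /24 with last octet in [low, high]."""
    base = existing_ip.rsplit(".", 1)[0]
    for octet in range(low, high + 1):
        candidate = "%s.%d" % (base, octet)
        if candidate not in used:
            return candidate
    return None  # nothing free in the allowed range
```

Staying within 101-199 avoids low addresses such as 10.0.2.1, which the commit notes are often the host machine in autocluster configurations.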
9 years agoTesting: IP allocation simulation - add option to change odds of a failure.
Martin Schwenke [Tue, 3 Aug 2010 01:51:14 +0000 (11:51 +1000)]
Testing: IP allocation simulation - add option to change odds of a failure.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - clean up usage message.
Martin Schwenke [Tue, 3 Aug 2010 01:41:50 +0000 (11:41 +1000)]
Testing: IP allocation simulation - clean up usage message.

Group options better and make the language consistent between options.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - print maximum number of unhealthy nodes.
Martin Schwenke [Tue, 3 Aug 2010 01:37:34 +0000 (11:37 +1000)]
Testing: IP allocation simulation - print maximum number of unhealthy nodes.

This can imply something about imbalance.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - improve help for options.
Martin Schwenke [Tue, 3 Aug 2010 01:36:33 +0000 (11:36 +1000)]
Testing: IP allocation simulation - improve help for options.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - make usage/failure more obvious.
Martin Schwenke [Mon, 2 Aug 2010 05:46:23 +0000 (15:46 +1000)]
Testing: IP allocation simulation - make usage/failure more obvious.

Tweak the usage message for -g option.

Print an error if no node groups defined, instead of curious Python
error.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - rename an example to node_group_extra.py.
Martin Schwenke [Mon, 2 Aug 2010 05:09:13 +0000 (15:09 +1000)]
Testing: IP allocation simulation - rename an example to node_group_extra.py.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - rename an example to node_group_simple.py.
Martin Schwenke [Mon, 2 Aug 2010 05:07:56 +0000 (15:07 +1000)]
Testing: IP allocation simulation - rename an example to node_group_simple.py.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - add general node group example.
Martin Schwenke [Mon, 2 Aug 2010 05:06:39 +0000 (15:06 +1000)]
Testing: IP allocation simulation - add general node group example.

This allows node pool configuration to be specified on the
command-line.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - update options processing in examples.
Martin Schwenke [Mon, 2 Aug 2010 05:01:47 +0000 (15:01 +1000)]
Testing: IP allocation simulation - update options processing in examples.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - Update README.
Martin Schwenke [Mon, 2 Aug 2010 04:58:15 +0000 (14:58 +1000)]
Testing: IP allocation simulation - Update README.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - fix nondeterminism in do_something_random().
Martin Schwenke [Mon, 2 Aug 2010 04:24:00 +0000 (14:24 +1000)]
Testing: IP allocation simulation - fix nondeterminism in do_something_random().

The current code makes random choices from unsorted lists.  This
ensures the lists are sorted.

Also, make the code easier to read by doing the random selection from
lists of PNNs rather than lists of Node objects.

Signed-off-by: Martin Schwenke <martin@meltin.net>
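
The determinism fix from the commit above comes down to sorting before choosing. A minimal Python sketch, with pick_pnn as a hypothetical name rather than the simulation's actual function:

```python
import random


def pick_pnn(pnns, rng):
    """Choose a PNN reproducibly: sort first, then let the seeded RNG pick."""
    return rng.choice(sorted(pnns))
```

For a given seed the result then depends only on which PNNs are present, not on the order in which the container happens to yield them.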
9 years agoTesting: IP allocation simulation - Tweak options handling and Cluster.diff().
Martin Schwenke [Mon, 2 Aug 2010 04:20:12 +0000 (14:20 +1000)]
Testing: IP allocation simulation - Tweak options handling and Cluster.diff().

process_args() must now be called by programs importing this module.
Options are put into global variable "options", which can be
referenced using "ctdb_takeover.options".

Can now pass extra option specifications to process_args().

Remove global variable prev and make it a Cluster object variable.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - update copyright message.
Martin Schwenke [Mon, 2 Aug 2010 04:16:02 +0000 (14:16 +1000)]
Testing: IP allocation simulation - update copyright message.

There's a lot of new code here, so let's make the copyright message
make sense.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - add command line option for random seed.
Martin Schwenke [Sun, 1 Aug 2010 01:53:28 +0000 (11:53 +1000)]
Testing: IP allocation simulation - add command line option for random seed.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation - save some warnings for verbose mode.
Martin Schwenke [Sun, 1 Aug 2010 01:41:52 +0000 (11:41 +1000)]
Testing: IP allocation simulation - save some warnings for verbose mode.

We don't need to see warnings about unallocatable IPs unless we're in
verbose mode.  Can now be run with -n (and without -v or -d) to see
just the statistics.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: IP allocation simulation prints final imbalance in statistics.
Martin Schwenke [Sun, 1 Aug 2010 01:41:02 +0000 (11:41 +1000)]
Testing: IP allocation simulation prints final imbalance in statistics.

This is useful to know.  When things get unbalanced they tend to stay
that way.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: In IP allocation simulation count total number of events.
Martin Schwenke [Sun, 1 Aug 2010 01:39:30 +0000 (11:39 +1000)]
Testing: In IP allocation simulation count total number of events.

This starts at -1 because we always have to do the initial allocation.

No longer print event number for each event by default, only when
verbose is enabled.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: Add imbalance information to IP allocation simulation.
Martin Schwenke [Sun, 1 Aug 2010 01:37:35 +0000 (11:37 +1000)]
Testing: Add imbalance information to IP allocation simulation.

Implement the imbalance calculations.

Also add command-line option to display imbalance for each step.

Signed-off-by: Martin Schwenke <martin@meltin.net>
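
The commit message does not say how imbalance is defined. As an illustration only, the sketch below assumes one plausible metric: the difference between the largest and smallest number of IPs hosted by any node. The function name, the metric, and the example addresses are all assumptions, not taken from the simulation code.

```python
def imbalance(ips_per_node):
    """Return max minus min number of IPs per node (an assumed metric)."""
    counts = [len(ips) for ips in ips_per_node.values()]
    # An empty cluster is treated as perfectly balanced.
    return max(counts) - min(counts) if counts else 0


if __name__ == "__main__":
    # Hypothetical allocation: PNN -> list of public IPs hosted by that node.
    allocation = {
        0: ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
        1: ["10.0.0.4"],
        2: [],
    }
    print(imbalance(allocation))
```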
9 years agoMerge branch 'master' of git://git.samba.org/sahlberg/ctdb
Martin Schwenke [Sat, 31 Jul 2010 10:34:45 +0000 (20:34 +1000)]
Merge branch 'master' of git://git.samba.org/sahlberg/ctdb

9 years agoTesting: Add Python IP allocation simulation.
Martin Schwenke [Fri, 30 Jul 2010 06:45:36 +0000 (16:45 +1000)]
Testing: Add Python IP allocation simulation.

Includes simulation module and example scenarios.  This allows you to
test and perhaps tweak an algorithm that should be the same as the
current CTDB IP reallocation one.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoOptimise 61.nfstickle to write the tickles more efficiently.
Martin Schwenke [Mon, 26 Jul 2010 06:22:59 +0000 (16:22 +1000)]
Optimise 61.nfstickle to write the tickles more efficiently.

Currently the file for each IP address is reopened to append the
details of each source socket.

This optimisation puts all the logic into awk, including the matching
of output lines from netstat.  The source sockets for each
destination IP are written into an array entry and then each array
entry is written to the corresponding file in a single operation.

Signed-off-by: Martin Schwenke <martin@meltin.net>
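
The grouping step described above is done in awk in the eventscript. The same idea can be sketched in Python, shown here only to illustrate the technique: the function name, the input format (pairs of destination IP and source socket), and the one-socket-per-line file layout are assumptions.

```python
import os
from collections import defaultdict


def write_tickles(connections, tickle_dir):
    """Group source sockets by destination IP, then write each file once."""
    by_dest = defaultdict(list)
    for dest_ip, src_socket in connections:
        # Accumulate in memory instead of reopening a file per connection.
        by_dest[dest_ip].append(src_socket)
    for dest_ip, sources in by_dest.items():
        path = os.path.join(tickle_dir, dest_ip)
        with open(path, "w") as f:
            # One write per destination IP (assumed layout: one socket per line).
            f.write("".join(s + "\n" for s in sources))
```

Each file is opened once per destination IP rather than once per connection, which is where the saving comes from when many clients are connected to the same address.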
9 years agoTest suite: handle extra lines in statistics output.
Martin Schwenke [Mon, 7 Jun 2010 02:03:25 +0000 (12:03 +1000)]
Test suite: handle extra lines in statistics output.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: handle change to disconnected node error message.
Martin Schwenke [Mon, 7 Jun 2010 02:29:31 +0000 (12:29 +1000)]
Test suite: handle change to disconnected node error message.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTesting: Add Python IP allocation simulation.
Martin Schwenke [Fri, 30 Jul 2010 06:45:36 +0000 (16:45 +1000)]
Testing: Add Python IP allocation simulation.

Includes simulation module and example scenarios.  This allows you to
test and perhaps tweak an algorithm that should be the same as the
current CTDB IP reallocation one.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoAdd a code-style document.
Ronnie Sahlberg [Fri, 30 Jul 2010 06:37:22 +0000 (16:37 +1000)]
Add a code-style document.

Shamelessly sto^H^H^Hborrowed from samba3.

9 years agoevents/10.interface: we need to mark interfaces as "up" if we don't know how to monit...
Stefan Metzmacher [Fri, 30 Jul 2010 06:09:40 +0000 (08:09 +0200)]
events/10.interface: we need to mark interfaces as "up" if we don't know how to monitor them

metze

9 years agoMerge commit 'rusty/master'
Ronnie Sahlberg [Fri, 30 Jul 2010 06:25:40 +0000 (16:25 +1000)]
Merge commit 'rusty/master'

9 years agoctdb: Fixed use of reserved word "private" in typedefs
Evan Kinney [Thu, 29 Jul 2010 02:48:46 +0000 (22:48 -0400)]
ctdb: Fixed use of reserved word "private" in typedefs

In include/ctdb.h, ctdb_callback_t and ctdb_rrl_callback_t were
defined with a void *private variable. The variable name was
changed to void *private_data to avoid issues encountered in
the Samba autoconf script.

Evan Kinney <evan.kinney@sas.com>

9 years agoOptimise 61.nfstickle to write the tickles more efficiently.
Martin Schwenke [Mon, 26 Jul 2010 06:22:59 +0000 (16:22 +1000)]
Optimise 61.nfstickle to write the tickles more efficiently.

Currently the file for each IP address is reopened to append the
details of each source socket.

This optimisation puts all the logic into awk, including the matching
of output lines from netstat.  The source sockets for each
destination IP are written into an array entry and then each array
entry is written to the corresponding file in a single operation.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: handle extra lines in statistics output.
Martin Schwenke [Mon, 7 Jun 2010 02:03:25 +0000 (12:03 +1000)]
Test suite: handle extra lines in statistics output.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoTest suite: handle change to disconnected node error message.
Martin Schwenke [Mon, 7 Jun 2010 02:29:31 +0000 (12:29 +1000)]
Test suite: handle change to disconnected node error message.

Signed-off-by: Martin Schwenke <martin@meltin.net>