git.samba.org - obnox/samba/samba-obnox.git/log

Ronnie Sahlberg [Thu, 19 Aug 2010 04:48:19 +0000 (14:48 +1000)]

We need the deprecated talloc_append_string() for now,
so set the TALLOC_DEPRECATED symbol to allow use of this call
from ctdb_client.c.

(This used to be ctdb commit 3afa5d945a56952a7f211af068d671945de960e5)

Ronnie Sahlberg [Thu, 19 Aug 2010 03:17:56 +0000 (13:17 +1000)]

Merge commit 'rusty/ports-from-1.0.112' into foo

(This used to be ctdb commit 13e58d92f5f1723e850a82ae030d0ca57e89b1ee)

Ronnie Sahlberg [Thu, 19 Aug 2010 03:16:35 +0000 (13:16 +1000)]

Merge commit 'rusty/vacuum-fix-master'

(This used to be ctdb commit dc301b324d2c14a2425a965c076113c4fe97903e)

Ronnie Sahlberg [Wed, 18 Aug 2010 21:18:22 +0000 (07:18 +1000)]

On RHEL, "service nfs stop; service nfs start" and "service nfs restart"
sometimes (very rarely) fail to restart the service.

Add a function to restart NFSd on SLES and RHEL-like systems.

If we detect that the system is unhealthy because kNFSd is not running,
try restarting the service again with "service nfs restart" and
hope for the best.

CQ1019372

(This used to be ctdb commit 25c4ce7e919f13226219f036bcffd2be76b2f06c)

Ronnie Sahlberg [Wed, 18 Aug 2010 04:37:16 +0000 (14:37 +1000)]

Add machine-readable output for the "ctdb gettickles <ip>" command

(This used to be ctdb commit c3eb53509331045074579468d94ed7e31101bba4)

Ronnie Sahlberg [Wed, 18 Aug 2010 02:36:03 +0000 (12:36 +1000)]

Remove the structure ctdb_control_tcp_vnn since this is identical to the structure ctdb_tcp_connection.

Add a new "ctdb deltickle" command to delete tickles from the database.
This can ONLY be used for tickles created by "ctdb addtickle".

Push any "addtickle/deltickle" updates to other nodes every TickleUpdateInterval seconds.

(This used to be ctdb commit acded034e2f0dcae4c2c9e54e16a001caf23caec)

Rusty Russell [Mon, 19 Jul 2010 09:59:09 +0000 (19:29 +0930)]

logging: give a unique logging name to each forked child.

This means we can distinguish which child is logging, esp. via syslog where we have no pid.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 68b3761a0874429b90731741f0531f76dcfbb081)

Rusty Russell [Mon, 26 Jul 2010 04:28:48 +0000 (13:58 +0930)]

takeover: prevent crash by avoiding free in traverse on RST timeout

After 5 attempts to send a RST to a client without any response, we free
"con"; this is done during a traverse. This frees the node we are walking
through (the node is made a child of "con" down in rb_tree.c's
trbt_create_node()). Valgrind would catch this, as Martin confirmed.

So, we create a temporary parent and reparent onto that; then we free
that parent after the traverse, thus deleting the unwanted nodes.

CQ:S1019041
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 08f7f85477610a4916c1ec866aa467b28f1bbec3)
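
The fix follows a general rule for any container: never free the node you are walking through; mark it during the walk and delete it once the traverse has completed. A minimal Python sketch of that pattern follows. It assumes a plain dict in place of ctdb's rb-tree and a list of doomed keys in place of the temporary talloc parent; `expire_connections` and `MAX_RST_ATTEMPTS` are hypothetical names, not ctdb code.

```python
MAX_RST_ATTEMPTS = 5  # hypothetical, mirrors "after 5 attempts" above


def expire_connections(connections):
    """Count one more RST attempt per connection; drop exhausted ones.

    Deleting from a dict while iterating over it raises RuntimeError in
    Python (in C it would be a use-after-free), so doomed keys are only
    collected during the walk and removed once it has finished.
    """
    doomed = []  # stands in for the temporary talloc parent
    for key, con in connections.items():
        con["attempts"] += 1
        if con["attempts"] >= MAX_RST_ATTEMPTS:
            doomed.append(key)
    for key in doomed:  # traverse finished: now it is safe to delete
        del connections[key]
    return doomed
```

The two-phase structure is the point: the walk only records what to delete, and a separate loop performs the deletions, just as freeing the temporary parent after the traverse does in the C fix.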

Martin Schwenke [Tue, 6 Jul 2010 07:54:43 +0000 (17:54 +1000)]

Move NAT gateway firewall rules to recovered|updatenatgw events.

The existing code wasn't working as designed in the start event. It
should work here.

BZ: 62613
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit aeb70c7e7822854eb87873a5c7783e27e6e72318)

Rusty Russell [Wed, 21 Jul 2010 02:58:04 +0000 (12:28 +0930)]

vacuum: disable vacuuming during a freeze

We shouldn't even think about vacuuming when we've frozen the database
(which happens earlier than when we set CTDB_RECOVERY_ACTIVE).

CQ:S1018154 & S1018349
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit d8df6835a931082af232c4b94f1dede6f16169f9)

Rusty Russell [Mon, 26 Jul 2010 06:38:07 +0000 (16:08 +0930)]

vacuum: fix crash on vacuum abort

Martin Schwenke discovered that 517f05e42f17766b1e8db8f1f4789cbad968e304
("freeze: abort vacuuming when we're going to freeze.") used ctdb_db for
a logging message which is in fact uninitialized, causing a crash (even
if it wasn't actually logged).

Initialize it properly. Also fix incorrect format in another logging
message introduced in that same change.

CQ:S1019093
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 8e518950ba281502318d6300f7a5ec6cdf6b5674)

Rusty Russell [Wed, 21 Jul 2010 02:59:55 +0000 (12:29 +0930)]

freeze: abort vacuuming when we're going to freeze.

There are some reports of freeze timeouts, and it looks like vacuuming might
be the culprit. So we add code to tell any vacuuming in progress to abort
when a freeze is going on.

(This is based on the 1.0.112 branch version 517f05e42f, but is far
simpler since tdb is now robust against processes being killed during
transaction commit.)

CQ:S1018154 & S1018349
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit f5d7dc679501e607c2c83a248a89d3cada9df146)

Ronnie Sahlberg [Wed, 18 Aug 2010 01:09:32 +0000 (11:09 +1000)]

Add a new "ctdb addtickle" command to manually add tickles to ctdbd

This can be used to set up ctdbd to generate tickles for non-Samba
services. (Samba contains code to set up tickles automatically.)

(This used to be ctdb commit 7ef2cddad5326fdcc26138906948342039829495)

Ronnie Sahlberg [Wed, 18 Aug 2010 00:18:35 +0000 (10:18 +1000)]

update the example for the new signature of
ctdb_set_message_handler_send()

(This used to be ctdb commit 6aabe52d5ba629291aa630bc96a2b74dcecc5209)

Ronnie Sahlberg [Wed, 18 Aug 2010 00:11:59 +0000 (10:11 +1000)]

We use eventloop nesting in a couple of places, notably the sync
parts of the recovery daemon.

Initialize all event contexts to allow nesting

(This used to be ctdb commit 5bf6bd5e7f33aabbeb7b9707716ef99cf471e590)

Ronnie Sahlberg [Tue, 17 Aug 2010 23:53:52 +0000 (09:53 +1000)]

Merge commit 'rusty/libctdb-new' into foo

(This used to be ctdb commit 1566d2d23ab698896b3b6a76974a5c7452db4a62)

Rusty Russell [Tue, 17 Aug 2010 23:46:31 +0000 (09:16 +0930)]

event: Update events to latest Samba version 0.9.8

In Samba this is now called "tevent", and while we use the backwards
compatibility wrappers they don't offer EVENT_FD_AUTOCLOSE: that is now
a separate tevent_fd_set_auto_close() function.

This is based on Samba version 7f29f817fa939ef1bbb740584f09e76e2ecd5b06.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 85e5e760cc91eb3157d3a88996ce474491646726)

Rusty Russell [Tue, 17 Aug 2010 23:41:58 +0000 (09:11 +0930)]

talloc: update to 2.0.3 version from SAMBA

This is based on SAMBA as at revision 2de63aa2801a907905b3e05557074af5b896d486.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit cecd93be0a0aab868430dd43f8276bfb4e35f02e)

Volker Lendecke [Fri, 6 Aug 2010 08:12:13 +0000 (10:12 +0200)]

Correctly set docdir

(This used to be ctdb commit a69916d0687309766b0014dc9cee6a966aaa89da)

Rusty Russell [Mon, 16 Aug 2010 00:52:21 +0000 (10:22 +0930)]

tdb: workaround starvation problem in locking entire database.

(Imported from SAMBA 11ab43084b10cf53b530cdc3a6036c898b79ca38)

We saw tdb_lockall() take 71 seconds under heavy load; this is because Linux
(at least) doesn't prevent new small locks being obtained while we're waiting
for a big lock.

The workaround is to divide and conquer using non-blocking chainlocks: if
we get down to a single chain, we block. Using a simple test program where
children did "hold lock for 100ms, sleep for 1 second", the time to do
tdb_lockall() dropped significantly. There are ln(hashsize) locks taken in
the contended case, but that's slow anyway.

More analysis is given in my blog at http://rusty.ozlabs.org/?p=120

This may also help transactions, though in that case it's the initial
read lock which uses this gradual locking routine; the update-to-write-lock
code is separate and still tries to update in one go.

Even though the ABI doesn't change, the minor version is bumped so the
behavior change can be easily detected.

CQ:S1018154
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 9ec0009443a0ac4187ce5212a5143689daa58a02)
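
The divide-and-conquer idea can be sketched in Python. This is an illustration under stated assumptions, not the tdb code: a list of `threading.Lock` objects stands in for the per-chain fcntl byte-range locks, and "try the whole range non-blocking" means taking every lock in the range with `blocking=False` and rolling back on the first failure. As in the description above, it blocks only once the range is down to a single chain. All function names here are hypothetical.

```python
import threading


def make_chain_locks(n):
    """One lock per hash chain (a stand-in for tdb's fcntl chain locks)."""
    return [threading.Lock() for _ in range(n)]


def _try_lock_range(locks, start, end):
    """Take locks[start:end] without blocking: all of them or none."""
    taken = []
    for i in range(start, end):
        if not locks[i].acquire(blocking=False):
            for j in taken:  # roll back the partial acquisition
                locks[j].release()
            return False
        taken.append(i)
    return True


def lock_gradual(locks, start, end):
    """Lock every chain in [start, end), blocking only on a single chain.

    A non-blocking attempt on the whole range comes first; if any chain
    is busy, the range is halved and each half is locked recursively, so
    we wait on one small lock at a time instead of one big one.
    """
    if end - start == 1:
        locks[start].acquire()  # down to one chain: now we block
        return
    if _try_lock_range(locks, start, end):
        return
    mid = (start + end) // 2
    lock_gradual(locks, start, mid)
    lock_gradual(locks, mid, end)


def lockall(locks):
    lock_gradual(locks, 0, len(locks))


def unlockall(locks):
    for lock in locks:
        lock.release()
```

With one busy chain, the uncontended halves are taken quickly and only the busy chain is waited on individually.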

Rusty Russell [Mon, 16 Aug 2010 00:43:32 +0000 (10:13 +0930)]

tdb: Fix tdb_check() to work with read-only tdb databases.

(Imported from SAMBA bc1c82ea137e1bf6cb55139a666c56ebb2226b23)
The function tdb_lockall() uses F_WRLCK internally, which doesn't work on
a fd opened with O_RDONLY. Use tdb_lockall_read() instead.

(This used to be ctdb commit a5db1122ec48d7e7384066848457c850c1a6cf3c)
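
The underlying rule is POSIX rather than tdb-specific: an fcntl() write lock (F_WRLCK) needs a descriptor open for writing and fails with EBADF on an O_RDONLY one, while a read lock is fine. The Python sketch below demonstrates this with `fcntl.lockf`, which on POSIX systems wraps the fcntl() locking calls; `try_lock_readonly` is a hypothetical helper.

```python
import errno
import fcntl
import os
import tempfile


def try_lock_readonly(path, exclusive):
    """Open path O_RDONLY and try a non-blocking lock on the whole file.

    Returns None on success or the errno of the failure.  LOCK_EX is sent
    as an F_WRLCK request and LOCK_SH as an F_RDLCK request.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        fcntl.lockf(fd, mode | fcntl.LOCK_NB)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)  # closing the descriptor releases any lock taken


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print("read lock error:", try_lock_readonly(tmp.name, exclusive=False))
        print("write lock gives EBADF:",
              try_lock_readonly(tmp.name, exclusive=True) == errno.EBADF)
```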

Rusty Russell [Mon, 16 Aug 2010 00:42:02 +0000 (10:12 +0930)]

tdb: remove unused variable in tdb_new_database().

(Imported from SAMBA 2eab1d7fdcb54f9ec27431ca4858eb64cb1bd835)

(This used to be ctdb commit 52a87e608d0406aee9df99f7ac3ce16e834b520b)

Rusty Russell [Mon, 16 Aug 2010 00:50:19 +0000 (10:20 +0930)]

tdb: fix short write logic in tdb_new_database

Commit 207a213c/24fed55d purported to fix the problem of signals during
tdb_new_database (which could cause a spurious short write, hence a failure).
However, the code is wrong: newdb+written is not correct.

Fix this by introducing a general tdb_write_all() and using it here and in
the tracing code.

Cc: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 27ba0e5a6681063225df7244a85aa304c51c6948)
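
A write-all loop of the kind described is short. This Python sketch borrows the name `write_all` from the commit message, but the body is an illustration, not the tdb function: it keeps calling `os.write()` until every byte is out, resuming from a byte offset after each short write.

```python
import os


def write_all(fd, data):
    """Write every byte of data to fd, retrying after short writes.

    os.write() may accept fewer bytes than offered, so the loop resumes
    from the first unwritten byte.  `written` counts bytes, and slicing a
    memoryview avoids copying the remaining data on every retry.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        n = os.write(fd, view[written:])
        if n == 0:
            raise OSError("write() made no progress")
        written += n
    return written
```

The key detail is that the resume point is a byte offset into the buffer being written.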

Ronnie Sahlberg [Mon, 9 Aug 2010 23:43:17 +0000 (09:43 +1000)]

Create a new command "ctdb sync" that is just an alias for "ctdb ipreallocate"

(This used to be ctdb commit eededd592c92c59b435f0046989b2327fcc280b1)

Ronnie Sahlberg [Mon, 9 Aug 2010 23:41:41 +0000 (09:41 +1000)]

Update a log message to reflect that this no longer happens only
when trying/failing to ban a node.

(This used to be ctdb commit dc6b143c4785449e8c4ef7a46bf16adba750ab56)

Rusty Russell [Mon, 9 Aug 2010 06:11:32 +0000 (15:41 +0930)]

libctdb: add synchronous message handling and unregister, with tests.

It turns out that we *do* want a separate private arg for the message
handler and the completion callback, so we change that.

We also fix the prototypes of the remove_message functions as we
implement them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 332375246eccd95da626f434f6d49dd9458a9787)

Ronnie Sahlberg [Mon, 9 Aug 2010 01:35:38 +0000 (11:35 +1000)]

Merge remote branch 'martins/master'

(This used to be ctdb commit 9ca09ee9129b787428a2ceac9731b12166dc8718)

Martin Schwenke [Fri, 6 Aug 2010 01:10:56 +0000 (11:10 +1000)]

Add some command-line options to ctdb_diagnostics.

In some contexts ctdb_diagnostics generates too many errors when it is
run on heterogeneous and machine-configured clusters.  In some
clusters some nodes are expected to be differently configured, and
machine-generated configuration files can have comments containing
timestamps.

This adds some command-line options that can be used to reduce the
number of errors reported:

    -n <nodes>  Comma separated list of nodes to operate on
    -c          Ignore comment lines (starting with '#') in file comparisons
    -w          Ignore whitespace in file comparisons
    --no-ads    Do not use commands that assume an Active Directory Server

The -n option simply allows ctdb_diagnostics to operate on a subset of
nodes, avoiding file comparisons with and data collection on nodes
that are differently configured.  For file comparisons, instead of
showing each file on the current node and then comparing other nodes
to that file, the file from the first (available or requested) node
is shown and then other nodes are compared to that.  This changes the
output: ctdb_diagnostics no longer prints messages referencing the
current node.

-c and -w are used to weaken comparisons between configuration files.

--no-ads can be used to avoid running ADS-specific commands if a
cluster uses LDAP (or other non-ADS) configuration.

This also fixes a number of bugs in related code:

* A call to onnode was losing the >> NODE ...  << lines because they
  now go to stderr.  This was changed in onnode long ago but
  ctdb_diagnostics was never updated to match.

* ctdb_diagnostics was counting lines in /etc/ctdb/nodes to determine
  what nodes to operate on.  For some time the nodes file has
  supported syntax that makes this invalid.  "ctdb listnodes -Y" is
  now used to list available nodes.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 36c8244a0f68c7c9bbee40982f230e9d14d3c0ea)
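
The -c and -w options amount to normalising each file before comparing. The sketch below is a hypothetical Python rendering (ctdb_diagnostics itself is a shell script, and `normalise`/`files_differ` are made-up names): -c drops lines starting with '#', and -w strips all whitespace, in the spirit of `diff -w`.

```python
import re


def normalise(lines, ignore_comments=False, ignore_whitespace=False):
    """Return the lines of a config file in comparable form.

    ignore_comments (-c) drops lines starting with '#', such as
    machine-generated timestamp comments; ignore_whitespace (-w) removes
    all whitespace within each line.
    """
    out = []
    for line in lines:
        if ignore_comments and line.startswith("#"):
            continue
        if ignore_whitespace:
            line = re.sub(r"\s+", "", line)
        out.append(line)
    return out


def files_differ(lines_a, lines_b, **options):
    """True if the two files still differ after normalisation."""
    return normalise(lines_a, **options) != normalise(lines_b, **options)
```

Two copies of a generated smb.conf that differ only in a timestamp comment and in spacing around "=" compare equal only when both options are given.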

Ronnie Sahlberg [Thu, 5 Aug 2010 06:35:37 +0000 (16:35 +1000)]

Update the docs to reflect that "ctdb freeze" is no more

(This used to be ctdb commit 79ef9909dfa0904d789c69eb6b9c80e8908a1100)

Ronnie Sahlberg [Thu, 5 Aug 2010 06:30:47 +0000 (16:30 +1000)]

remove the "ctdb freeze" debugging command

(This used to be ctdb commit bd005b987255eb65cd3826dce984281ee757daf6)

Martin Schwenke [Thu, 5 Aug 2010 06:03:21 +0000 (16:03 +1000)]

Test suite: remove unnecessary verbosity from enable/continue tests.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 69c95b2a42f55b80cd8d91a90ab55166f964163b)

Martin Schwenke [Thu, 5 Aug 2010 06:01:23 +0000 (16:01 +1000)]

Test suite: Fix typo in continue test.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c2bce140da7c4b118394ee77bb9d0348d27e7e95)

Martin Schwenke [Thu, 5 Aug 2010 05:58:56 +0000 (15:58 +1000)]

Test suite: weaken ctdb continue/enable tests for non-deterministic IPs.

These tests currently wait for the old IPs to fail back to the test
node. This isn't guaranteed with DeterministicIPs disabled.

This changes those tests to wait until the test node gets at least 1
IP assigned.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e9b3f5b1b51d541a911a27eb4348b368f28d185e)

Martin Schwenke [Thu, 5 Aug 2010 05:29:40 +0000 (15:29 +1000)]

initscript: wait until we can ping ctdbd before setting tunables.

Currently we do a "sleep 1" after starting and before running
set_ctdb_variables to set the tunables.  This is too arbitrary and
might fail if the system is heavily loaded.  This, for example, could
result in some nodes running with DeterministicIPs and some without,
in which case a different IP allocation algorithm would run depending
on who is the recmaster!

This makes the start function wait until "ctdb ping" succeeds (with 10
second timeout) before trying to run set_ctdb_variables.  If a timeout
occurs then the start function attempts to kill ctdbd before exiting
with a failure.

It also cleans up the status reporting code for Red Hat and SUSE so
that the final status code is reported.  Currently there are cases
where a correct status is prematurely reported before a failure
occurs.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cdcd05662a30b51caaeeab4ac44138cac2474e0a)
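
The new start logic is a poll-with-timeout loop. Here is a hedged Python sketch of it. The initscript itself is shell; `wait_until` and `ctdb_ping_ok` are hypothetical helpers, and only the `ctdb ping` command and the 10-second timeout come from the commit.

```python
import subprocess
import time


def wait_until(check, timeout=10.0, interval=1.0):
    """Call check() until it returns True; give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval, remaining))


def ctdb_ping_ok():
    """True if "ctdb ping" exits with status 0 (False if ctdb is absent)."""
    try:
        result = subprocess.run(["ctdb", "ping"],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        return False
    return result.returncode == 0
```

A start function would call `wait_until(ctdb_ping_ok, timeout=10)` and, on `False`, kill ctdbd and exit with a failure instead of setting tunables on a daemon that is not answering.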

Martin Schwenke [Thu, 5 Aug 2010 03:43:50 +0000 (13:43 +1000)]

Test suite - make the ctdb_fetch test cope with "Reqid wrap!" messages.

Recent CTDB versions notice the wrap and print this message. The test
needs to cope.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b93b60ec96d02ce4f54921e85a5c5554d1fc0c55)

Martin Schwenke [Thu, 5 Aug 2010 01:40:05 +0000 (11:40 +1000)]

Test suite: remove thaw/freeze tests.

They test debugging commands that no longer operate as expected.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d33fa4d6557aab1938049f194c2de55f2c395bd2)

Martin Schwenke [Wed, 4 Aug 2010 06:08:12 +0000 (16:08 +1000)]

Test suite - fix addip test.

The test currently checks that all existing IPs plus the newly added
IP are on the test node after "ctdb addip" is run.  With
DeterministicIPs enabled, if the new IP is "before" other IPs then the
other IPs may be shuffled by the deterministic IPs modulo algorithm.
This will happen on the 1st recovery after the move.  Sometimes this
recovery happens before we get the list of IPs to check and sometimes
after, so the test is racy.

The fix is to simply check for the presence of the new IP and not
worry about the others.  This reduces whatever value this test
had... but you can't have everything.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1ef7c8e64c7a39330be09ae4d00b70238133e0b5)

Martin Schwenke [Wed, 4 Aug 2010 06:05:39 +0000 (16:05 +1000)]

Merge remote branch 'martins/master'

(This used to be ctdb commit 5d9e4b6ee7d2b5290a74e7be79bdf51a43b72f43)

Martin Schwenke [Wed, 4 Aug 2010 03:16:06 +0000 (13:16 +1000)]

Test suite - try to make addip test more reliable and add some debugging.

This test is failing in some situations. The "ctdb addip" command
works but the IP never appears in the "ctdb ip" output.

Try restricting the last octet to be between 101 and 199. At the moment
addresses like 10.0.2.1 are being chosen, and these are often the
address of the host machine in autocluster configurations... so they
might cause weirdness.

Also add some debugging if checking for the IP address times out.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ae52cb63756bc60de8d32e01bac5d70975a1c7a0)

Martin Schwenke [Tue, 3 Aug 2010 01:51:14 +0000 (11:51 +1000)]

Testing: IP allocation simulation - add option to change odds of a failure.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b2a2e301025d7fbfe5eeaac436693cde6d404490)

Martin Schwenke [Tue, 3 Aug 2010 01:41:50 +0000 (11:41 +1000)]

Testing: IP allocation simulation - clean up usage message.

Group options better and make the language consistent between options.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit bc38c17e4115fae00c89d00537fdcfe621111b37)

Martin Schwenke [Tue, 3 Aug 2010 01:37:34 +0000 (11:37 +1000)]

Testing: IP allocation simulation - print maximum number of unhealthy nodes.

This can imply something about imbalance.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ecb80e2b6be9326708d1fc87ad3028c6836d5858)

Martin Schwenke [Tue, 3 Aug 2010 01:36:33 +0000 (11:36 +1000)]

Testing: IP allocation simulation - improve help for options.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 058501b92f602e7d2240d1cb08ed78a807564c48)

Martin Schwenke [Mon, 2 Aug 2010 05:46:23 +0000 (15:46 +1000)]

Testing: IP allocation simulation - make usage/failure more obvious.

Tweak the usage message for -g option.

Print an error if no node groups are defined, instead of a curious
Python error.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 8b883eb9346b8278d268e35b56ac680cd9526b97)

Martin Schwenke [Mon, 2 Aug 2010 05:09:13 +0000 (15:09 +1000)]

Testing: IP allocation simulation - rename an example to node_group_extra.py.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 974f849df0aca2cfedb38fa815894955e32803a8)

Martin Schwenke [Mon, 2 Aug 2010 05:07:56 +0000 (15:07 +1000)]

Testing: IP allocation simulation - rename an example to node_group_simple.py.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0a2a5602233a8208e2729192e50d816faed0151a)

Martin Schwenke [Mon, 2 Aug 2010 05:06:39 +0000 (15:06 +1000)]

Testing: IP allocation simulation - add general node group example.

This allows node pool configuration to be specified on the
command-line.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d382d9023928f75f360a115ae1e9c1036423416e)

Martin Schwenke [Mon, 2 Aug 2010 05:01:47 +0000 (15:01 +1000)]

Testing: IP allocation simulation - update options processing in examples.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a65ca1a71386f40080dd553756f3600d3b20d523)

Martin Schwenke [Mon, 2 Aug 2010 04:58:15 +0000 (14:58 +1000)]

Testing: IP allocation simulation - Update README.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ed64b7f2b3cd920bb0f5dfd7f64ed0afc0b99fc1)

Martin Schwenke [Mon, 2 Aug 2010 04:24:00 +0000 (14:24 +1000)]

Testing: IP allocation simulation - fix nondeterminism in do_something_random().

The current code makes random choices from unsorted lists. This
ensures the lists are sorted.

Also, make the code easier to read by doing the random selection from
lists of PNNs rather than lists of Node objects.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a01244499dc3567f5aa934b1864b9bc183a6c242)
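
Choosing from an unsorted list is only reproducible if the list's order is itself reproducible, and a list built from a set or dict of objects that hash by identity may come out in a different order on each run even with the same seed. A minimal sketch of the fix, assuming a simplified `Node` class (hypothetical, not the simulation's real one): sort the integer PNNs, then let the seeded generator choose.

```python
import random


class Node:
    """Hypothetical, simplified stand-in for the simulation's node class."""

    def __init__(self, pnn, healthy=True):
        self.pnn = pnn
        self.healthy = healthy


def pick_healthy_pnn(rng, nodes):
    """Pick the PNN of a healthy node using only the seeded rng.

    Sorting the integer PNNs makes the candidate order independent of how
    the nodes collection happens to iterate.
    """
    pnns = sorted(node.pnn for node in nodes if node.healthy)
    return rng.choice(pnns) if pnns else None
```

With this, the same seed yields the same choice whether the nodes arrive as a set, a list, or in reverse order.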

Martin Schwenke [Mon, 2 Aug 2010 04:20:12 +0000 (14:20 +1000)]

Testing: IP allocation simulation - Tweak options handling and Cluster.diff().

process_args() must now be called by programs importing this module.
Options are put into the global variable "options", which can be
referenced using "ctdb_takeover.options".

Extra option specifications can now be passed to process_args().

Remove global variable prev and make it a Cluster object variable.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a32298e7bc819694518e859f100f9444ff5663cd)

Martin Schwenke [Mon, 2 Aug 2010 04:16:02 +0000 (14:16 +1000)]

Testing: IP allocation simulation - update copyright message.

There's a lot of new code here, so let's make the copyright message
make sense.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e6e56e5989def6704b116e806c1f261c7f3fc03f)

Martin Schwenke [Sun, 1 Aug 2010 01:53:28 +0000 (11:53 +1000)]

Testing: IP allocation simulation - add command line option for random seed.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 8362029c7cfc1041e46ee2116aa5cade6edce435)

Martin Schwenke [Sun, 1 Aug 2010 01:41:52 +0000 (11:41 +1000)]

Testing: IP allocation simulation - save some warnings for verbose mode.

We don't need to see warnings about unallocatable IPs unless we're in
verbose mode. It can now be run with -n (and without -v or -d) to see
just the statistics.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 55370936ac5def5ebf138910388a2ddc2df9c20f)

Martin Schwenke [Sun, 1 Aug 2010 01:41:02 +0000 (11:41 +1000)]

Testing: IP allocation simulation prints final imbalance in statistics.

This is useful to know. When things get unbalanced they tend to stay
that way.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a40faa2096effc2657ac05b729f3259bbb2e1fed)

Martin Schwenke [Sun, 1 Aug 2010 01:39:30 +0000 (11:39 +1000)]

Testing: In IP allocation simulation count total number of events.

This starts at -1 because we always have to do the initial allocation.

No longer print event number for each event by default, only when
verbose is enabled.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c9a761726d141bcaa8ba7851150f71a8130b473a)

Martin Schwenke [Sun, 1 Aug 2010 01:37:35 +0000 (11:37 +1000)]

Testing: Add imbalance information to IP allocation simulation.

Implement the imbalance calculations.

Also add command-line option to display imbalance for each step.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f50a12f6d06ed67efadd2a892d62c01e67310e7d)
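
The commit does not define "imbalance". Assuming it means the gap between the most- and least-loaded nodes in terms of assigned public IPs (a simplification for illustration only, not necessarily the simulation's formula, which may restrict the comparison to nodes able to host a given IP), the calculation is short:

```python
def imbalance(ips_per_node):
    """Hypothetical imbalance measure: max minus min IPs per node.

    ips_per_node maps a PNN to the number of public IPs it hosts.
    An empty cluster is treated as perfectly balanced.
    """
    if not ips_per_node:
        return 0
    counts = ips_per_node.values()
    return max(counts) - min(counts)
```

Under this measure a perfectly balanced cluster scores 0, and a score that stays high across many simulated events is the "stays unbalanced" behaviour the statistics are meant to expose.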

Martin Schwenke [Sat, 31 Jul 2010 10:34:45 +0000 (20:34 +1000)]

Merge branch 'master' of git://git.samba.org/sahlberg/ctdb

(This used to be ctdb commit 12e07ccb4b57aca8b5b1b38ce711c7755c67f106)

Martin Schwenke [Fri, 30 Jul 2010 06:45:36 +0000 (16:45 +1000)]

Testing: Add Python IP allocation simulation.

Includes simulation module and example scenarios. This allows you to
test and perhaps tweak an algorithm that should be the same as the
current CTDB IP reallocation one.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d148e7a7cb840febbdf56ba2e39c314cc2d7ac24)

Martin Schwenke [Mon, 26 Jul 2010 06:22:59 +0000 (16:22 +1000)]

Optimise 61.nfstickle to write the tickles more efficiently.

Currently the file for each IP address is reopened to append the
details of each source socket.

This optimisation puts all the logic into awk, including the matching
of output lines from netstat. The source sockets for each destination
IP are written into an array entry and then each array entry is
written to the corresponding file in a single operation.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 6549e9b01538998d51a5f72bfc569776d232b024)
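
The optimisation is a group-then-write pattern. Here is a rough Python equivalent of the awk logic. The `netstat -tn` column layout, the filter on NFS port 2049, and the helper names are assumptions for illustration, not taken from 61.nfstickle.

```python
import collections
import os

NFS_PORT = "2049"  # assumption: only sockets on the NFS server port count


def collect_tickles(netstat_lines):
    """Map each local (server) IP to its client sockets, from `netstat -tn`."""
    by_ip = collections.defaultdict(list)
    for line in netstat_lines:
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        ip, _, port = fields[3].rpartition(":")
        if port == NFS_PORT:
            by_ip[ip].append(fields[4])
    return by_ip


def write_tickles(by_ip, statedir):
    """Append each IP's sockets to statedir/<ip> with one write per file."""
    for ip, sockets in by_ip.items():
        with open(os.path.join(statedir, ip), "a") as f:
            f.write("".join(s + "\n" for s in sockets))
```

Each per-IP file is now opened once per run instead of once per socket, which is the saving the commit describes.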

Martin Schwenke [Mon, 7 Jun 2010 02:03:25 +0000 (12:03 +1000)]

Test suite: handle extra lines in statistics output.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b2362cc7773bb08c7dfdaf2c87d4b59460686659)

Martin Schwenke [Mon, 7 Jun 2010 02:29:31 +0000 (12:29 +1000)]

Test suite: handle change to disconnected node error message.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 20ea31e4ed893eb58cb2efa0b6fb13bcf4031918)

Martin Schwenke [Fri, 30 Jul 2010 06:45:36 +0000 (16:45 +1000)]

Ronnie Sahlberg [Fri, 30 Jul 2010 06:37:22 +0000 (16:37 +1000)]

Add a code-style document.

Shamelessly sto^H^H^Hborrowed from samba3.

(This used to be ctdb commit 8024d9e2d589bfe4dee1cb9a79bec663738cb7fa)

Stefan Metzmacher [Fri, 30 Jul 2010 06:09:40 +0000 (08:09 +0200)]

events/10.interface: we need to mark interfaces as "up" if we don't know how to monitor them

metze

(This used to be ctdb commit 1e08d1578d1960fcfc5fdd85492fbd6d194e5e94)

Ronnie Sahlberg [Fri, 30 Jul 2010 06:25:40 +0000 (16:25 +1000)]

Merge commit 'rusty/master'

(This used to be ctdb commit b4391c00476cde74101736986dfcd2be6c959edc)

Evan Kinney [Thu, 29 Jul 2010 02:48:46 +0000 (22:48 -0400)]

ctdb: Fixed use of reserved word "private" in typedefs

In include/ctdb.h, ctdb_callback_t and ctdb_rrl_callback_t were
defined with a void *private variable. The variable name was
changed to void *private_data to avoid issues encountered in
the Samba autoconf script.

Evan Kinney <evan.kinney@sas.com>

(This used to be ctdb commit 1f453aa4b5e749468c7788afac09c6f0900ea18f)

Martin Schwenke [Mon, 26 Jul 2010 06:22:59 +0000 (16:22 +1000)]

Martin Schwenke [Mon, 7 Jun 2010 02:03:25 +0000 (12:03 +1000)]

Test suite: handle extra lines in statistics output.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a476a56da2219c1047081032595c045f65f8ad3f)

Martin Schwenke [Mon, 7 Jun 2010 02:29:31 +0000 (12:29 +1000)]

Test suite: handle change to disconnected node error message.

Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d75d7b49cf729bace820b3225e5c6d069bbcbc53)

Stefan Metzmacher [Mon, 12 Jul 2010 12:11:41 +0000 (14:11 +0200)]

config/interface_modify.sh: do the echo before running the script

metze
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit bb1d2bd31073304fc203868517144f61d12b7fc2)

Stefan Metzmacher [Mon, 12 Jul 2010 12:05:51 +0000 (14:05 +0200)]

config/interface_modify.sh: before calling a script check if it exists and is executable

For non-bash shells, $_s_script might end with '/*'.

We do the workaround this way because it makes sense to check
that a script is executable before trying to execute it.

metze

[ This actually applies to any shell -- Rusty Russell ]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e665cfde03fc9ec2264e99512ed5470872a2fd04)
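
The pitfall is easy to reproduce: in POSIX sh (and in bash without `nullglob`), a loop over "$dir"/* sees the unexpanded pattern itself when the directory is empty. A hedged Python sketch of the defensive check follows; the helper names are hypothetical and the real interface_modify.sh is shell. It treats every candidate as suspect until it is confirmed to be an executable regular file.

```python
import glob
import os


def is_runnable(path):
    """True only for an existing regular file that we may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def runnable_scripts(directory):
    """Executable scripts in directory, sorted by name."""
    candidates = glob.glob(os.path.join(directory, "*"))
    return sorted(p for p in candidates if is_runnable(p))
```

In shell, the equivalent guard would be something like a `[ -f "$s" ] && [ -x "$s" ]` test before running each script.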

Rusty Russell [Mon, 12 Jul 2010 05:41:42 +0000 (15:11 +0930)]

config: wrap iptables in flock to avoid concurrency.

When doing a releaseip event, we do them in parallel for all the separate
IPs. This creates a problem for iptables, which isn't reentrant, giving
the strange message:
iptables encountered unknown error "18446744073709551615" while initializing table "filter"

The worst possible symptom of this is that releaseip won't remove the rule
which prevents us listening to clients during releaseip, and the node will be
healthy but non-responsive.

The simple workaround is to flock-wrap iptables. Better would be to rework
the code so we didn't need to use iptables in these paths.

CQ:S1018353
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 72d6914ee913272312d7b68f1be5ad05ad06587d)
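
Serialising a command with a lock file looks like this in Python. This is a sketch: `run_serialised` and the lock path are hypothetical, and the commit itself wraps iptables with flock in the shell scripts. From shell, util-linux's flock(1) does the same job as `flock <lockfile> iptables ...`.

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/ctdb-iptables.lock"  # hypothetical lock file location


def run_serialised(cmd, lock_path=LOCK_PATH):
    """Run cmd while holding an exclusive flock() on lock_path.

    Concurrent callers (e.g. parallel releaseip events) queue up on the
    lock, so at most one of them runs cmd at a time.
    """
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we hold it
        try:
            return subprocess.run(cmd).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because every caller opens the lock file separately, each gets its own open file description, so flock() serialises callers in different processes and in different threads alike.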

Rusty Russell [Mon, 12 Jul 2010 06:38:37 +0000 (16:08 +0930)]

ctdb: fix crash on "ctdb scriptstatus --events=releaseip"

Martin accidentally typed this instead of "ctdb scriptstatus releaseip"
and it crashes.

CQ:S1018859
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 70877b2e7f8fd0d46899bbeca2c6caad6e6e6820)

Rusty Russell [Fri, 2 Jul 2010 03:21:08 +0000 (13:21 +1000)]

version: generate RPM version from git

This unifies our RPM version handling, based on tags.
1) Tags are of form ctdb-<version>.
2) The first <version> starts with .1.
3) Devel versions end with .0.<patchnum>.<checksum>.devel to reliably
identify them.

This means that devel versions will correctly supersede releases and earlier
devels, but new releases will correctly supersede older devel RPMs.

Making a new release is as simple as creating a new git tag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 44009e02a661d4a1e14246f650974fc4ed7a07c9)

Rusty Russell [Thu, 1 Jul 2010 13:08:49 +0000 (23:08 +1000)]

Report client for queue errors.

We've been seeing "Invalid packet of length 0" errors, but we don't know
what is sending them. Add a name for each queue, and print nread.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e6cf0e8f14f4263fbd8b995418909199924827e9)

Rusty Russell [Thu, 1 Jul 2010 08:33:18 +0000 (18:33 +1000)]

tdb: improve logging

When tdb throws an error, we didn't report the name of the tdb; we should.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit cfea357c9b2142c8cd8cac1ee712d40b188793e1)

Rusty Russell [Thu, 1 Jul 2010 11:46:55 +0000 (21:46 +1000)]

ctdb_freeze: extend db priority hack to cover serverid.tdb deadlock.

We discovered that recent smbd locks the serverid tdb while
holding a lock on another tdb (locking.tdb):
  7: POSIX  ADVISORY  WRITE smbd-2224318 locking.tdb.0 10600 10600
  22: -> POSIX  ADVISORY  READ  smbd-2224318 serverid.tdb.0 26580 26580

The result is a deadlock against the ctdb_freeze code called for
recovery.  We extend the "notify" workaround to this case, too.

BZ:65158
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit dfdaa446cf256854ff6d267dceeb86fbee8bb188)

commit | commitdiff | tree

Rusty Russell [Tue, 22 Jun 2010 13:25:20 +0000 (22:55 +0930)]

speed startup: with --sloppy-start, cut initial election timeout to 1/2 second.

Seconds between ctdbd first log message and node healthy:
BEFORE: 4.03
AFTER: 2.02

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 8f17731dea4287d4f9b21dc58c1bdf26c8a0e628)

commit | commitdiff | tree

Rusty Russell [Tue, 22 Jun 2010 13:22:34 +0000 (22:52 +0930)]

speed startup: add --sloppy-start.

The extra recovery interval wait was introduced in 821333afb458, but no
explanation was provided in that commit message. Nonetheless, when starting
the entire cluster for the first time, it should be safe to skip it.

We use the command-line argument --sloppy-start, whose name should
discourage people from using it outside testing.

Seconds between ctdbd first log message and node healthy:
BEFORE: 16.10
AFTER: 4.03

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 509e2e89ae233a0e91998d95267bf62f296a73cd)

commit | commitdiff | tree

Rusty Russell [Tue, 22 Jun 2010 13:20:45 +0000 (22:50 +0930)]

speed startup: run startup immediately after recovery finished.

Seconds between ctdbd first log message and node healthy:
BEFORE: 17.08
AFTER: 16.10

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 372201d418f041d69646793105f6898ab12a7d91)

commit | commitdiff | tree

Rusty Russell [Tue, 22 Jun 2010 13:20:35 +0000 (22:50 +0930)]

speed startup: don't wait a full recovery interval if we've already waited

We currently sleep for one second, whether or not we've already slept.
Change this to sleep for the remainder of the second, if at all.

Seconds between ctdbd first log message and node healthy:
BEFORE: 18.09
AFTER: 17.08

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit fde760b5f39c77172308a583da4c2443b71541c9)
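ctdb itself does this waiting through its event loop, which this log doesn't show. The stand-alone sketch below illustrates the same idea with POSIX clock_gettime() and nanosleep(): measure how long has passed since the interval began and sleep only for what is left, if anything. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds still to wait before the next check; 0 if already due. */
double remaining_interval(double interval, double elapsed)
{
	return elapsed >= interval ? 0.0 : interval - elapsed;
}

/* Seconds elapsed since *start on the monotonic clock. */
double seconds_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (double)(now.tv_sec - start->tv_sec) +
	       (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Sleep only for whatever is left of the interval that began at *start,
 * instead of unconditionally sleeping a whole interval. */
void sleep_rest_of_interval(const struct timespec *start, double interval)
{
	double rest = remaining_interval(interval, seconds_since(start));
	struct timespec ts;

	if (rest <= 0.0)
		return;
	ts.tv_sec = (time_t)rest;
	ts.tv_nsec = (long)((rest - (double)ts.tv_sec) * 1e9);
	if (ts.tv_nsec > 999999999L)
		ts.tv_nsec = 999999999L;	/* guard against rounding */
	nanosleep(&ts, NULL);
}
```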

commit | commitdiff | tree

Rusty Russell [Tue, 22 Jun 2010 13:20:07 +0000 (22:50 +0930)]

speed startup: immediately run first monitor event after startup.

Once we've done a startup, we need to run a monitor event successfully
to be marked as healthy. Rather than wait the usual 5 seconds, run it
immediately (which will then reset next_interval to 5 seconds).

Seconds between ctdbd first log message and node healthy:
BEFORE: 23.58
AFTER: 18.09

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit c8651494febcb1c9e558b2002e2a72c2bf547c06)
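The scheduling change can be pictured as a tiny state machine: the first monitor event after startup gets a zero delay, and every later one uses the normal interval. This is a sketch only; struct monitor_sched and next_monitor_delay() are hypothetical and not ctdb's code (the commit only mentions next_interval being reset to 5 seconds).

```c
#include <stdbool.h>

/* Hypothetical scheduler state for the monitor event. */
struct monitor_sched {
	bool first_done;	/* has the first post-startup monitor run? */
	int interval;		/* normal gap between monitor events, seconds */
};

/* Delay before the next monitor event: zero right after startup, so the
 * node can be marked healthy sooner, then the normal interval. */
int next_monitor_delay(struct monitor_sched *m)
{
	if (!m->first_done) {
		m->first_done = true;
		return 0;
	}
	return m->interval;
}
```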

commit | commitdiff | tree

Rusty Russell [Tue, 22 Jun 2010 13:20:23 +0000 (22:50 +0930)]

speed startup: alter recovery loop

We do a recovery on startup.  But the code does:
   Sleep for ctdb->tunable.recover_interval.
   Check for recovery.

We want to do it in the other order.  This is best done by extracting
the loop into a separate "main_loop" function.

Seconds between ctdbd first log message and node healthy:
BEFORE: 24.09
AFTER: 23.58

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 097046025176b9fcb670839d1a9f100f890e7ed2)
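A minimal sketch of the reordered loop, assuming the recovery check and the sleep can be modelled as two callbacks. struct recovery_ops, this main_loop() signature and the tracing hooks are hypothetical; only the check-then-sleep ordering comes from the commit.

```c
#include <stdbool.h>

/* Hypothetical hooks standing in for the recovery daemon's real work. */
struct recovery_ops {
	bool (*check)(void *ctx);		/* check for recovery; false = stop */
	void (*wait_interval)(void *ctx);	/* wait recover_interval */
	void *ctx;
};

/* Reordered loop: check first, then sleep, so the recovery needed at
 * startup is not delayed by an initial sleep. */
void main_loop(const struct recovery_ops *ops)
{
	while (ops->check(ops->ctx))
		ops->wait_interval(ops->ctx);
}

/* Demo hooks that record the order of events ('C' = check, 'S' = sleep). */
struct trace {
	char log[16];
	int n;
	int checks_left;
};

static bool trace_check(void *ctx)
{
	struct trace *t = ctx;

	if (t->n < (int)sizeof t->log - 1)
		t->log[t->n++] = 'C';
	return --t->checks_left > 0;
}

static void trace_wait(void *ctx)
{
	struct trace *t = ctx;

	if (t->n < (int)sizeof t->log - 1)
		t->log[t->n++] = 'S';
}
```

With checks_left set to 3, the trace reads "CSCSC": the first event is now a check rather than a sleep.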

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 06:39:16 +0000 (16:09 +0930)]

libctdb: test: run.sh script

This is a script which starts up a fake ctdbd and runs the libctdb
test suite.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 67ca040b07713d83385db63489c887f7156b7853)

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 06:36:00 +0000 (16:06 +0930)]

libctdb: test: add readrecordlock support

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 1a23581c70a0c8c3b9c8fd4651ce1b2bb4464f97)

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 05:30:46 +0000 (15:00 +0930)]

libctdb: test: add database save and restore

Once we do operations which alter the TDBs, we need to restore them to a
pristine state after a failing child dies.

The method used here is a terrible hack: it should at least do a
tdb_lockall() on the database before blatting it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit d48ec16bd2b4932442d95fc43bea52baa0425501)

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 05:18:54 +0000 (14:48 +0930)]

libctdb: test: --no-failtest

Sometimes you just want to test that the basic test case is sane,
without all the failure paths being tested.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit be7c0bffb0d924c3e72753045d5b85ce90407579)

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 05:27:11 +0000 (14:57 +0930)]

libctdb: test: improve logging of failure paths

We include the file and line which called the functions, so the printed
failure path now looks like:

[malloc(ctdb.c:144)]:1:S[socket(ctdb.c:168)]:1:S...

The form is:
[ <function> ( <caller> ) ] : <input line> : <result>

<function> is the function which is called (e.g. malloc).
<caller> is the file and line number which called <function>.
<input line> is the 1-based line number in the input which we were up to.
<result> is 'S' (success) or 'F' (failure).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 5fb6da30b5b5a8b761c8ab9a8124b87b759ef055)
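The failure path format above is regular enough to parse with sscanf(). The sketch below splits a path into steps; struct fail_step, parse_failpath() and the field sizes are hypothetical, not ctdb-test's own parser. It also accepts the older "[malloc]:1:S" form without a caller.

```c
#include <stdio.h>
#include <string.h>

/* One step of a failure path: [<function>(<caller>)]:<input line>:<result> */
struct fail_step {
	char func[32];
	char caller[64];	/* empty for the older "[malloc]" form */
	int line;		/* 1-based input line */
	char result;		/* 'S' (success) or 'F' (failure) */
};

/* Parse up to max steps; returns the number parsed or -1 if malformed. */
int parse_failpath(const char *path, struct fail_step *steps, int max)
{
	int n = 0;

	while (*path != '\0' && n < max) {
		struct fail_step *s = &steps[n];
		const char *end = strchr(path, ']');
		char inner[100];
		size_t len;
		int used = 0;

		if (*path != '[' || end == NULL)
			return -1;
		len = (size_t)(end - path - 1);
		if (len >= sizeof inner)
			return -1;
		memcpy(inner, path + 1, len);
		inner[len] = '\0';

		/* "malloc(ctdb.c:144)" -> func "malloc", caller "ctdb.c:144" */
		s->caller[0] = '\0';
		if (sscanf(inner, "%31[^(](%63[^)])", s->func, s->caller) < 1)
			return -1;

		/* ":1:S" -> line 1, result 'S'; %n records chars consumed */
		if (sscanf(end + 1, ":%d:%c%n", &s->line, &s->result, &used) != 2 ||
		    (s->result != 'S' && s->result != 'F'))
			return -1;

		path = end + 1 + used;
		n++;
	}
	return n;
}
```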

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 05:32:05 +0000 (15:02 +0930)]

libctdb: test: logging enhancement

Make children log through a pipe to the parent, which then spits it out
only if the child has a problem.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 8ac006cf6c6cbfd3fe1606178eb0f0127d33f632)
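A minimal POSIX sketch of this logging scheme: the child writes its log into a pipe, and the parent buffers it and copies it to stderr only when the child exits non-zero or is killed. run_child_quietly() and the demo functions are hypothetical; the commit does not show how ctdb-test structures this.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run fn() in a child that logs into a pipe. The parent buffers the log
 * and copies it to stderr only if the child did not exit with status 0.
 * Returns the child's exit status, or -1 on error or if it was killed. */
int run_child_quietly(int (*fn)(FILE *log))
{
	int fds[2], status;
	pid_t pid;

	if (pipe(fds) != 0)
		return -1;
	fflush(NULL);	/* don't duplicate buffered parent output in child */
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		FILE *log;
		int rc;

		close(fds[0]);
		log = fdopen(fds[1], "w");
		rc = log ? fn(log) : 1;
		if (log)
			fclose(log);
		_exit(rc);
	}
	close(fds[1]);

	/* Buffer everything the child logs. */
	char *buf = NULL, chunk[4096];
	size_t len = 0;
	ssize_t r;
	while ((r = read(fds[0], chunk, sizeof chunk)) > 0) {
		char *nb = realloc(buf, len + (size_t)r);
		if (nb == NULL)
			break;	/* out of memory: stop reading, keep what we have */
		buf = nb;
		memcpy(buf + len, chunk, (size_t)r);
		len += (size_t)r;
	}
	close(fds[0]);

	if (waitpid(pid, &status, 0) < 0) {
		free(buf);
		return -1;
	}
	if ((!WIFEXITED(status) || WEXITSTATUS(status) != 0) && len > 0)
		fwrite(buf, 1, len, stderr);	/* child failed: show its log */
	free(buf);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Demo children: one succeeds quietly, one fails and its log is shown. */
static int demo_ok(FILE *log)   { fprintf(log, "attached db\n"); return 0; }
static int demo_fail(FILE *log) { fprintf(log, "malloc failed\n"); return 3; }
```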

commit | commitdiff | tree

Rusty Russell [Fri, 16 Jul 2010 04:42:40 +0000 (14:12 +0930)]

libctdb: test infrastructure

This introduces 'ctdb-test', a program for testing libctdb. It takes
commands on standard input (with reduced functionality) or an input file.

It still needs some cleaning up, but you can uncover a bug in libctdb
today just by running a simple attachdb test:

$ ctdb-test tests/attachdb1.txt

It will print out a crash, and the path of successful and failed
operations which led to it:

...
Child signalled 11 on failure path: [malloc]:1:S[socket]:1:S[connect]:1:S[malloc]:1:S[malloc]:1:S[malloc]:1:S[malloc]:4:S[malloc]:4:F

Feed that failure path into ctdb-test using --failpath (under a debugger):

gdb --args ctdb-test tests/attachdb1.txt --failpath=[malloc]:1:S[socket]:1:S[connect]:1:S[malloc]:1:S[malloc]:1:S[malloc]:1:S[malloc]:4:S[malloc]:4:F

And you hit the exact error.

It is based on the fork-to-fail model of nfsim. The relevant parts are
from page 154 of the proceedings of the 2005 Ottawa Linux Symposium, Volume II:
http://www.linuxsymposium.org/2005/linuxsymposium_procv2.pdf

Or our presentation of same (from slide 21):
http://ozlabs.org/~jk/projects/nfsim/nfsim.sxi

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit b4aab4199a57898877b6545a54f212087ed4b35a)
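The fork-to-fail idea can be sketched in a few dozen lines of C. At each failure point the process forks; the child takes the failure path and runs on to the end of the test, while the parent waits for it, reports a bad exit, and continues down the success path. This is a simplified illustration, not nfsim's or ctdb-test's code: only malloc is a failure point, each run injects at most one failure (so the number of runs grows linearly with the number of failure points), and every name here is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static bool in_failing_child;	/* this process already injected its failure */
static char path[256];		/* 'S'/'F' per failure point so far, e.g. "SSF" */
static size_t path_len;
static int children_run, bad_children;	/* counted in the parent only */

static void record(char c)
{
	if (path_len + 1 < sizeof path) {
		path[path_len++] = c;
		path[path_len] = '\0';
	}
}

/* A failure point. Fork: the child fails this call and runs on; the
 * parent waits for it, reports a bad exit, then takes the success path. */
static bool should_fail(void)
{
	pid_t pid;

	if (in_failing_child) {
		record('S');
		return false;
	}
	fflush(NULL);
	pid = fork();
	if (pid == 0) {
		in_failing_child = true;
		record('F');
		return true;
	}
	if (pid > 0) {
		int status;

		children_run++;
		if (waitpid(pid, &status, 0) == pid &&
		    (!WIFEXITED(status) || WEXITSTATUS(status) != 0)) {
			bad_children++;
			fprintf(stderr, "Child failed on failure path: %sF\n", path);
		}
	}
	record('S');
	return false;
}

void *test_malloc(size_t size)
{
	return should_fail() ? NULL : malloc(size);
}

/* Run a scenario; a failing child exits here with the scenario's result. */
int run_failtest(int (*scenario)(void))
{
	int rc;

	path_len = 0;
	path[0] = '\0';
	rc = scenario();
	if (in_failing_child)
		_exit(rc);
	return rc;
}

/* Demo scenario that handles allocation failure correctly. */
static int demo_scenario(void)
{
	char *a = test_malloc(16), *b;

	if (a == NULL)
		return 0;
	b = test_malloc(32);
	if (b == NULL) {
		free(a);
		return 0;
	}
	free(b);
	free(a);
	return 0;
}
```

With the two allocations in demo_scenario(), run_failtest() forks two children, one failing each malloc, and the parent itself takes the all-success path.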

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 05:17:34 +0000 (14:47 +0930)]

libctdb: implement synchronous readrecordlock interface.

Because this doesn't use a generic callback, it's not quite as trivial
as the other sync wrappers.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 1f20b938d46d4fcd50d2b473c1ab8dc31d178d2d)

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 06:05:52 +0000 (15:35 +0930)]

libctdb: implement ctdb_disconnect and ctdb_detachdb

These are important for testing, since any allocations still outstanding
after calling them easily tell us that we leak memory.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 18a212aa40d0ff9ff59775c6fcf9dc973e991460)
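One simple way to make "outstanding allocations" measurable in a test is to route allocations through counting wrappers. The sketch below assumes hypothetical counted_malloc()/counted_free() helpers; libctdb's own allocator and teardown functions are not shown. After a disconnect or detach, the count should be back where it started.

```c
#include <stdlib.h>

/* Number of blocks handed out by counted_malloc() and not yet freed. */
static long live_blocks;

void *counted_malloc(size_t size)
{
	void *p = malloc(size);

	if (p != NULL)
		live_blocks++;
	return p;
}

void counted_free(void *p)
{
	if (p != NULL) {
		live_blocks--;
		free(p);
	}
}

long outstanding_allocations(void)
{
	return live_blocks;
}
```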

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 06:18:48 +0000 (15:48 +0930)]

libctdb: fix io_elem resource leak on realloc failure.

Found by nfsim.

I knew about this, but since we stop anyway when it happens, I didn't fix
it. But it bugs nfsim, so fix it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 936b02443d36306407d6a26e8037cf31e3190b32)
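The commit doesn't show the code, but the classic form of this leak is assigning realloc()'s result straight back to the only pointer to the buffer: on failure realloc() returns NULL, the old block is still allocated, and its address is lost. A sketch of the safe pattern, using a hypothetical struct io_elem with data/len fields (the real libctdb structure may differ):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable I/O buffer; field names are illustrative. */
struct io_elem {
	char *data;
	size_t len;
};

/* Append extra bytes from src. The leaky version would be
 *     e->data = realloc(e->data, e->len + extra);
 * which overwrites the only pointer to the old block with NULL on failure.
 * Here the result goes into a temporary, so on failure e->data is still
 * valid and the function returns -1 with e unchanged. Returns 0 on success. */
int io_elem_append(struct io_elem *e, const void *src, size_t extra)
{
	char *p;

	if (extra > SIZE_MAX - e->len)
		return -1;	/* new size would overflow */
	p = realloc(e->data, e->len + extra);
	if (p == NULL)
		return -1;
	memcpy(p + e->len, src, extra);
	e->data = p;
	e->len += extra;
	return 0;
}
```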

commit | commitdiff | tree

Rusty Russell [Mon, 21 Jun 2010 05:15:37 +0000 (14:45 +0930)]

libctdb: fix writerecord() to actually write the record.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 680ee6afaa89f21115a1bf33a8b9e7e92084a1a1)

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 05:43:54 +0000 (15:13 +0930)]

libctdb: ctdb_service() never returns < 0

Found by ctdb-test.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 0e8210f19edf2ae14154afb85d9b96951881f31f)

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 05:45:11 +0000 (15:15 +0930)]

libctdb: check ctdb_request_free & ctdb_cancel used appropriately.

Since I made this mistake myself, we should check for it.

We could have one function that does both, but from a user's point of
view they are very different and it's quite possibly a bug if they
think the request is finished/unfinished when it's not.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 70f6ed2634fb10749cdad3deffa96a1aa439c235)
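A sketch of such a check, assuming a request records whether its reply has arrived. The commit does not say how misuse is reported; here each function logs to stderr and returns -1. struct request, request_free() and request_cancel() are hypothetical stand-ins for ctdb_request_free() and ctdb_cancel().

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical request: only records whether its reply has arrived. */
struct request {
	bool done;
};

/* Release a finished request. Freeing an unfinished one is reported as a
 * caller bug: they probably meant request_cancel(). */
int request_free(struct request *req)
{
	if (!req->done) {
		fprintf(stderr, "request_free() on unfinished request; use request_cancel()\n");
		return -1;
	}
	free(req);
	return 0;
}

/* Abandon an unfinished request. Cancelling a finished one is also a bug. */
int request_cancel(struct request *req)
{
	if (req->done) {
		fprintf(stderr, "request_cancel() on finished request; use request_free()\n");
		return -1;
	}
	/* A real implementation would also drop it from the pending queue. */
	free(req);
	return 0;
}
```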

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 05:45:27 +0000 (15:15 +0930)]

libctdb: synchronous should be using ctdb_cancel to kill unfinished requests.

Found by ctdb-test.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit cd6b2f46075bfb64561496960af7fc2e95500e52)

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 06:17:23 +0000 (15:47 +0930)]

libctdb: fix uninitialized field usage on ctdb_attach failure path

Found by ctdb-test.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 54c1036090d930c19231038ca861297153c1d0cf)

commit | commitdiff | tree

Rusty Russell [Fri, 18 Jun 2010 06:17:14 +0000 (15:47 +0930)]

libctdb: removed unused lock field from struct ctdb_db

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 256653a223c48ed932ce85f89fc2c2dda14f8c27)

Michael's Samba GIT