metze/ctdb/wip.git
9 years agoMerge commit 'martins/master'
Ronnie Sahlberg [Thu, 22 Oct 2009 23:43:13 +0000 (10:43 +1100)]
Merge commit 'martins/master'

9 years agonew version 1.0.99
Ronnie Sahlberg [Thu, 22 Oct 2009 07:16:33 +0000 (18:16 +1100)]
new version 1.0.99

9 years agoMerge commit 'origin/master'
Martin Schwenke [Thu, 22 Oct 2009 06:48:09 +0000 (17:48 +1100)]
Merge commit 'origin/master'

9 years agoDocument onnode -n and -f options.
Martin Schwenke [Thu, 22 Oct 2009 06:47:10 +0000 (17:47 +1100)]
Document onnode -n and -f options.

Signed-off-by: Martin Schwenke <martin@meltin.net>
9 years agoif a lock wait child died/finished, we could have released the lockwait handle and...
Ronnie Sahlberg [Thu, 22 Oct 2009 02:41:28 +0000 (13:41 +1100)]
if a lock wait child died/finished, we could have released the lockwait handle and set it to NULL before we call the destructors for releasing the waiters.

The waiters reference the lockwait handle in order to remove themselves from the linked list, which caused a SEGV.

We don't actually need to remove ourselves from this list here, since
if the parent freeze_handle holding the list is freed, then all waiters are released as well. The only place we actually need to relink the waiter is in ctdb_freeze_lock_handler, where we want to respond to the clients and release
the waiters but still keep the freeze_handle around.

9 years agoFrom Volker L
Ronnie Sahlberg [Thu, 22 Oct 2009 01:19:40 +0000 (12:19 +1100)]
From Volker L
Fix some warnings and an incorrect check for a talloc failure

10 years agoFrom Wolfgang M.
Ronnie Sahlberg [Wed, 21 Oct 2009 20:58:44 +0000 (07:58 +1100)]
From Wolfgang M.

With the new vacuuming code, don't treat an invalid dmaster as fatal. Let it update to the new value instead.

10 years agoMerge commit 'origin/master'
Martin Schwenke [Wed, 21 Oct 2009 10:48:15 +0000 (21:48 +1100)]
Merge commit 'origin/master'

10 years agoTest suite: Remove the disable/enable monitor tests - they are useless.
Martin Schwenke [Wed, 21 Oct 2009 10:47:06 +0000 (21:47 +1100)]
Test suite: Remove the disable/enable monitor tests - they are useless.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoTest suite: Fix the timeouts on the skip share check tests.
Martin Schwenke [Wed, 21 Oct 2009 10:36:39 +0000 (21:36 +1100)]
Test suite: Fix the timeouts on the skip share check tests.

The timeout for waiting for state changes isn't very predictable.  It
is "about" MonitorInterval seconds...  but can be longer given the
duration of eventscript runs and other things.  So, we change the
timeout to MonitorInterval + EventScriptTimeout, hoping it never takes
that long.

Move the eventscript installation/removal from the old fake-tests into
a function in the functions file.  Implement supporting functions to
create/remove/check-for various files that it handles.  Also add a
function that uses all of this that waits for the next monitor event
(but only if all other monitor events pass).

The final check in the skip share check tests uses the above and waits
for a monitor event, and then checks that the node is still healthy.

Also enhance the wait_until function to handle a command starting with
'!' (as a separate word) to make it easy to wait for a file not to
exist.
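
The '!' handling described above could look roughly like the following. This is a minimal sketch and not the test suite's actual wait_until: the argument order (timeout first), the one-second polling step and the variable names are all assumptions made for illustration.

```shell
# Hypothetical sketch, not the real test suite function.
# Usage: wait_until TIMEOUT [!] COMMAND [ARGS...]
wait_until() {
    timeout=$1
    shift

    # A leading '!' given as a separate word inverts the condition.
    negate=false
    if [ "$1" = "!" ]; then
        negate=true
        shift
    fi

    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@"; then
            # Command succeeded: done unless we are waiting for failure.
            $negate || return 0
        else
            # Command failed: done only if we are waiting for failure.
            $negate && return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

With this shape, waiting for a file to disappear becomes "wait_until 10 ! test -e /some/flag/file", where the path is a placeholder.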

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoDuring tests it is common to add/delete test eventscripts at runtime.
Ronnie Sahlberg [Wed, 21 Oct 2009 05:50:39 +0000 (16:50 +1100)]
During tests it is common to add/delete test eventscripts at runtime.
This can race with the eventscript handling, which does:

list all scripts, sort them, then execute them

So trap status code 127, which means the script could not be executed (or /bin/sh does not exist), and do not let it cause the node to become unhealthy.
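
The idea can be illustrated in shell, although ctdbd itself implements this in C. The sketch below is hypothetical: run_eventscript is an invented name, and it relies only on the standard shell behaviour of reporting status 127 when a command cannot be found.

```shell
# Hypothetical sketch; ctdbd does the equivalent in C, not in shell.
# Runs a script and treats "could not be executed" (127) as a skip.
run_eventscript() {
    script=$1
    shift

    status=0
    "$script" "$@" || status=$?

    if [ "$status" -eq 127 ]; then
        # The script vanished between listing and execution:
        # report it, but do not count it as a failure.
        echo "skipping $script: could not be executed" >&2
        return 0
    fi
    return "$status"
}
```

A script that exists and genuinely fails still returns its own non-zero status, so real monitoring failures are not masked.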

10 years agolower the debug levels for the "create FD messages" so we don't fill up the logs.
Ronnie Sahlberg [Wed, 21 Oct 2009 04:26:24 +0000 (15:26 +1100)]
lower the debug levels for the "create FD messages" so we don't fill up the logs.

10 years ago When clients have blocked, perhaps because the node is banned or stopped and...
Ronnie Sahlberg [Wed, 21 Oct 2009 04:20:55 +0000 (15:20 +1100)]
When clients have blocked, perhaps because the node is banned or stopped and the client is blocked trying to tdb_fetch() a record, make sure we don't queue up too many REQ_MESSAGES.

Add a new tunable to control the maximum queue size we allow for a blocked client before we start discarding REQ_MESSAGES instead of queueing them for delivery.

This avoids queueing up a very large number of MESSAGES, which the samba daemons send to each other, for nodes that are blocked/banned/stopped for extended periods.

10 years agodon't restart ctdb when installing the rpm
Ronnie Sahlberg [Wed, 21 Oct 2009 02:54:02 +0000 (13:54 +1100)]
don't restart ctdb when installing the rpm

10 years agoIn ctdb_ltdb_store(), add a missing transaction_cancel when local store failed.
Michael Adam [Tue, 20 Oct 2009 14:57:23 +0000 (16:57 +0200)]
In ctdb_ltdb_store(), add a missing transaction_cancel when local store failed.

Spotted by Volker.

Michael

10 years agoImprove the log message when we skip the ip allocation check from the recovery daemon.
Ronnie Sahlberg [Wed, 21 Oct 2009 00:51:30 +0000 (11:51 +1100)]
Improve the log message when we skip the ip allocation check from the recovery daemon.

We also skip this check if we are already in the process of performing an ip reallocation, not only when we are performing a full recovery.

10 years agotreat interfaces with the name ethX* as bond devices
Ronnie Sahlberg [Wed, 21 Oct 2009 00:34:17 +0000 (11:34 +1100)]
treat interfaces with the name ethX* as bond devices

10 years agoTest suite: A timeout of MonitorInterval seconds sometimes isn't enough.
Martin Schwenke [Tue, 20 Oct 2009 06:11:01 +0000 (17:11 +1100)]
Test suite: A timeout of MonitorInterval seconds sometimes isn't enough.

Monitor events sometimes happen a little bit more than MonitorInterval
seconds apart.  This changes some timeouts to MonitorInterval + 1
seconds.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoMerge commit 'origin/master'
Martin Schwenke [Tue, 20 Oct 2009 05:53:04 +0000 (16:53 +1100)]
Merge commit 'origin/master'

10 years agoTest suite: New tests for validating SKIP_SHARE_CHECK options.
Martin Schwenke [Tue, 20 Oct 2009 05:52:22 +0000 (16:52 +1100)]
Test suite: New tests for validating SKIP_SHARE_CHECK options.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoTest suite: Update 99_ctdb_uninstall_eventscript.sh to use ctdb_init().
Martin Schwenke [Tue, 20 Oct 2009 05:51:06 +0000 (16:51 +1100)]
Test suite: Update 99_ctdb_uninstall_eventscript.sh to use ctdb_init().

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoTest suite: Fix bug in node_has_status().
Martin Schwenke [Tue, 20 Oct 2009 05:45:29 +0000 (16:45 +1100)]
Test suite: Fix bug in node_has_status().

This function has been broken since it was updated to work with the
"stopped" state (probably commit
67c5bfb5f02c9d45a32d976021ede4fb2174dfe9).  Although ${var#:*:0}
removes the shortest matching prefix of $var, '*' can match substrings
that include ':' if '0' isn't where you expect.  So we were making
unexpected matches and incorrectly returning true for some cases.
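
The pitfall is easy to reproduce in isolation. In the sketch below the status line is made up for illustration (the real machine-readable ctdb status layout may differ), and the field names are assumptions.

```shell
# Hypothetical status line; the real "ctdb status -Y" layout may differ.
line=":1:10.0.0.2:1:0:0"

# Shortest prefix matching ":*:0". The '*' is free to swallow colons,
# so the match runs through to the first ":0" anywhere in the string,
# here skipping over the pnn, the address and a flag that is 1.
rest=${line#:*:0}
echo "$rest"

# Splitting on ':' keeps every field in its own variable instead.
# The leading ':' yields an empty first field.
IFS=: read -r empty pnn ip disconnected remainder <<EOF
$line
EOF
echo "pnn=$pnn disconnected=$disconnected"
```

The prefix removal leaves only ":0" even though the third field is 1, which is the kind of unexpected match the commit describes. Reading the fields individually makes each comparison explicit.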

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoTest suite: add -x option to ctdb_init() function.
Martin Schwenke [Tue, 20 Oct 2009 05:44:44 +0000 (16:44 +1100)]
Test suite: add -x option to ctdb_init() function.

This facilitates tracing of tests.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoversion 1.0.98
Ronnie Sahlberg [Tue, 20 Oct 2009 04:36:35 +0000 (15:36 +1100)]
version 1.0.98

10 years agoFrom Wolfgang Mueller
Ronnie Sahlberg [Tue, 20 Oct 2009 02:01:15 +0000 (13:01 +1100)]
From Wolfgang Mueller

make sure to always create the vactun database and get rid of some annoying log messages

10 years agoFrom Wolfgang Mueller
Ronnie Sahlberg [Tue, 20 Oct 2009 01:59:48 +0000 (12:59 +1100)]
From Wolfgang Mueller

Add a tunable so that when scripts start to hang/timeout, we can make the node unhealthy instead of banned

10 years agoMerge commit 'origin/master'
Martin Schwenke [Mon, 19 Oct 2009 05:46:45 +0000 (16:46 +1100)]
Merge commit 'origin/master'

10 years agoadd a directory where multiple local scripts can be added to run when executing event...
Ronnie Sahlberg [Mon, 19 Oct 2009 05:22:15 +0000 (16:22 +1100)]
add a directory where multiple local scripts can be added to run when executing eventscripts

10 years agowait a bit longer before shutting down when the reclock file is missing
Ronnie Sahlberg [Mon, 19 Oct 2009 04:33:20 +0000 (15:33 +1100)]
wait a bit longer before shutting down when the reclock file is missing

print the filename of the missing file when we turn unhealthy and also
a 'df'

10 years agoRevert "dont shutdown a node when the reclock file is temporarily unavailable."
Ronnie Sahlberg [Mon, 19 Oct 2009 04:30:44 +0000 (15:30 +1100)]
Revert "dont shutdown a node when the reclock file is temporarily unavailable."

This reverts commit f5e9f3007c10a937158bc8cdfabf33c984cf9c50.

10 years agoMerge branch 'onnode_options'
Martin Schwenke [Fri, 16 Oct 2009 05:39:46 +0000 (16:39 +1100)]
Merge branch 'onnode_options'

10 years agoMerge commit 'origin/master'
Martin Schwenke [Fri, 16 Oct 2009 05:36:48 +0000 (16:36 +1100)]
Merge commit 'origin/master'

10 years agoinitscript: when stopping on Red Hat use the success/failure functions.
Martin Schwenke [Fri, 16 Oct 2009 05:35:56 +0000 (16:35 +1100)]
initscript: when stopping on Red Hat use the success/failure functions.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoDon't run eventscript monitor when the databases are frozen.
Ronnie Sahlberg [Thu, 15 Oct 2009 05:03:43 +0000 (16:03 +1100)]
Don't run eventscript monitor when the databases are frozen.
The databases can become frozen a while before we do the actual recovery
since we have the re-recovery timeout.

There is no point in doing much monitoring if we are waiting for a recovery,
or if we are banned.
This will eliminate some annoying log entries where certain tests will fail if the databases are locked.

10 years agodont shutdown a node when the reclock file is temporarily unavailable.
Ronnie Sahlberg [Thu, 15 Oct 2009 02:19:10 +0000 (13:19 +1100)]
dont shutdown a node when the reclock file is temporarily unavailable.
Leave the node as UNHEALTHY. This stops clients from accessing the node until
the reclock file can be accessed again.

10 years agoadd logging every time we create a file descriptor in the main ctdb daemon
Ronnie Sahlberg [Thu, 15 Oct 2009 00:24:54 +0000 (11:24 +1100)]
add logging every time we create a file descriptor in the main ctdb daemon
so we can spot if there are leaks.

plug two file descriptor leaks that occur when sending an ARP fails,
and one leak when we cannot parse the local address during tcp connection establishment

10 years agonew version 1.0.97
Ronnie Sahlberg [Wed, 14 Oct 2009 20:41:56 +0000 (07:41 +1100)]
new version 1.0.97

10 years agoMerge commit 'martins/onnode_options'
Ronnie Sahlberg [Wed, 14 Oct 2009 04:51:57 +0000 (15:51 +1100)]
Merge commit 'martins/onnode_options'

10 years agoversion 1.0.96
Ronnie Sahlberg [Wed, 14 Oct 2009 03:52:24 +0000 (14:52 +1100)]
version 1.0.96

10 years agoadd more debugging output to eventscripts and when a script has timed out,
Ronnie Sahlberg [Wed, 14 Oct 2009 03:14:28 +0000 (14:14 +1100)]
add more debugging output to eventscripts and when a script has timed out,
print a full "pstree -p" to the log.

Example :
        |-ctdbd(29826)-+-ctdbd(29862)
        |              `-ctdbd(31897)-+-00.ctdb(31898)---sleep(31908)

change the default timeout to 60 seconds for eventscripts

10 years agoMerge commit 'origin/master' into onnode_options
Martin Schwenke [Wed, 14 Oct 2009 02:49:30 +0000 (13:49 +1100)]
Merge commit 'origin/master' into onnode_options

10 years agoNew onnode options: -f to specify nodes file, -n to allow use of hostnames.
Martin Schwenke [Wed, 14 Oct 2009 02:44:57 +0000 (13:44 +1100)]
New onnode options: -f to specify nodes file, -n to allow use of hostnames.

The -f option allows an alternate nodes file to be specified,
overriding the CTDB_NODES_FILE environment variable.

The -n option allows hostnames to be used instead of node numbers.
Using a range of hostnames is invalid, so hostnames can't contain
hyphens ('-') - sorry!  You can use this option without a nodes file
by specifying "-f /dev/null".

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agomove the logging of the warning "No reclock file used" to the startup case so we...
Ronnie Sahlberg [Wed, 14 Oct 2009 01:12:04 +0000 (12:12 +1100)]
move the logging of the warning "No reclock file used" to the startup case so we only print this warning on "service ctdb start" and not for "service ctdb *"

10 years agowhen we change state between healthy/unhealthy, make sure we ask the recovery
Ronnie Sahlberg [Wed, 14 Oct 2009 00:59:16 +0000 (11:59 +1100)]
when we change state between healthy/unhealthy, make sure we ask the recovery
master to perform an explicit ip reallocation.

This is more reliable and faster than having the recovery daemon track these
changes, and since we now have an explicit method to ask the recovery daemon
to perform an explicit ip reallocation, we should use this.

10 years agoallow a pre .95 version of a recovery master to freeze databases on a post .95 node...
Ronnie Sahlberg [Tue, 13 Oct 2009 23:14:03 +0000 (10:14 +1100)]
allow a pre .95 version of a recovery master to freeze databases on a post .95 node by remapping priority numbers and log this to log.ctdb

10 years agoalways create the nfs state directories during the monitor event.
Ronnie Sahlberg [Tue, 13 Oct 2009 22:15:24 +0000 (09:15 +1100)]
always create the nfs state directories during the monitor event.
this allows us to configure and enable nfs at runtime without having to restart ctdbd

10 years agoPort Volkers deadlock avoidance patch to HEAD.
Ronnie Sahlberg [Tue, 13 Oct 2009 21:17:49 +0000 (08:17 +1100)]
Port Volkers deadlock avoidance patch to HEAD.
This patch ensures that we lock all non-notify related databases first and
then the notify databases, to avoid a deadlock where samba needs to lock records on two databases at once (with notify being the second database).

Newer versions of samba would instead use the set-db-prio control to set this explicitly on a per-database basis instead of relying on hardcoded database names. This patch will be reverted in the future when all updated versions of samba have been pushed out.

10 years agowe must break the loop as soon as we find a suitable recmaster does exist
Ronnie Sahlberg [Mon, 12 Oct 2009 22:49:05 +0000 (09:49 +1100)]
we must break the loop as soon as we find a suitable recmaster does exist
otherwise "ctdb ipreallocate" will silently fail to update the addresses.

10 years agonew version 1.0.95
Ronnie Sahlberg [Mon, 12 Oct 2009 07:53:20 +0000 (18:53 +1100)]
new version 1.0.95

10 years agouse the correct expected size for the _cancel control
Ronnie Sahlberg [Mon, 12 Oct 2009 07:41:57 +0000 (18:41 +1100)]
use the correct expected size for the _cancel control

10 years agoadd a dispatch to the recovery transaction cancel call
Ronnie Sahlberg [Mon, 12 Oct 2009 07:31:59 +0000 (18:31 +1100)]
add a dispatch to the recovery transaction cancel call

10 years agoMerge commit 'martins/master'
Ronnie Sahlberg [Mon, 12 Oct 2009 05:51:36 +0000 (16:51 +1100)]
Merge commit 'martins/master'

10 years agoadd a new control for explicitly cancelling recovery transactions, i.e. the
Ronnie Sahlberg [Mon, 12 Oct 2009 05:48:05 +0000 (16:48 +1100)]
add a new control for explicitly cancelling recovery transactions, i.e. the
transactions we start across all tdb databases during the recovery.

this allows us to properly clean up and delete these tdb transactions on a
recovery failure.

10 years agoClean up ctdb_check_directories* eventscript functions.
Martin Schwenke [Mon, 12 Oct 2009 05:32:49 +0000 (16:32 +1100)]
Clean up ctdb_check_directories* eventscript functions.

There are 2 problems with this code:

* The loop in ctdb_check_directories_probe() breaks on filenames
  containing whitespace.

  The fix to protect them is to pass "$@" to this function and have it
  operate on "$@".

  Note that there's still a problem with whitespace in filenames in
  the 50.samba eventscript.  To fix this ctdb_check_directories_probe
  should read the filenames from stdin.  Another time...

* The check for '%' in filenames in ctdb_check_directories_probe()
  ends up involving several forks.  On a modern machine this can cost
  a couple of minutes when checking a large number of directories.

  The fix is to use a case statement.
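
Both fixes can be sketched together as follows. This is a hypothetical illustration rather than the actual eventscript code: the function name is invented, and skipping names that contain '%' (assumed here to mark an unexpanded substitution) is an assumption about what the check is for.

```shell
# Hypothetical sketch of the two fixes; not the actual eventscript code.
# Prints each directory that should be checked, one per line.
list_checkable_dirs() {
    # Iterating over "$@" keeps names containing whitespace intact.
    for d in "$@"; do
        case "$d" in
            # Pattern match inside the shell: no fork, unlike a
            # pipeline through an external command such as grep.
            *%*) continue ;;
            *)   printf '%s\n' "$d" ;;
        esac
    done
}

list_checkable_dirs "/srv/share one" "/home/%U" "/data"
```

The call prints "/srv/share one" and "/data" on separate lines. Because the case statement is a shell builtin construct, the cost per directory stays constant no matter how many directories are checked.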

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years ago40.vsftpd: reset the fail counter in the "recovered" event.
Martin Schwenke [Mon, 12 Oct 2009 05:17:37 +0000 (16:17 +1100)]
40.vsftpd: reset the fail counter in the "recovered" event.

Each recovery that involves IP reassignments results in a restart of
vsftpd in the "recovered" event.  Currently, we can have several
recoveries in quick succession and the "monitor" event following each
can fail because vsftpd isn't ready yet.  This results in cumulative
failures, so the node is marked unhealthy, even though vsftpd has
never had a proper opportunity to become ready.

This resets the fail count after each recovery.

While we're here, also move the delete of the restart flag file into
the body of the conditional.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoallow setting the recmode even when not completely frozen.
Ronnie Sahlberg [Mon, 12 Oct 2009 02:06:16 +0000 (13:06 +1100)]
allow setting the recmode even when not completely frozen.
we sometimes have to do this when we want to trigger a recovery

10 years agoinitial attempt at freezing databases in priority order
Ronnie Sahlberg [Mon, 12 Oct 2009 01:08:39 +0000 (12:08 +1100)]
initial attempt at freezing databases in priority order

10 years agoupdate the freeze/thaw commands to be able to send the requested database priority...
Ronnie Sahlberg [Sun, 11 Oct 2009 22:22:17 +0000 (09:22 +1100)]
update the freeze/thaw commands to be able to send the requested database priority to freeze/thaw to the daemon.

this is encoded in the srvid field of the request header

10 years agoduring recovery, update all remote nodes so they use the same priorities
Ronnie Sahlberg [Sat, 10 Oct 2009 05:28:20 +0000 (16:28 +1100)]
during recovery, update all remote nodes so they use the same priorities
for the databases as this node.

10 years agoadd a control to read the db priority from a database
Ronnie Sahlberg [Sat, 10 Oct 2009 04:04:18 +0000 (15:04 +1100)]
add a control to read the db priority from a database

10 years agoadd a control to set a database priority. Let newly created databases default to...
Ronnie Sahlberg [Sat, 10 Oct 2009 03:26:09 +0000 (14:26 +1100)]
add a control to set a database priority. Let newly created databases default to priority 1.

database priorities will be used to control the order in which databases are locked during recovery.

10 years agoverify the DISABLED flag and compare with the previous flag we have registered for...
Ronnie Sahlberg [Sat, 10 Oct 2009 02:55:11 +0000 (13:55 +1100)]
verify the DISABLED flag and compare with the previous flag we have registered for that node and not what the node says is the difference.

this prevents a situation where the remote node may cause spurious ip reallocations.

10 years agoFix bug spotted by Metze,
Ronnie Sahlberg [Fri, 9 Oct 2009 11:22:11 +0000 (22:22 +1100)]
Fix bug spotted by Metze,

the argument to ctdb_control_event_Script_disabled() is a string not a uint32

10 years agoversion 1.0.94
Ronnie Sahlberg [Thu, 8 Oct 2009 08:17:57 +0000 (19:17 +1100)]
version 1.0.94

10 years agoif a node fails to become frozen during recovery, mark it as a culprit so...
Ronnie Sahlberg [Thu, 8 Oct 2009 05:45:25 +0000 (16:45 +1100)]
if a node fails to become frozen during recovery, mark it as a culprit so it will soon get banned

10 years agoversion 1.0.93
Ronnie Sahlberg [Tue, 6 Oct 2009 06:05:14 +0000 (17:05 +1100)]
version 1.0.93

10 years agoupdate natgw eventscript to allow you to force it to update and / or to remove the...
Ronnie Sahlberg [Tue, 6 Oct 2009 05:09:24 +0000 (16:09 +1100)]
update natgw eventscript to allow you to force it to update and / or to remove the configuration at runtime

10 years agoMerge commit 'origin/master'
Martin Schwenke [Tue, 6 Oct 2009 02:39:31 +0000 (13:39 +1100)]
Merge commit 'origin/master'

10 years agoDocument CTDB_NODES_FILE environment variable used by onnode.
Martin Schwenke [Tue, 6 Oct 2009 02:38:00 +0000 (13:38 +1100)]
Document CTDB_NODES_FILE environment variable used by onnode.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoalways send the release/take ip controls to make sure all nodes are updated
Ronnie Sahlberg [Tue, 6 Oct 2009 01:25:44 +0000 (12:25 +1100)]
always send the release/take ip controls to make sure all nodes are updated

10 years agoadd a new message to ask the recovery daemon to temporarily disable checking ip addre...
Ronnie Sahlberg [Tue, 6 Oct 2009 01:11:32 +0000 (12:11 +1100)]
add a new message to ask the recovery daemon to temporarily disable checking ip address consistency.

This is useful when we are moving addresses using moveip in the cluster since otherwise if we collide with the recovery daemons own check we could cause a recovery

10 years agoupdate addip/moveip/delip to make it less likely to trigger an accidental recovery
Ronnie Sahlberg [Tue, 6 Oct 2009 00:41:18 +0000 (11:41 +1100)]
update addip/moveip/delip to make it less likely to trigger an accidental recovery

10 years agochange some loglevels and also print the pnn of the ip for takeip/releaseip logging
Ronnie Sahlberg [Tue, 6 Oct 2009 00:40:38 +0000 (11:40 +1100)]
change some loglevels and also print the pnn of the ip for takeip/releaseip logging

10 years agoadd a new function to collect a list of all active nodes EXCEPT a certain node
Ronnie Sahlberg [Mon, 5 Oct 2009 23:52:31 +0000 (10:52 +1100)]
add a new function to collect a list of all active nodes EXCEPT a certain node

10 years agoallocate takeoverip state as a child of vnn and also make the takeoverip context...
Ronnie Sahlberg [Mon, 5 Oct 2009 22:35:15 +0000 (09:35 +1100)]
allocate takeoverip state as a child of vnn and also make the takeoverip context a child of vnn

10 years agoWhen adding a public ip to a node, make sure to push the assignment of ip addresses...
Ronnie Sahlberg [Mon, 5 Oct 2009 21:19:25 +0000 (08:19 +1100)]
When adding a public ip to a node, make sure to push the assignment of ip addresses out to all nodes so all nodes become aware who currently holds the ip.

10 years agoversion 1.0.92
Ronnie Sahlberg [Fri, 2 Oct 2009 04:38:16 +0000 (14:38 +1000)]
version 1.0.92

10 years agowe should close this file on exec
Ronnie Sahlberg [Fri, 2 Oct 2009 03:41:54 +0000 (13:41 +1000)]
we should close this file on exec

10 years agoMerge commit 'martins/master'
Ronnie Sahlberg [Thu, 1 Oct 2009 05:46:01 +0000 (15:46 +1000)]
Merge commit 'martins/master'

10 years agoTest suite: The ctdb ping test should allow time to go backwards.
Martin Schwenke [Thu, 1 Oct 2009 05:39:09 +0000 (15:39 +1000)]
Test suite: The ctdb ping test should allow time to go backwards.

Time can actually go backwards during this test if ntpd happens to
adjust it a little bit.  So we should cope...

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodon't exit on a commit failure
Ronnie Sahlberg [Thu, 1 Oct 2009 04:53:35 +0000 (14:53 +1000)]
don't exit on a commit failure

10 years agoRevert "Revert "allow the transaction commit to fail""
Ronnie Sahlberg [Thu, 1 Oct 2009 04:51:32 +0000 (14:51 +1000)]
Revert "Revert "allow the transaction commit to fail""

This reverts commit 74e416108df6934f45ca646d709785dd76ab3c35.

10 years agodocument how to use the notification script
Ronnie Sahlberg [Thu, 1 Oct 2009 04:31:55 +0000 (14:31 +1000)]
document how to use the notification script

10 years agoadd a new notification to trigger on when ctdb has started
Ronnie Sahlberg [Thu, 1 Oct 2009 04:05:30 +0000 (14:05 +1000)]
add a new notification to trigger on when ctdb has started

10 years agoMinor fixes to 01.reclock eventscript.
Martin Schwenke [Wed, 30 Sep 2009 11:21:56 +0000 (21:21 +1000)]
Minor fixes to 01.reclock eventscript.

test -z really needs its argument to be quoted.  Simplified a status
test.
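
A short illustration of why the quoting matters follows. It is a generic shell sketch, not the 01.reclock code; the function and variable names are made up.

```shell
# Generic illustration; names are hypothetical.
is_unset_or_empty() {
    # Quoted: test always sees exactly one operand after -z.
    [ -z "$1" ]
}

value="two words"

# Deliberately unquoted: this expands to [ -z two words ], which is
# a malformed test expression rather than an emptiness check.
if [ -z $value ] 2>/dev/null; then
    echo "unquoted: empty"
else
    echo "unquoted: test failed or errored"
fi

if is_unset_or_empty "$value"; then
    echo "quoted: empty"
else
    echo "quoted: not empty"
fi
```

The unquoted form only appears to work while the value happens to be a single word, which is why it tends to survive until a path with whitespace shows up.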

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years ago40.vsftpd monitor event only fails after 2 failures to connect to port 21.
Martin Schwenke [Wed, 30 Sep 2009 11:05:16 +0000 (21:05 +1000)]
40.vsftpd monitor event only fails after 2 failures to connect to port 21.

Change the monitor event in 40.vsftpd so it only fails if there are 2
successive failures connecting to port 21.  This reduces the
likelihood of unhealthy nodes due to vsftpd being restarted for
reconfiguration due to node failover or system reconfiguration.

New eventscript functions ctdb_counter_init, ctdb_counter_incr,
ctdb_counter_limit.  These are used to count arbitrary things in
eventscripts, depending on the eventscript name and a tag that is
passed, and determine if a specified limit has been hit.  They're good
for counting failures!

These functions are used in 40.vsftpd and also in 01.reclock - the
latter used to do the counting without these functions.
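
The counting scheme can be sketched as below. These are hypothetical stand-ins, not the real ctdb_counter_init, ctdb_counter_incr and ctdb_counter_limit: the storage location (a temporary directory), the one-line-per-increment file format and the "limit reached means success" return convention are all assumptions.

```shell
# Hypothetical stand-ins for the counter functions; the real ones'
# storage location and exact semantics are not shown in this log.
COUNTER_DIR=$(mktemp -d)

# One file per (script name, tag) pair.
_counter_file() {
    echo "$COUNTER_DIR/$1.$2"
}

# Reset the counter to zero by truncating its file.
counter_init() {
    : > "$(_counter_file "$1" "$2")"
}

# Each increment appends one line.
counter_incr() {
    echo x >> "$(_counter_file "$1" "$2")"
}

# Succeeds once the count has reached the given limit.
counter_hit_limit() {
    count=$(( $(wc -l < "$(_counter_file "$1" "$2")") ))
    [ "$count" -ge "$3" ]
}

# Example: fail the monitor event only on the second failure.
counter_init 40.vsftpd port21
counter_incr 40.vsftpd port21
counter_hit_limit 40.vsftpd port21 2 || echo "one failure: still healthy"
counter_incr 40.vsftpd port21
counter_hit_limit 40.vsftpd port21 2 && echo "two failures: fail monitor"
```

Keying the file on both the script name and a tag lets several scripts, or several checks within one script, count independently.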

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoMerge commit 'origin/master'
Martin Schwenke [Wed, 30 Sep 2009 09:22:59 +0000 (19:22 +1000)]
Merge commit 'origin/master'

10 years agoNew version 1.0.91
Ronnie Sahlberg [Tue, 29 Sep 2009 03:31:41 +0000 (13:31 +1000)]
New version 1.0.91

10 years agoFrom Wolfgang Mueller-Friedt
Ronnie Sahlberg [Tue, 29 Sep 2009 03:20:18 +0000 (13:20 +1000)]
From Wolfgang Mueller-Friedt

Remove the explicit vacuum/repack commands from the 00.ctdb eventscript
and implement this in the ctdb daemon.

Combine vacuuming and repacking into one cheap read traverse to enumerate all candidate records
and one write traverse that both repacks the database and deletes records locally where we are lmaster and where the records have already been deleted remotely.

This code also adds initial autotuning heuristics for the vacuum intervals and how many records to delete in each iteration.

Minor stylistic changes made by ronnie s

10 years agoMerge commit 'origin/master'
Martin Schwenke [Tue, 29 Sep 2009 02:59:10 +0000 (12:59 +1000)]
Merge commit 'origin/master'

10 years agochange the reclock fail count to 19 monitor intervals before we shut down ctdbd
Ronnie Sahlberg [Mon, 28 Sep 2009 04:12:59 +0000 (14:12 +1000)]
change the reclock fail count to 19 monitor intervals before we shut down ctdbd

10 years ago add a new eventscript 01.reclock
Ronnie Sahlberg [Mon, 28 Sep 2009 04:06:40 +0000 (14:06 +1000)]
add a new eventscript 01.reclock

If the reclock file has been set, this script tests that the
reclock file can actually be accessed.
If the file does not exist, or if the attempt to stat the file hangs,
the node is marked unhealthy after the third failed monitoring event,
and after the tenth failure ctdb itself shuts down.

10 years agoadd machine-readable output for the ctdb getreclock command
Ronnie Sahlberg [Mon, 28 Sep 2009 03:39:54 +0000 (13:39 +1000)]
add machine-readable output for the ctdb getreclock command

10 years agoTest suite: Print debug info on node status timeouts.
Martin Schwenke [Fri, 25 Sep 2009 08:00:17 +0000 (18:00 +1000)]
Test suite: Print debug info on node status timeouts.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoMerge commit 'obnox/master-rebase'
Ronnie Sahlberg [Fri, 25 Sep 2009 07:34:59 +0000 (17:34 +1000)]
Merge commit 'obnox/master-rebase'

10 years agoMerge root@10.1.1.27:/shared/ctdb/ctdb-git
Ronnie Sahlberg [Fri, 25 Sep 2009 03:18:18 +0000 (13:18 +1000)]
Merge root@10.1.1.27:/shared/ctdb/ctdb-git

10 years agowith the new banning logic with one struct for each node we no longer "forget" the...
Ronnie Sahlberg [Fri, 25 Sep 2009 03:14:53 +0000 (13:14 +1000)]
with the new banning logic with one struct for each node we no longer "forget" the other culprits as often as we used to do, which means that things like "ctdb recover" can now actually lead to a node becoming banned if we perform too many recoveries too frequently.

change this to provide absolution to all nodes once they have participated in a recovery session.

10 years agoRevert "dont check if commit failed, we do allow the commit to fail sometimes"
Michael Adam [Thu, 10 Sep 2009 14:21:01 +0000 (16:21 +0200)]
Revert "dont check if commit failed, we do allow the commit to fail sometimes"

This reverts commit affa6f47432507e84b7e76b88a2c27fff8e6e2e4.

Transaction commit should not be allowed to fail.
This is a fatal error.

Michael

10 years agoRevert "allow the transaction commit to fail"
Michael Adam [Thu, 10 Sep 2009 14:20:26 +0000 (16:20 +0200)]
Revert "allow the transaction commit to fail"

This reverts commit 7a6134e684c9ac4763bf198ef1410867b6082c94.

Transaction commit should not be allowed to fail.
This is a fatal error.

Michael

10 years agoctdb_client: fix race in starting concurrent transactions on a single node
Michael Adam [Tue, 4 Aug 2009 07:45:50 +0000 (09:45 +0200)]
ctdb_client: fix race in starting concurrent transactions on a single node

There are two races in concurrent transactions on a single node.
One in starting a transaction, and one with committing (replaying).

This commit closes the first race by storing the pid in the
transaction-lock record and comparing the own pid against it
as a measure to prevent starting a second transaction when
a second node has come in between and changed the pid in the lock
record.

Michael