git.samba.org - ctdb.git/log


commit | commitdiff | tree

Michael Adam [Wed, 28 Oct 2009 21:55:44 +0000 (22:55 +0100)]

client: fix race condition with concurrent transactions on the same node.

In ctdb_transaction_commit(), when the trans2_commit control fails, there
is a race condition in the 1 second sleep between the local transaction_cancel
and the call to ctdb_replay_transaction(): The database is not locked, and
neither is the transaction_lock record. So another client can start and possibly
complete a new transaction in this gap, but only on the same node: The locking
of the transaction_lock record on a different node which involves migration of
the record to the other node has been disabled by introduction of the
transaction_active flag on the db which closes precisely this gap from the start
of the commit until the call to TRANS2_FINISH or TRANS2_ERROR.
But this mechanism does not cover the case where a process on the same node
tries to start a transaction: There is no obstacle to locking the transaction_lock
record because the record does not need to be migrated.

This commit closes this race condition in ctdb_transaction_fetch_start()
by using the new ctdb_ctrl_transaction_active() call to ask the local
ctdb daemon whether it has a transaction running on the database.
If so, the check is repeated until the running transaction is done.

This does introduce an additional call to the local ctdbd when starting
transactions, but it does close the (hopefully) last race condition.

Michael
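
The repeated check described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: wait_for_idle_db(), the transaction_active_fn callback, and the max_tries limit are hypothetical names introduced here (the commit message describes repeating the check until the transaction is done, without a limit). The callback stands in for the question that ctdb_ctrl_transaction_active() puts to the local daemon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for asking the local daemon whether a transaction
 * is active on a database. Returns true while one is running. */
typedef bool (*transaction_active_fn)(void *private_data);

/* Poll until no transaction is active, or give up after max_tries checks.
 * Returns the number of checks performed, or -1 if the limit was reached. */
int wait_for_idle_db(transaction_active_fn is_active, void *private_data,
                     int max_tries)
{
    assert(is_active != NULL);
    for (int tries = 1; tries <= max_tries; tries++) {
        if (!is_active(private_data)) {
            return tries; /* now safe to go for the transaction_lock record */
        }
        /* A real client would sleep briefly here before asking again. */
    }
    return -1;
}

/* Demo callback: reports "active" for the first *remaining calls. */
bool fake_active(void *private_data)
{
    int *remaining = private_data;
    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

With fake_active() reporting three busy answers, wait_for_idle_db() returns on the fourth check.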

commit | commitdiff | tree

Michael Adam [Wed, 28 Oct 2009 21:54:49 +0000 (22:54 +0100)]

client: add ctdb_ctrl_transaction_active() which calls out to CTDB_TRANS2_ACTIVE

Michael

commit | commitdiff | tree

Michael Adam [Wed, 28 Oct 2009 21:50:05 +0000 (22:50 +0100)]

server: add a new ctdb control CTDB_TRANS2_ACTIVE

This asks the daemon whether a transaction is currently active on a
given DB on that node. More precisely this asks for the transaction_active
flag in the ctdb_db_context that is set in the CTDB_TRANS2_COMMIT
control and cleared in the CTDB_TRANS2_ERROR or CTDB_TRANS2_FINISHED controls.

This will be useful for fixing race conditions in the transaction code.

Michael
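
The lifetime of the flag described above can be modelled as a small state machine. This is a sketch under assumptions: struct db_context_sketch and the enum values are reduced, hypothetical stand-ins for ctdb_db_context and the real control numbers, which are not given in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for ctdb_db_context. */
struct db_context_sketch {
    bool transaction_active;
};

/* Hypothetical control identifiers, valid for this sketch only. */
enum trans2_control {
    TRANS2_COMMIT,
    TRANS2_ERROR,
    TRANS2_FINISHED,
    TRANS2_ACTIVE
};

/* Apply a control and return the value of the flag afterwards.
 * TRANS2_ACTIVE only queries the flag and never changes it. */
bool handle_trans2_control(struct db_context_sketch *db, enum trans2_control c)
{
    assert(db != NULL);
    switch (c) {
    case TRANS2_COMMIT:
        db->transaction_active = true;   /* commit has started */
        break;
    case TRANS2_ERROR:
    case TRANS2_FINISHED:
        db->transaction_active = false;  /* commit is over, either way */
        break;
    case TRANS2_ACTIVE:
        break;                           /* query only */
    }
    return db->transaction_active;
}
```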

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 28 Oct 2009 06:42:01 +0000 (17:42 +1100)]

version 1.0.101

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 28 Oct 2009 06:35:15 +0000 (17:35 +1100)]

create a separate context for non-monitor eventscripts so they don't collide

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 28 Oct 2009 05:40:31 +0000 (16:40 +1100)]

return 0 in the event script callback if it was aborted by a different script

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 28 Oct 2009 05:18:28 +0000 (16:18 +1100)]

new version 1.0.100

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 28 Oct 2009 05:11:54 +0000 (16:11 +1100)]

change the eventscript handling to allow EventScriptTimeout for each individual script instead of for the entire set of scripts

restructure the talloc hierarchy to allow this

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 27 Oct 2009 22:07:43 +0000 (09:07 +1100)]

Enhance the logging from eventscripts.
When a single script is finished, also log the name of the script, the duration it took and the return status.

In the loop where we signal back to the main daemon that the script finished, do this once every 100ms instead of once every 1 second

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 27 Oct 2009 04:45:03 +0000 (15:45 +1100)]

add a check that winbind can actually talk to the DC during the startup event
and refuse to start up if it cannot

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 27 Oct 2009 04:17:45 +0000 (15:17 +1100)]

temporarily try allowing clients to attach to databases even if the node is banned/stopped or inactive in any other way.

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 27 Oct 2009 02:51:45 +0000 (13:51 +1100)]

don't run the monitor event so frequently after an event has failed.

use _exit() instead of exit() when terminating an eventscript.
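
The commit message does not state its motivation. A common reason for preferring _exit() in a forked child is that exit() runs atexit() handlers and flushes stdio buffers inherited from the parent, while _exit() does neither. A minimal sketch, with run_child_and_get_status() being a hypothetical helper rather than ctdb code:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates immediately with the given exit code and
 * return that code as seen by the parent, or -1 on error. */
int run_child_and_get_status(int code)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: leave without running the parent's atexit() handlers
         * or flushing the parent's stdio buffers a second time. */
        _exit(code);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```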

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 27 Oct 2009 02:18:52 +0000 (13:18 +1100)]

for debugging

add a global variable holding the pid of the main daemon.
change the tracking of time() in the event loop to only check/warn when called from the main daemon

commit | commitdiff | tree

Stefan Metzmacher [Tue, 6 Oct 2009 14:16:13 +0000 (16:16 +0200)]

ctdb_diagnostics: don't use hardcoded path to iptables

All event scripts use only the relative path, so we should
here.

Also PATH includes /sbin and /usr/sbin...

metze

commit | commitdiff | tree

Stefan Metzmacher [Fri, 9 Oct 2009 13:47:06 +0000 (15:47 +0200)]

ctdb_client: fix DEBUG statement in ctdb_ctrl_modflags()

metze

commit | commitdiff | tree

Stefan Metzmacher [Fri, 9 Oct 2009 13:47:49 +0000 (15:47 +0200)]

server: if takeover runs when the recovery master becomes unhealthy

The problem was this:

When the monitor event fails, the node->flags get updated,
and an update (containing the old and new flags) is sent to
the recovery master.

If the recovery master sends the update to itself (the same process),
it was comparing the node->flags variable with the received new flags.
This check always found both flag values to be equal
and never set the rec->need_takeover_run variable to true.

There were two problems. First, the push_flags_handler() function
didn't pass the received old flags.

Second, the ctdb_control_modflags() function ignored the received old flags.

metze
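
The idea behind the fix can be sketched as follows: decide from the old/new pair carried in the update, not from local state that may already hold the new value. The struct, the flag bit, and need_takeover_run() are hypothetical simplifications introduced here, not the real ctdb definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit for this sketch; not the real ctdb value. */
#define SKETCH_FLAG_UNHEALTHY 0x2u

/* A flags update as carried in the message: both values travel together. */
struct flag_change_sketch {
    uint32_t old_flags;
    uint32_t new_flags;
};

/* Compare the received old and new flags. Comparing against the local
 * node->flags would always report "no change" when sender and receiver
 * are the same process, because that variable was already updated. */
bool need_takeover_run(const struct flag_change_sketch *c)
{
    assert(c != NULL);
    return ((c->old_flags ^ c->new_flags) & SKETCH_FLAG_UNHEALTHY) != 0;
}
```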

commit | commitdiff | tree

Stefan Metzmacher [Fri, 9 Oct 2009 13:50:59 +0000 (15:50 +0200)]

server: print out the full 64-bit srvid on 32-bit hosts

metze
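
The commit message does not show how the value is printed. One portable way to print a 64-bit value on hosts where long is 32 bits wide is the PRIx64 macro from <inttypes.h>. format_srvid() below is a hypothetical helper used only to illustrate that.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit srvid as "0x" plus 16 hex digits. PRIx64 expands to the
 * correct conversion for uint64_t regardless of the width of long.
 * Returns the number of characters written, excluding the terminator. */
int format_srvid(char *buf, size_t buflen, uint64_t srvid)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "0x%016" PRIx64, srvid);
}
```

A format such as "%lx" would truncate the value to 32 bits on such hosts (and is formally undefined for a uint64_t argument there).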

commit | commitdiff | tree

Stefan Metzmacher [Wed, 21 Oct 2009 15:06:48 +0000 (17:06 +0200)]

tcp: don't log an error when we successfully bind to the desired address

metze

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 26 Oct 2009 02:20:35 +0000 (13:20 +1100)]

patch the event loop so we read the current time every iteration.

log an error if the clock jumps backwards
also log an error if the clock jumps >5 seconds forward (we assume here we will get at least one event every 5 seconds)
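
A rough sketch of such a check, assuming one call per event loop iteration. struct clock_watch and clock_watch_update() are hypothetical names; only the two conditions (backwards, or more than 5 seconds forward) come from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

enum clock_step { CLOCK_OK, CLOCK_JUMPED_BACK, CLOCK_JUMPED_FORWARD };

/* Largest forward step tolerated between two iterations, in seconds. */
#define MAX_FORWARD_STEP 5

/* Hypothetical state: the time seen in the previous loop iteration. */
struct clock_watch {
    time_t last;
};

/* Classify the step from the previous reading to 'now', then remember
 * 'now' for the next iteration. The caller would log on anything but OK. */
enum clock_step clock_watch_update(struct clock_watch *w, time_t now)
{
    assert(w != NULL);
    enum clock_step result = CLOCK_OK;

    if (now < w->last) {
        result = CLOCK_JUMPED_BACK;
    } else if (now - w->last > MAX_FORWARD_STEP) {
        result = CLOCK_JUMPED_FORWARD;
    }
    w->last = now;
    return result;
}
```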

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 26 Oct 2009 01:20:52 +0000 (12:20 +1100)]

Suggestion from Volker,

make ctdb_queue_length() cheaper by using a counter variable instead of counting the number of packets each time.
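
The technique is to keep a counter in step with the list so that the length query no longer has to walk it. A minimal sketch with hypothetical types (struct queue_sketch and struct packet_sketch are not the ctdb structures):

```c
#include <assert.h>
#include <stddef.h>

struct packet_sketch {
    struct packet_sketch *next;
};

/* Hypothetical queue: out_queue_len is updated on every push and pop. */
struct queue_sketch {
    struct packet_sketch *head;
    struct packet_sketch *tail;
    size_t out_queue_len;
};

void queue_push(struct queue_sketch *q, struct packet_sketch *p)
{
    assert(q != NULL && p != NULL);
    p->next = NULL;
    if (q->tail != NULL) {
        q->tail->next = p;
    } else {
        q->head = p;
    }
    q->tail = p;
    q->out_queue_len++;
}

struct packet_sketch *queue_pop(struct queue_sketch *q)
{
    assert(q != NULL);
    struct packet_sketch *p = q->head;
    if (p == NULL) {
        return NULL;
    }
    q->head = p->next;
    if (q->head == NULL) {
        q->tail = NULL;
    }
    q->out_queue_len--;
    return p;
}

/* Constant time: returns the counter instead of counting packets. */
size_t queue_length(const struct queue_sketch *q)
{
    assert(q != NULL);
    return q->out_queue_len;
}
```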

commit | commitdiff | tree

Ronnie Sahlberg [Sun, 25 Oct 2009 23:22:00 +0000 (10:22 +1100)]

disable the multipath eventscript by default

commit | commitdiff | tree

Ronnie Sahlberg [Sun, 25 Oct 2009 23:11:00 +0000 (10:11 +1100)]

update the manpage for ctdb setreclock

commit | commitdiff | tree

Ronnie Sahlberg [Sun, 25 Oct 2009 23:13:20 +0000 (10:13 +1100)]

automatically re-activate the reclock file check if we set the reclock file to something

commit | commitdiff | tree

Ronnie Sahlberg [Sun, 25 Oct 2009 22:35:18 +0000 (09:35 +1100)]

lower the log level of a debug message

commit | commitdiff | tree

Ronnie Sahlberg [Fri, 23 Oct 2009 04:24:51 +0000 (15:24 +1100)]

Add a mechanism where we can register notifications to be sent out to a SRVID when the client disconnects.

The way to use this is from a client to :
1, first create a message handle and bind it to a SRVID
   A special prefix for the srvid space has been set aside for samba :
   Only samba is allowed to use srvid's with the top 32 bits set like this.
   The lower 32 bits are for samba to use internally.

2, register a "notification" using the new control :
                    CTDB_CONTROL_REGISTER_NOTIFY         = 114,
   This control takes as indata a structure like this :
struct ctdb_client_notify_register {
        uint64_t srvid;
        uint32_t len;
        uint8_t notify_data[1];
};

srvid is the srvid used in the space set aside above.
len and notify_data describe an arbitrary blob.
When notifications are later sent out to all clients, this is the payload of that notification message.

If a client has registered with control 114 and then disconnects from ctdbd, ctdbd will broadcast a message to that srvid to all nodes/listeners in the cluster.

A client can register itself with as many different srvids as it wants, but this is handled through a linked list from the client structure, so it is mainly designed for "few notifications per client".

3, a client that no longer wants to have a notification set up can deregister using control
                    CTDB_CONTROL_DEREGISTER_NOTIFY       = 115,
which takes this as arguments :
struct ctdb_client_notify_deregister {
        uint64_t srvid;
};

When a client deregisters, no message will be sent to the other clients when this client later disconnects from ctdbd.
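
Because notify_data is a variable-length trailer, the indata for the register control has to be sized from the offset of that field plus len. The sketch below uses the struct exactly as given above; build_notify_register() and the srvid value used with it are hypothetical (the reserved samba prefix is not reproduced in this message), and sending the blob to ctdbd is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Layout as given in the commit message. */
struct ctdb_client_notify_register {
    uint64_t srvid;
    uint32_t len;
    uint8_t notify_data[1];
};

/* Hypothetical helper: allocate and fill the indata blob. On success,
 * *out_size receives the number of bytes to send. The caller frees. */
struct ctdb_client_notify_register *
build_notify_register(uint64_t srvid, const void *data, uint32_t len,
                      size_t *out_size)
{
    assert(out_size != NULL);
    assert(data != NULL || len == 0);

    const size_t header =
        offsetof(struct ctdb_client_notify_register, notify_data);

    /* Allocate the full struct plus the payload so the struct is always
     * completely backed by memory, even when len is 0. */
    struct ctdb_client_notify_register *r = calloc(1, sizeof(*r) + len);
    if (r == NULL) {
        return NULL;
    }
    r->srvid = srvid;
    r->len = len;
    if (len > 0) {
        memcpy((uint8_t *)r + header, data, len);
    }
    *out_size = header + len;
    return r;
}
```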

commit | commitdiff | tree

Ronnie Sahlberg [Fri, 23 Oct 2009 02:55:21 +0000 (13:55 +1100)]

when scripts timeout, log pstree to a file in /tmp and just log the filename in the messages file

commit | commitdiff | tree

Ronnie Sahlberg [Fri, 23 Oct 2009 02:54:45 +0000 (13:54 +1100)]

set the eventscripts to timeout after 20 seconds
change the ban count to 10 failures before we ban by default

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 22 Oct 2009 23:43:13 +0000 (10:43 +1100)]

Merge commit 'martins/master'

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 22 Oct 2009 07:16:33 +0000 (18:16 +1100)]

new version 1.0.99

commit | commitdiff | tree

Martin Schwenke [Thu, 22 Oct 2009 06:48:09 +0000 (17:48 +1100)]

Merge commit 'origin/master'

commit | commitdiff | tree

Martin Schwenke [Thu, 22 Oct 2009 06:47:10 +0000 (17:47 +1100)]

Document onnode -n and -f options.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 22 Oct 2009 02:41:28 +0000 (13:41 +1100)]

if a lock wait child died/finished, we could have released the lockwait handle and set it to NULL before we call the destructors for releasing the waiters.

The waiters reference the lockwait handle in order to remove themselves from the
linked list, which caused a SEGV.

We don't actually need to remove ourselves from this list here, since
if the parent freeze_handle holding the list is freed, then all waiters are
released as well. The only place we actually need to relink the waiter is in
ctdb_freeze_lock_handler, where we want to respond back to the clients and release
the waiters but still want to keep the freeze_handle hanging around.

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 22 Oct 2009 01:19:40 +0000 (12:19 +1100)]

From Volker L
Fix some warnings and an incorrect check for a talloc failure

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 20:58:44 +0000 (07:58 +1100)]

From Wolfgang M.

With the new vacuuming code, don't treat an invalid dmaster as fatal. Let it update to the new value instead.

commit | commitdiff | tree

Martin Schwenke [Wed, 21 Oct 2009 10:48:15 +0000 (21:48 +1100)]

Merge commit 'origin/master'

commit | commitdiff | tree

Martin Schwenke [Wed, 21 Oct 2009 10:47:06 +0000 (21:47 +1100)]

Test suite: Remove the disable/enable monitor tests - they are useless.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Martin Schwenke [Wed, 21 Oct 2009 10:36:39 +0000 (21:36 +1100)]

Test suite: Fix the timeouts on the skip share check tests.

The timeout for waiting for state changes isn't very predictable.  It
is "about" MonitorInterval seconds...  but can be longer given the
duration of eventscript runs and other things.  So, we change the
timeout to MonitorInterval + EventScriptTimeout, hoping it never takes
that long.

Move the eventscript installation/removal from the old fake-tests into
a function in the functions file.  Implement supporting functions to
create/remove/check-for various files that it handles.  Also add a
function that uses all of this that waits for the next monitor event
(but only if all other monitor events pass).

The final check in the skip share check tests uses the above and waits
for a monitor event, and then checks that the node is still healthy.

Also enhance the wait_until function to handle a command starting with
'!' (as a separate word) to make it easy to wait for a file not to
exist.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 05:50:39 +0000 (16:50 +1100)]

During tests it is common to add/delete test eventscripts at runtime.
This can race with the eventscript handling, which does:

list all scripts, sort them, then execute them

So trap status code 127, which means the script could not be executed (or /bin/sh does not exist), and do not let it cause the node to become unhealthy.
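
A sketch of that policy as a pure function over the script's exit code. exit_code_is_failure() is a hypothetical helper, not the function used in ctdb.

```c
#include <assert.h>
#include <stdbool.h>

/* Exit code the shell reports when a command cannot be found or run. */
#define SCRIPT_NOT_EXECUTABLE 127

/* Hypothetical policy helper: should this eventscript exit code count as
 * a failure that makes the node unhealthy? */
bool exit_code_is_failure(int exit_code)
{
    assert(exit_code >= 0 && exit_code <= 255);
    if (exit_code == 0) {
        return false;
    }
    if (exit_code == SCRIPT_NOT_EXECUTABLE) {
        /* The script disappeared between listing and execution. */
        return false;
    }
    return true;
}
```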

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 04:26:24 +0000 (15:26 +1100)]

lower the debug levels for the "create FD messages" so we don't fill up the logs.

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 04:20:55 +0000 (15:20 +1100)]

When clients have blocked, perhaps because the node is banned or stopped and the client is blocked trying to tdb_fetch() a record, make sure we don't queue up too many REQ_MESSAGES.

Add a new tunable to control the maximum queue size we allow to a blocked client before we start discarding REQ_MESSAGES instead of queueing them for delivery.

This avoids queueing up a very large number of the MESSAGES that samba instances
send each other, destined for nodes that are blocked/banned/stopped for extended
periods.
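
The queue-or-discard decision can be sketched like this. The struct and queue_or_drop() are hypothetical, and max_queue_len stands in for the new tunable, whose name and default are not given in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client bookkeeping for this sketch. */
struct client_queue_sketch {
    size_t queued;   /* messages currently waiting for delivery */
    size_t dropped;  /* messages discarded because the queue was full */
};

/* Account for one incoming message. Returns true if it was queued and
 * false if it was discarded because the limit had been reached. */
bool queue_or_drop(struct client_queue_sketch *c, size_t max_queue_len)
{
    assert(c != NULL);
    if (c->queued >= max_queue_len) {
        c->dropped++;
        return false;
    }
    c->queued++;
    return true;
}
```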

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 02:54:02 +0000 (13:54 +1100)]

don't restart ctdb when installing the rpm

commit | commitdiff | tree

Michael Adam [Tue, 20 Oct 2009 14:57:23 +0000 (16:57 +0200)]

In ctdb_ltdb_store(), add a missing transaction_cancel when local store failed.

Spotted by Volker.

Michael

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 00:51:30 +0000 (11:51 +1100)]

Improve the log message when we skip the ip allocation check from the recovery daemon.

we also skip this check if we are already in the process of performing an ip reallocation and not only when we are performing a full recovery.

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 21 Oct 2009 00:34:17 +0000 (11:34 +1100)]

treat interfaces with the name ethX* as bond devices

commit | commitdiff | tree

Martin Schwenke [Tue, 20 Oct 2009 06:11:01 +0000 (17:11 +1100)]

Test suite: A timeout of MonitorInterval seconds sometimes isn't enough.

Monitor events sometimes happen a little bit more than MonitorInterval
seconds apart. This changes some timeouts to MonitorInterval + 1
seconds.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Martin Schwenke [Tue, 20 Oct 2009 05:53:04 +0000 (16:53 +1100)]

Merge commit 'origin/master'

commit | commitdiff | tree

Martin Schwenke [Tue, 20 Oct 2009 05:52:22 +0000 (16:52 +1100)]

Test suite: New tests for validating SKIP_SHARE_CHECK options.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Martin Schwenke [Tue, 20 Oct 2009 05:51:06 +0000 (16:51 +1100)]

Test suite: Update 99_ctdb_uninstall_eventscript.sh to use ctdb_init().

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Martin Schwenke [Tue, 20 Oct 2009 05:45:29 +0000 (16:45 +1100)]

Test suite: Fix bug in node_has_status().

This function has been broken since it was updated to work with the
"stopped" state (probably commit
67c5bfb5f02c9d45a32d976021ede4fb2174dfe9). Although ${var#:*:0}
removes the shortest matching prefix of $var, '*' can match substrings
that include ':' if '0' isn't where you expect. So we were making
unexpected matches and incorrectly returning true for some cases.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Martin Schwenke [Tue, 20 Oct 2009 05:44:44 +0000 (16:44 +1100)]

Test suite: add -x option to ctdb_init() function.

This facilitates tracing of tests.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 20 Oct 2009 04:36:35 +0000 (15:36 +1100)]

version 1.0.98

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 20 Oct 2009 02:01:15 +0000 (13:01 +1100)]

From Wolfgang Mueller

make sure to always create the vactun database and get rid of some annoying log messages

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 20 Oct 2009 01:59:48 +0000 (12:59 +1100)]

From Wolfgang Mueller

Add a tunable so that when scripts start to hang/timeout, we can make the node unhealthy instead of banned

commit | commitdiff | tree

Martin Schwenke [Mon, 19 Oct 2009 05:46:45 +0000 (16:46 +1100)]

Merge commit 'origin/master'

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 19 Oct 2009 05:22:15 +0000 (16:22 +1100)]

add a directory where multiple local scripts can be added to run when executing eventscripts

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 19 Oct 2009 04:33:20 +0000 (15:33 +1100)]

wait a bit longer before shutting down when the reclock file is missing

print the filename of the missing file when we turn unhealthy and also
print a 'df'

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 19 Oct 2009 04:30:44 +0000 (15:30 +1100)]

Revert "dont shutdown a node when the reclock file is temporarily unavailable."

This reverts commit f5e9f3007c10a937158bc8cdfabf33c984cf9c50.

commit | commitdiff | tree

Martin Schwenke [Fri, 16 Oct 2009 05:39:46 +0000 (16:39 +1100)]

Merge branch 'onnode_options'

commit | commitdiff | tree

Martin Schwenke [Fri, 16 Oct 2009 05:36:48 +0000 (16:36 +1100)]

Merge commit 'origin/master'

commit | commitdiff | tree

Martin Schwenke [Fri, 16 Oct 2009 05:35:56 +0000 (16:35 +1100)]

initscript: when stopping on Red Hat use the success/failure functions.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 15 Oct 2009 05:03:43 +0000 (16:03 +1100)]

Don't run the eventscript monitor when the databases are frozen.
The databases can become frozen a while before we do the actual recovery
since we have the re-recovery timeout.

There is no point in doing much monitoring if we are waiting for a recovery,
or if we are banned.
This will eliminate some annoying log entries where certain tests will fail if the databases are locked.

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 15 Oct 2009 02:19:10 +0000 (13:19 +1100)]

dont shutdown a node when the reclock file is temporarily unavailable.
Leave the node as UNHEALTHY; this stops clients from accessing the node until
the reclock file can be accessed again

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 15 Oct 2009 00:24:54 +0000 (11:24 +1100)]

add logging every time we create a file descriptor in the main ctdb daemon
so we can spot if there are leaks.

plug two file descriptor leaks related to failures when sending ARP,
and one leak when we cannot parse the local address during tcp connection establishment

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 14 Oct 2009 20:41:56 +0000 (07:41 +1100)]

new version 1.0.97

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 14 Oct 2009 04:51:57 +0000 (15:51 +1100)]

Merge commit 'martins/onnode_options'

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 14 Oct 2009 03:52:24 +0000 (14:52 +1100)]

version 1.0.96

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 14 Oct 2009 03:14:28 +0000 (14:14 +1100)]

add more debugging output to eventscripts and when a script has timed out,
print a full "pstree -p" to the log.

Example :
|-ctdbd(29826)-+-ctdbd(29862)
| `-ctdbd(31897)-+-00.ctdb(31898)---sleep(31908)

change the default timeout to 60 seconds for eventscripts

commit | commitdiff | tree

Martin Schwenke [Wed, 14 Oct 2009 02:49:30 +0000 (13:49 +1100)]

Merge commit 'origin/master' into onnode_options

commit | commitdiff | tree

Martin Schwenke [Wed, 14 Oct 2009 02:44:57 +0000 (13:44 +1100)]

New onnode options: -f to specify nodes file, -n to allow use of hostnames.

The -f option allows an alternate nodes file to be specified,
overriding the CTDB_NODES_FILE environment variable.

The -n option allows hostnames to be used instead of node numbers.
Using a range of hostnames is invalid, so hostnames can't contain
hyphens ('-') - sorry! You can use this option without a nodes file
by specifying "-f /dev/null".

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 14 Oct 2009 01:12:04 +0000 (12:12 +1100)]

move the logging of the warning "No reclock file used" to the startup case so we only print this warning on "service ctdb start" and not for "service ctdb *"

commit | commitdiff | tree

Ronnie Sahlberg [Wed, 14 Oct 2009 00:59:16 +0000 (11:59 +1100)]

when we change state between healthy/unhealthy, make sure we ask the recovery
master to perform an explicit ip reallocation.

This is more reliable and faster than having the recovery daemon track these
changes, and since we now have an explicit method to ask the recovery daemon
to perform an explicit ip reallocation, we should use this.

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 13 Oct 2009 23:14:03 +0000 (10:14 +1100)]

allow a pre .95 version of a recovery master to freeze databases on a post .95 node by remapping priority numbers and log this to log.ctdb

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 13 Oct 2009 22:15:24 +0000 (09:15 +1100)]

always create the nfs state directories during the monitor event.
this allows us to configure and enable nfs at runtime without having to restart ctdbd

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 13 Oct 2009 21:17:49 +0000 (08:17 +1100)]

Port Volker's deadlock avoidance patch to HEAD.
This patch ensures that we lock all non-notify related databases first and
then the notify databases to avoid a deadlock where samba needs to lock records on two databases at once (with notify being the second database).

Newer versions of samba would instead use the set-db-prio control to set this explicitly on a per-database basis instead of relying on hardcoded database names. This patch will be reverted in the future when all updated versions of samba have been pushed out.

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 22:49:05 +0000 (09:49 +1100)]

we must break the loop as soon as we find that a suitable recmaster exists,
otherwise "ctdb ipreallocate" will silently fail to update the addresses.

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 07:53:20 +0000 (18:53 +1100)]

new version 1.0.95

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 07:41:57 +0000 (18:41 +1100)]

use the correct expected size for the _cancel control

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 07:31:59 +0000 (18:31 +1100)]

add a dispatch to the recovery transaction cancel call

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 05:51:36 +0000 (16:51 +1100)]

Merge commit 'martins/master'

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 05:48:05 +0000 (16:48 +1100)]

add a new control for explicitly cancelling recovery transactions, i.e. the
transactions we start across all tdb databases during the recovery.

this allows us to properly clean up and delete these tdb transactions on a
recovery failure.

commit | commitdiff | tree

Martin Schwenke [Mon, 12 Oct 2009 05:32:49 +0000 (16:32 +1100)]

Clean up ctdb_check_directories* eventscript functions.

There are 2 problems with this code:

* The loop in ctdb_check_directories_probe() breaks on filenames
  containing whitespace.

  The fix to protect them is to pass "$@" to this function and have it
  operate on "$@".

  Note that there's still a problem with whitespace in filenames in
  the 50.samba eventscript.  To fix this ctdb_check_directories_probe
  should read the filenames from stdin.  Another time...

* The check for '%' in filenames in ctdb_check_directories_probe()
  ends up involving several forks.  On a modern machine this can cost
  a couple of minutes when checking a large number of directories.

  The fix is to use a case statement.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Martin Schwenke [Mon, 12 Oct 2009 05:17:37 +0000 (16:17 +1100)]

40.vsftpd: reset the fail counter in the "recovered" event.

Each recovery that involves IP reassignments results in a restart of
vsftpd in the "recovered" event. Currently, we can have several
recoveries in quick succession and the "monitor" event following each
can fail because vsftpd isn't ready yet. This results in cumulative
failures, so the node is marked unhealthy, even though vsftpd has
never had a proper opportunity to become ready.

This resets the fail count after each recovery.

While we're here, also move the delete of the restart flag file into
the body of the conditional.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 02:06:16 +0000 (13:06 +1100)]

allow setting the recmode even when not completely frozen.
we sometimes have to do this when we want to trigger a recovery

commit | commitdiff | tree

Ronnie Sahlberg [Mon, 12 Oct 2009 01:08:39 +0000 (12:08 +1100)]

initial attempt at freezing databases in priority order

commit | commitdiff | tree

Ronnie Sahlberg [Sun, 11 Oct 2009 22:22:17 +0000 (09:22 +1100)]

update the freeze/thaw commands to be able to send the requested database priority to freeze/thaw to the daemon.

this is encoded in the srvid field of the request header

commit | commitdiff | tree

Ronnie Sahlberg [Sat, 10 Oct 2009 05:28:20 +0000 (16:28 +1100)]

during recovery, update all remote nodes so they use the same priorities
for the databases as this node.

commit | commitdiff | tree

Ronnie Sahlberg [Sat, 10 Oct 2009 04:04:18 +0000 (15:04 +1100)]

add a control to read the db priority from a database

commit | commitdiff | tree

Ronnie Sahlberg [Sat, 10 Oct 2009 03:26:09 +0000 (14:26 +1100)]

add a control to set a database priority. Let newly created databases default to priority 1.

database priorities will be used to control the order in which databases are locked during recovery.
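
One way to derive a locking order from per-database priorities is to sort the databases by priority and lock them in that order. The sketch below assumes that a lower priority number is locked first, which this message does not spell out; struct db_sketch and sort_for_locking() are hypothetical names, and the actual locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical database descriptor for this sketch. */
struct db_sketch {
    const char *name;
    uint32_t priority;  /* newly created databases default to 1 */
};

static int compare_priority(const void *a, const void *b)
{
    const struct db_sketch *da = a;
    const struct db_sketch *db = b;
    return (da->priority > db->priority) - (da->priority < db->priority);
}

/* Sort so that lower priority numbers come first (an assumption); a
 * recovery would then lock the databases in array order. */
void sort_for_locking(struct db_sketch *dbs, size_t count)
{
    assert(dbs != NULL || count == 0);
    if (count == 0) {
        return;
    }
    qsort(dbs, count, sizeof(dbs[0]), compare_priority);
}
```

Locking every database in one agreed order is the usual way to avoid the kind of two-database deadlock that priorities are meant to prevent.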

commit | commitdiff | tree

Ronnie Sahlberg [Sat, 10 Oct 2009 02:55:11 +0000 (13:55 +1100)]

verify the DISABLED flag and compare with the previous flag we have registered for that node and not what the node says is the difference.

this prevents a situation where the remote node may cause spurious ip reallocations.

commit | commitdiff | tree

Ronnie Sahlberg [Fri, 9 Oct 2009 11:22:11 +0000 (22:22 +1100)]

Fix bug spotted by Metze,

the argument to ctdb_control_event_Script_disabled() is a string not a uint32

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 8 Oct 2009 08:17:57 +0000 (19:17 +1100)]

version 1.0.94

commit | commitdiff | tree

Ronnie Sahlberg [Thu, 8 Oct 2009 05:45:25 +0000 (16:45 +1100)]

if a node fails to become frozen during recovery, mark it as a culprit so it will soon get banned

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 6 Oct 2009 06:05:14 +0000 (17:05 +1100)]

version 1.0.93

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 6 Oct 2009 05:09:24 +0000 (16:09 +1100)]

update natgw eventscript to allow you to force it to update and / or to remove the configuration at runtime

commit | commitdiff | tree

Martin Schwenke [Tue, 6 Oct 2009 02:39:31 +0000 (13:39 +1100)]

Merge commit 'origin/master'

commit | commitdiff | tree

Martin Schwenke [Tue, 6 Oct 2009 02:38:00 +0000 (13:38 +1100)]

Document CTDB_NODES_FILE environment variable used by onnode.

Signed-off-by: Martin Schwenke <martin@meltin.net>

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 6 Oct 2009 01:25:44 +0000 (12:25 +1100)]

always send the release/take ip controls to make sure all nodes are updated

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 6 Oct 2009 01:11:32 +0000 (12:11 +1100)]

add a new message to ask the recovery daemon to temporarily disable checking ip address consistency.

This is useful when we are moving addresses using moveip in the cluster since otherwise if we collide with the recovery daemons own check we could cause a recovery

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 6 Oct 2009 00:41:18 +0000 (11:41 +1100)]

update addip/moveip/delip to make it less likely to trigger an accidental recovery

commit | commitdiff | tree

Ronnie Sahlberg [Tue, 6 Oct 2009 00:40:38 +0000 (11:40 +1100)]

change some loglevels and also print the pnn of the ip for takeip/releaseip logging

CTDB repository