apache's service name is not always httpd. Solution 2 of <https://bugzilla.samba.org/show_bug.cgi?id=8317>
Eventscripts - enhance ctdb_replay_monitor_status() Print useful output and return a suitable exit code. The DISABLED and TIMEDOUT statuses use fake negative return codes, and these can't be faked from the shell. So we map DISABLED to OK and TIMEDOUT to ERROR - this should avoid nearly all surprises. When we do this we add a note to the beginning of the output. The alternative is to "fix" ctdbd to use only codes that can actually be returned by shell scripts. However, the reason for using negative codes is probably to distinguish them from real ones... Signed-off-by: Martin Schwenke <martin@meltin.net>
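The status mapping described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function name replay_status and the wording of the notes are hypothetical, and statuses are passed in by name rather than as the negative codes ctdbd uses internally.

```shell
# Hypothetical sketch: print the saved output of a previous monitor run
# and return an exit code that a shell script can actually produce.
replay_status ()
{
    _status="$1"   # assumed to be one of OK, ERROR, DISABLED, TIMEDOUT
    _output="$2"   # saved output from the previous run, may be empty

    case "$_status" in
        OK)       _rc=0 ;;
        ERROR)    _rc=1 ;;
        DISABLED)
            # Cannot be returned from shell, so report it as OK with a note.
            echo "[Replay of DISABLED status, reported as OK]"
            _rc=0 ;;
        TIMEDOUT)
            # Cannot be returned from shell, so report it as ERROR with a note.
            echo "[Replay of TIMEDOUT status, reported as ERROR]"
            _rc=1 ;;
        *)
            echo "Unknown status: $_status"
            _rc=1 ;;
    esac

    [ -z "$_output" ] || echo "$_output"
    return $_rc
}
```

The note is printed before the replayed output so that a reader of the log can tell a mapped status from a genuine one.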
Eventscripts - use ctdb scriptstatus -Y when replaying status Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts: add a synchronous synthetic reconfigure event.

In the current code services can only be reconfigured asynchronously. This means that configuration file changes can be made, an asynchronous reconfigure event can be triggered, and it always succeeds. Some time later, when the service is actually reconfigured, a failure may be seen.

This adds a synthetic reconfigure event that reconfigures a service synchronously so that any failure is reported on exit. ctdb_service_check_reconfigure() is essentially reimplemented.

If a reconfigure event is in flight and an ipreallocated or monitor event occurs then any scheduled asynchronous reconfigure is deferred until the next monitor cycle. This is to avoid reconfigures trampling on each other. In this case a monitor event will also replay the previous status to try to avoid exposing any temporary instability.

If a reconfigure event collides with another reconfigure event it will exit with status 2, indicating that the reconfigure should be retried. The reconfigure event is implemented using a subprocess to control the exit from the synthetic event.

As before, if a monitor event causes a scheduled synchronous reconfigure to occur then it will replay the previous status for the service, given that a reconfigure can cause temporary instability.

Signed-off-by: Martin Schwenke <martin@meltin.net>
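The "exit with status 2 on collision" behaviour can be illustrated with a small sketch. The locking mechanism here (an atomically created directory), the lock path and the function name run_reconfigure are all assumptions made for the example. The commit does not say how collisions are detected.

```shell
# Hypothetical lock location, overridable for testing.
reconfigure_lock="${RECONFIGURE_LOCK:-/tmp/ctdb-example-reconfigure.lock}"

# Run the given reconfigure command synchronously.
# Returns 2 if another reconfigure holds the lock (caller should retry),
# otherwise the exit status of the command itself.
run_reconfigure ()
{
    # mkdir either creates the directory or fails, so it works as a lock.
    if ! mkdir "$reconfigure_lock" 2>/dev/null ; then
        echo "Reconfigure already in progress, please retry"
        return 2
    fi

    _rc=0
    "$@" || _rc=$?

    rmdir "$reconfigure_lock"
    return $_rc
}
```

A caller could then distinguish "retry later" (status 2) from a genuine reconfigure failure (any other non-zero status).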
Eventscripts - new function ctdb_check_args() Pass this "$@" to do common eventscript argument checking. For regular use putting this in 00.ctdb would be enough. However, for developer testing it can be useful to call this in other eventscripts. For example, 10.interfaces and 13.per_ip_routing currently check these by hand. Signed-off-by: Martin Schwenke <martin@meltin.net>
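As a rough idea of what common argument checking might look like, here is a sketch. It assumes that the takeip and releaseip events are passed an interface, an IP address and netmask bits. The function name check_event_args is hypothetical and the real ctdb_check_args() may check different things.

```shell
# Hypothetical sketch of per-event argument checking.
# $1 is the event name, the remaining arguments are event specific.
check_event_args ()
{
    case "${1:-}" in
        takeip|releaseip)
            # Assumed arguments: interface, IP address, netmask bits.
            if [ -z "${2:-}" ] || [ -z "${3:-}" ] || [ -z "${4:-}" ] ; then
                echo "ERROR: $1 requires interface, IP address and netmask bits"
                return 1
            fi
            ;;
    esac
    return 0
}
```

An eventscript would call this near the top, for example with `check_event_args "$@" || exit 1`.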
Eventscripts - ctdb_check_tcp_ports() bug fix. Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts - fix debugging buglet in ctdb_check_tcp_ports_ctdb() Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts: New configuration variable CTDB_SERVICE_AUTOSTARTSTOP. Some of the current auto-start/stop logic is broken, particularly for Samba. Fixing it is non-trivial. If $CTDB_SERVICE_AUTOSTARTSTOP is "yes" then auto-start/stop services when told to newly manage or no longer manage them. This defaults to "yes". However, if using a canned configuration file that doesn't set $CTDB_SERVICE_AUTOSTARTSTOP then this stops the auto-start-stop logic from working. Therefore, this works around CQ S1026685 - on the system in question another daemon controls service auto-start/stop and CTDB just gets in the way. Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts - new default TCP port checker using "ctdb checktcpport" New function ctdb_check_tcp_ports_ctdb(). This should be fast... and is now the default checker. If it fails in an unexpected way we fall back to the nmap and netstat checkers. Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts - generalise TCP port checking plus new nmap-based checker. Split the netstat-specific parts of ctdb_check_tcp_ports() into new function ctdb_check_tcp_ports_netstat(). Implement new ctdb_check_tcp_ports_nmap() function that uses "nmap -PS" to check if the desired ports are listening. ctdb_check_ctdb_ports() now uses new configuration variable CTDB_TCP_PORT_CHECKERS to decide which port checkers to try. Default value is currently "nmap netstat". If nmap is not found then this will fall back to netstat - if logging is at debug level this will also fill the logs with messages saying the nmap checker failed. This indicates that either nmap should be installed or the default value of CTDB_TCP_PORT_CHECKERS should be changed (in a configuration file) to avoid trying to use nmap. Signed-off-by: Martin Schwenke <martin@meltin.net>
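The fallback between checkers can be sketched as a simple loop over CTDB_TCP_PORT_CHECKERS. The dispatcher name try_port_checkers, the port_checker_<name> naming and the exit code convention (0 for listening, 1 for not listening, anything else for "checker unusable") are assumptions for this example, not the interface of the real functions.

```shell
# Hypothetical dispatcher: try each configured checker in turn until one
# gives a definite answer about the ports passed as arguments.
try_port_checkers ()
{
    for _checker in ${CTDB_TCP_PORT_CHECKERS:-nmap netstat} ; do
        _rc=0
        "port_checker_${_checker}" "$@" || _rc=$?
        case "$_rc" in
            0|1)
                # Definite answer: all listening (0) or some not (1).
                return $_rc
                ;;
            *)
                # Checker missing or broken, fall through to the next one.
                echo "DEBUG: $_checker checker failed (status $_rc)" >&2
                ;;
        esac
    done

    echo "ERROR: no usable TCP port checker" >&2
    return 127
}
```

With this shape, setting CTDB_TCP_PORT_CHECKERS="netstat" in a configuration file skips nmap entirely and avoids the debug noise.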
Eventscripts - ctdb_check_tcp_ports() only prints netstat output if debugging Use the new debug function to conditionally print the netstat output. Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts - weaken TCP port check message if CTDB has just been started. Sometimes smbd and other services can take a while to start, especially when there is a lot of activity after ctdbd has just started. The TCP port check can then pollute the logs with lots of "ERROR" messages and possibly extra debug. This creates a flag file when a service is started (but not restarted) and this flag is removed the first time that TCP port checks succeed for that service. When a port check fails and the flag file still exists, a less extreme "INFO" message is printed rather than the usual "ERROR" message. This means that until the node actually becomes healthy we see more friendly messages. The subtext is that we quite often hear false positive reports ("recreates") of CQ S1024874 (samba stopped responding on port 445) when ctdbd is started. This reduces the chances of people reporting such false recreates... Signed-off-by: Martin Schwenke <martin@meltin.net>
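The flag file logic might look something like the sketch below. The flag directory, file naming, function names and message text are all hypothetical. Only the overall behaviour (flag created on start, removed on first success, INFO instead of ERROR while it exists) comes from the description above.

```shell
# Hypothetical flag directory, overridable for testing.
flag_dir="${FLAG_DIR:-/tmp/ctdb-example-flags}"

# Call when a service is started (not when it is restarted).
mark_service_started ()
{
    mkdir -p "$flag_dir" && touch "$flag_dir/$1.started"
}

# Report the result of a port check for a service.
# $1 = service name, $2 = port, $3 = "yes" if the port is listening.
report_port_check ()
{
    _flag="$flag_dir/$1.started"

    if [ "$3" = "yes" ] ; then
        # First success after startup: the service is up, drop the flag.
        rm -f "$_flag"
        return 0
    fi

    if [ -f "$_flag" ] ; then
        echo "INFO: $1 tcp port $2 is not responding (service still starting?)"
    else
        echo "ERROR: $1 tcp port $2 is not responding"
    fi
    return 1
}
```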
Eventscript functions: optimise ctdb_check_tcp_ports() and add debug.

ctdb_check_tcp_ports() runs "netstat -a -t -n" in a loop for each port. There are 2 problems with this:

* Netstat is run on each loop iteration when it need only be run once.

* The -a option is used to list all connections but the function only cares about the listening ports. There may be many thousands of non-listening ports to grep through.

This changes ctdb_check_tcp_ports() to run netstat with the -l option instead of the -a option. It also only runs netstat once before the main loop.

When a port is found to not be listening the output of the netstat command is now dumped to help with debugging.

Signed-off-by: Martin Schwenke <martin@meltin.net>
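The "run netstat once, then check each port" idea can be sketched as below. The function name ports_not_listening and the grep pattern are illustrative assumptions. The pattern assumes the usual Linux "netstat -l -t -n" layout where the local address ends in ":PORT" and the state column reads LISTEN.

```shell
# Hypothetical sketch: print each port that has no LISTEN entry.
# $1 is the captured netstat output, the remaining arguments are ports.
ports_not_listening ()
{
    _netstat_out="$1"
    shift

    for _port ; do
        # Match a tcp/tcp6 line whose local address ends in ":<port>".
        if ! printf '%s\n' "$_netstat_out" |
                grep -E -q "^tcp.*:${_port}[[:space:]].*LISTEN" ; then
            echo "$_port"
        fi
    done
}
```

Typical use would capture the output once, for example `_ns=$(netstat -l -t -n 2>/dev/null)`, and then call `ports_not_listening "$_ns" 445 139`, dumping `$_ns` for debugging if anything is printed.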
Eventscripts: add a debug() function and call ctdb_set_current_debuglevel() The debug function passes its arguments to echo if $CTDB_CURRENT_DEBUGLEVEL is >= 4 (i.e. DEBUG). If no args are given then use stdin - this allows the function to be used with here documents. To ensure $CTDB_CURRENT_DEBUGLEVEL is set, ctdb_set_current_debuglevel() is called near the end of the functions file. Signed-off-by: Martin Schwenke <martin@meltin.net>
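A minimal sketch of a debug function with the behaviour described above follows. It is a reconstruction from the description rather than a copy of the committed code, and it treats an unset CTDB_CURRENT_DEBUGLEVEL as 0, which is an assumption.

```shell
# Print arguments (or stdin, if there are no arguments) only when the
# current debug level is at least 4 (DEBUG).
debug ()
{
    [ "${CTDB_CURRENT_DEBUGLEVEL:-0}" -ge 4 ] || return 0

    if [ $# -gt 0 ] ; then
        echo "$@"
    else
        # No arguments: copy stdin, which allows use with here documents.
        cat
    fi
}
```

This allows both `debug "some message"` and `debug <<EOF ... EOF` in eventscripts.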
Eventscripts - new function ctdb_set_current_debuglevel()

This function ensures that CTDB_CURRENT_DEBUGLEVEL is set. It works like this:

1. If it is already set then do nothing, since it might have been set some other way. The recommended "other way" would be to add a file in rc.local.d/.

2. If it is not set then set it by sourcing /var/ctdb/eventscript_debuglevel.

3. If this file does not exist then create it using output from "ctdb getdebug".

If the optional 1st argument is set to "create" then don't source an existing file but create a new one instead - this is useful for creating the file just once in each event run in, say, 00.ctdb.

If there's a problem getting the debug level from ctdb then it is silently set to 0 - no use spamming logs if our debug code is broken...

Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts: In 60.nfs don't restart NFS when restarting rpc.lockd. This effectively reverts 953dbfbddad656a64e30a6aca115cb1479d11573 and is a policy decision. Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscripts: clean up 60.nfs monitor event.

This adds a helper function called nfs_check_rpc_service() and uses it to make the monitor event much more readable. An example of usage is as follows:

    nfs_check_rpc_service "mountd" \
        -ge 10 "verbose restart:b unhealthy" \
        -eq 5 "restart:b"

The first argument to nfs_check_rpc_service() is the name of the RPC service to be checked. The named RPC service is checked for availability using the rpcinfo command. If the service is available then the function succeeds and subsequent arguments are ignored.

If the rpcinfo check fails then a failure counter for that particular RPC service is incremented and subsequent arguments are processed in groups of 3:

1. An integer comparison operator supported by test.

2. An integer failure limit.

3. An action string.

The value of the failure counter is checked using (1) and (2) above. The first check that succeeds has its action string processed - note that this explains the somewhat curious reverse ordering of checks.

In the example above:

* If the counter is >= 10 then a verbose message is printed describing the failure, the service is restarted in the background and the node is marked as unhealthy (via an "exit 1" from the function).

* If the counter is == 5 then the service is restarted in the background.

For more action options please see the code.

This also changes the ctdb_check_rpc() function so that it no longer takes a program number to check. It now just takes a real RPC program name that rpcinfo can resolve via /etc/rpc.

Signed-off-by: Martin Schwenke <martin@meltin.net>
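The "groups of 3" processing can be sketched as a loop that stops at the first comparison that holds. The function name first_matching_action is hypothetical, and this sketch only selects the action string. Interpreting it (verbose, restart:b, unhealthy) is left to the real code.

```shell
# Hypothetical sketch: given a failure count followed by
# <operator> <limit> <action> triples, print the action of the first
# triple whose comparison succeeds. Returns 1 if none match.
first_matching_action ()
{
    _count="$1"
    shift

    while [ $# -ge 3 ] ; do
        _op="$1" ; _limit="$2" ; _action="$3"
        shift 3

        # _op is an integer operator understood by test, e.g. -ge or -eq.
        if [ "$_count" "$_op" "$_limit" ] ; then
            echo "$_action"
            return 0
        fi
    done

    return 1
}
```

Because the first match wins, the stricter check (-ge 10) has to come before the looser one (-eq 5), which is why the checks are listed in descending order of limit.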
Eventscripts: improvements to ctdb_service_check_reconfigure().

* Make this function applicable to "ipreallocated" event too.

* Monitor event should not always succeed just because we reconfigure. If the service was unhealthy before the reconfigure and we end the reconfigure with "exit 0" then we can cause the node's health status to flip-flop. To avoid this we return the status of the service from the previous monitor event.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Eventscript functions: new function ctdb_check_counter().

This should eventually be able to replace ctdb_check_counter_limit() and ctdb_check_counter_equal(), although it doesn't issue warnings like the former.

It takes 4 optional arguments:

1. _msg - If "error" then over limit causes an error message and an exit 1. Anything else fails silently but the function returns 1. Default is "error".

2. _op - An integer operator supported by test (e.g. -eq, -ge, -gt). Default is -ge.

3. _limit - Limit for the counter to be used in comparison. Default is $service_fail_limit.

4. _service_name - Used to identify the counter. Default is $service_name.

For example:

    ctdb_check_counter error -ge 5 foo

will print a message and exit 1 if the counter for foo is >= 5, whereas

    ctdb_check_counter check -ge 5 foo

will just return 1 if the counter for foo is >= 5, and

    ctdb_check_counter

will print a message and exit 1 if the counter for $service_name is >= $service_fail_limit.

Signed-off-by: Martin Schwenke <martin@meltin.net>
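To make the argument defaults concrete, here is a sketch of a counter check with the same four optional arguments. The function name check_counter, the counter storage (one file per service holding an integer), the directory and the message text are assumptions for the example. The real function may store and report counters differently.

```shell
# Hypothetical counter directory, overridable for testing.
counter_dir="${COUNTER_DIR:-/tmp/ctdb-example-counters}"

check_counter ()
{
    _msg="${1:-error}"                        # "error" => message + exit 1
    _op="${2:--ge}"                           # operator understood by test
    _limit="${3:-${service_fail_limit:-1}}"   # fallback of 1 is an assumption
    _service_name="${4:-${service_name:-}}"

    # Assumed storage: a file containing the current failure count.
    _value=0
    if [ -f "$counter_dir/$_service_name" ] ; then
        read -r _value < "$counter_dir/$_service_name" || :
    fi

    if [ "$_value" "$_op" "$_limit" ] ; then
        if [ "$_msg" = "error" ] ; then
            echo "ERROR: $_value consecutive failures for $_service_name"
            exit 1
        fi
        return 1
    fi

    return 0
}
```

Note that the "error" form exits the calling script, so it is only suitable at a point where marking the node unhealthy is the intended outcome.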
Eventscripts: remove unused remove_ip() function. Signed-off-by: Martin Schwenke <martin@meltin.net>