A Troubleshooting
Strange problems may occur that are not easy to understand, especially when you start to experiment with High Availability. However, several utilities allow you to take a closer look at the internal High Availability processes. This chapter describes common problems and recommends solutions.
A1 Installation and first steps
Troubleshooting difficulties when installing the packages or bringing the cluster online.
- Are the HA packages installed?
- The packages needed for configuring and managing a cluster are included in the High Availability installation pattern, available with SUSE Linux Enterprise High Availability.

  Check if SUSE Linux Enterprise High Availability is installed on each of the cluster nodes and if the pattern is installed on each of the machines as described in the Installation and Setup Quick Start.
- Is the initial configuration the same for all cluster nodes?
- To communicate with each other, all nodes belonging to the same cluster need to use the same bindnetaddr, mcastaddr and mcastport as described in Chapter 4, Using the YaST cluster module.

  Check if the communication channels and options configured in /etc/corosync/corosync.conf are the same for all cluster nodes.

  In case you use encrypted communication, check if the /etc/corosync/authkey file is available on all cluster nodes.

  All corosync.conf settings except for nodeid must be the same; authkey files on all nodes must be identical.
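One way to compare the files is by checksum. The following is a minimal sketch, not part of the product tooling: the helper names conf_fingerprint and file_fingerprint are hypothetical, and the sketch assumes that sha256sum is installed and that every nodeid setting sits on a line of its own.

```shell
# Hypothetical helper: print a SHA-256 checksum of a corosync.conf-style
# file, ignoring lines that set nodeid (the only setting allowed to differ).
conf_fingerprint() {
    grep -v '^[[:space:]]*nodeid[[:space:]]*:' "$1" | sha256sum | cut -d' ' -f1
}

# Hypothetical helper: print a SHA-256 checksum of a file as-is,
# for example for /etc/corosync/authkey.
file_fingerprint() {
    sha256sum < "$1" | cut -d' ' -f1
}
```

Run conf_fingerprint /etc/corosync/corosync.conf and file_fingerprint /etc/corosync/authkey on each node. If a checksum differs between nodes, the corresponding file needs a closer look.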
- Does the firewall allow communication via the mcastport?
- If the mcastport used for communication between the cluster nodes is blocked by the firewall, the nodes cannot see each other. When doing the initial setup with YaST or the bootstrap scripts (as described in Chapter 4, Using the YaST cluster module or the Article “Installation and Setup Quick Start”, respectively), the firewall settings are usually automatically adjusted.

  To make sure the mcastport is not blocked by the firewall, check the firewall settings on each node.
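The check can be sketched as follows. This is an illustration under assumptions, not a definitive procedure: the helper names mcastports and udp_port_open are hypothetical, and udp_port_open assumes that firewalld is the active firewall. It only inspects the default zone of the running configuration.

```shell
# Hypothetical helper: print every mcastport value found in a
# corosync.conf-style file ($1).
mcastports() {
    awk -F: '/^[ \t]*mcastport[ \t]*:/ { gsub(/[ \t\r]/, "", $2); print $2 }' "$1"
}

# Assumption: firewalld is in use. Succeeds if UDP port $1 is open in the
# default zone of the running firewalld configuration.
udp_port_open() {
    firewall-cmd --query-port="$1/udp"
}
```

For example, run mcastports /etc/corosync/corosync.conf on a node and pass each printed port to udp_port_open. If your nodes use a different firewall or a non-default zone, adapt the second helper accordingly.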
- Are Pacemaker and Corosync started on each cluster node?
- Usually, starting Pacemaker also starts the Corosync service. To check if both services are running:

    # crm cluster status

  In case they are not running, start them by executing the following command:

    # crm cluster start
A2 Logging
- Where can I find the log files?
- Pacemaker writes its log files into the /var/log/pacemaker directory. The main Pacemaker log file is /var/log/pacemaker/pacemaker.log. In case you cannot find the log files, check the logging settings in /etc/sysconfig/pacemaker, Pacemaker's own configuration file. If PCMK_logfile is configured there, Pacemaker uses the path that is defined by this parameter.

  If you need a cluster-wide report showing all relevant log files, see How can I create a report with an analysis of all my cluster nodes? for more information.
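The lookup described above can be sketched as a small shell function. The helper name pacemaker_logfile is hypothetical, and the sketch assumes that PCMK_logfile, if set, is a plain KEY=VALUE assignment on an uncommented line.

```shell
# Hypothetical helper: print the log file Pacemaker is configured to use.
# $1 is the sysconfig file (default: /etc/sysconfig/pacemaker).
pacemaker_logfile() {
    conf="${1:-/etc/sysconfig/pacemaker}"
    # Take the last uncommented PCMK_logfile assignment, if any,
    # and strip surrounding quotes.
    value=$(sed -n 's/^[[:space:]]*PCMK_logfile=//p' "$conf" 2>/dev/null | tail -n 1 | tr -d "\"'")
    # Fall back to the default location named above.
    echo "${value:-/var/log/pacemaker/pacemaker.log}"
}
```

Calling pacemaker_logfile without arguments prints the path to inspect on the current node.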
- I enabled monitoring but there is no trace of monitoring operations in the log files?
- The pacemaker-execd daemon does not log recurring monitor operations unless an error occurred. Logging all recurring operations would produce too much noise. Therefore, recurring monitor operations are logged only once an hour.
- I only get a “failed” message. Is it possible to get more information?
- Add the --verbose parameter to your commands. If you do that multiple times, the debug output becomes more verbose. See the logging data (sudo journalctl -n) for useful hints.
- How can I get an overview of all my nodes and resources?
- Use the crm_mon command. The following displays the resource operation history (option -o) and inactive resources (-r):

    # crm_mon -o -r

  The display is refreshed when the status changes (to cancel this, press Ctrl+C). An example may look like:

  Example A1: Stopped resources

    Last updated: Fri Aug 15 10:42:08 2014
    Last change: Fri Aug 15 10:32:19 2014
    Stack: corosync
    Current DC: bob (175704619) - partition with quorum
    Version: 1.1.12-ad083a8
    2 Nodes configured
    3 Resources configured

    Online: [ alice bob ]

    Full list of resources:

    my_ipaddress    (ocf:heartbeat:Dummy): Started bob
    my_filesystem   (ocf:heartbeat:Dummy): Stopped
    my_webserver    (ocf:heartbeat:Dummy): Stopped

    Operations:
    * Node bob:
        my_ipaddress: migration-threshold=3
          + (14) start: rc=0 (ok)
          + (15) monitor: interval=10000ms rc=0 (ok)
    * Node alice:

  The Pacemaker Explained PDF, available at https://www.clusterlabs.org/pacemaker/doc/, covers three different recovery types in the How are OCF Return Codes Interpreted? section.
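To pick the stopped resources out of a longer listing, the output can be filtered. The following is a minimal sketch: the helper name stopped_resources is hypothetical, and the filter assumes a resource line layout like the one in Example A1 (name, agent in parentheses, state), optionally preceded by an asterisk. Other Pacemaker versions may format the list differently.

```shell
# Hypothetical helper: read crm_mon output on stdin and print the names of
# resources whose last column reads "Stopped".
stopped_resources() {
    awk '{ sub(/^[ \t]*\*[ \t]*/, "") }                    # drop a leading "* " bullet
         $NF == "Stopped" && $2 ~ /^\(/ { print $1 }'       # name (agent): Stopped
}
```

For a one-time snapshot instead of the continuously refreshed display, crm_mon accepts the -1 option, so the filter can be used as crm_mon -1 -r | stopped_resources.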
- How can I view the logs?
- For a more detailed view of what is happening in your cluster, use the following command:

    # crm history log [NODE]

  Replace NODE with the node you want to examine, or leave it empty. See Section A5, “History” for further information.
A3 Resources
- How can I clean up my resources?
- Use the following commands:

    # crm resource list
    # crm resource cleanup rscid [node]

  If you leave out the node, the resource is cleaned on all nodes. More information can be found in Section 8.5.2, “Cleaning up cluster resources with crmsh”.
- How can I list my currently known resources?
- Use the command crm resource list to display your current resources.
- I configured a resource, but it always fails. Why?
- To check an OCF script, use ocf-tester, for example:

    ocf-tester -n ip1 -o ip=YOUR_IP_ADDRESS \
      /usr/lib/ocf/resource.d/heartbeat/IPaddr

  Use -o multiple times for more parameters. The list of required and optional parameters can be obtained by running crm ra info AGENT, for example:

    # crm ra info ocf:heartbeat:IPaddr

  Before running ocf-tester, make sure the resource is not managed by the cluster.
- Why do resources not fail over and why are there no errors?
- The terminated node might be considered unclean. In that case, it must be fenced. If the STONITH resource is not operational or does not exist, the remaining node waits for the fencing to happen. The fencing timeouts are typically high, so it might take a while to see any obvious sign of problems (if ever).

  Another possible explanation is that a resource is simply not allowed to run on this node. That may be because of a failure which happened in the past and which was not “cleaned”. Or it may be because of an earlier administrative action, that is, a location constraint with a negative score. Such a location constraint is inserted by the crm resource move command, for example.
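To spot such constraints, the cluster configuration can be filtered for location constraints with a negative score. This is only a sketch: the helper name negative_location_constraints is hypothetical, and it assumes that each constraint is printed on a single line in the same syntax as the location examples in this chapter, with the score written as a negative number, -inf or -INFINITY.

```shell
# Hypothetical helper: read crm configuration text on stdin and print
# location constraints that carry a negative score.
negative_location_constraints() {
    awk '$1 == "location" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^-(inf|INFINITY|[0-9]+):$/) { print; break }
    }'
}
```

A possible invocation is crm configure show | negative_location_constraints. Any constraint it prints names a node where the resource is not allowed or not preferred to run.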
- Why can I never tell where my resource will run?
- If there are no location constraints for a resource, its placement is subject to an (almost) random node choice. You are well advised to always express a preferred node for resources. That does not mean that you need to specify location preferences for all resources. One preference suffices for a set of related (colocated) resources. A node preference looks like this:

    location rsc-prefers-alice rsc 100: alice
A4 STONITH and fencing
- Why does my STONITH resource not start?
- A start (or enable) operation includes checking the status of the device. If the device is not ready, the STONITH resource fails to start.

  At the same time, the STONITH plugin is asked to produce a host list. If this list is empty, there is no point in running a STONITH resource which cannot shoot anything. The name of the host on which STONITH is running is filtered from the list, since the node cannot shoot itself.

  To use single-host management devices such as lights-out devices, make sure that the STONITH resource is not allowed to run on the node which it is supposed to fence. Use an infinitely negative location node preference (constraint). The cluster will move the STONITH resource to another place where it can start, but not before informing you.
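Such a constraint uses the same location syntax as a node preference, with a score of -inf. The following sketch only prints the constraint text so that it can be reviewed before it is entered in the cluster configuration. The helper name stonith_avoids_node and the generated constraint ID are hypothetical.

```shell
# Hypothetical helper: print a location constraint that keeps STONITH
# resource $1 off node $2 (the node it is supposed to fence).
stonith_avoids_node() {
    printf 'location %s-avoids-%s %s -inf: %s\n' "$1" "$2" "$1" "$2"
}
```

For a STONITH resource named stonith-alice that fences the node alice (both names are examples), stonith_avoids_node stonith-alice alice prints location stonith-alice-avoids-alice stonith-alice -inf: alice.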
- Why does fencing not happen, although I have the STONITH resource?
- Each STONITH resource must provide a host list. This list may be inserted by hand in the STONITH resource configuration or retrieved from the device itself, from outlet names, for example. That depends on the nature of the STONITH plugin. pacemaker-fenced uses the list to find out which STONITH resource can fence the target node. Only if the node appears in the list can the STONITH resource shoot (fence) the node.

  If pacemaker-fenced does not find the node in any of the host lists provided by running STONITH resources, it asks pacemaker-fenced instances on other nodes. If the target node does not show up in the host lists of other pacemaker-fenced instances, the fencing request ends in a timeout at the originating node.
- Why does my STONITH resource fail occasionally?
- Power management devices may give up if there is too much broadcast traffic. Space out the monitor operations. Given that fencing is necessary only occasionally (and hopefully never), checking the device status once every few hours is more than enough.

  Also, some of these devices may refuse to talk to more than one party at the same time. This may be a problem if you keep a terminal or browser session open while the cluster tries to test the status.
A5 History
- How can I retrieve status information or a log from a failed resource?
- Use the history command and its subcommand resource:

    # crm history resource NAME1

  This gives you a full transition log for the given resource only. However, it is possible to investigate more than one resource. Append the resource names after the first.

  If you followed naming conventions (see Appendix B, Naming conventions), the resource command makes it easier to investigate a group of resources. For example, this command investigates all primitives starting with db:

    # crm history resource db*

  View the log file in /var/cache/crm/history/live/alice/ha-log.txt.
- How can I reduce the history output?
- There are two options for the history command:

  - Use exclude
  - Use timeframe

  The exclude command lets you set an additive regular expression that excludes certain patterns from the log. For example, the following command excludes all SSH, systemd, and kernel messages:

    # crm history exclude "ssh|systemd|kernel."

  With the timeframe command, you limit the output to a certain range. For example, the following command shows all the events on August 23 from 12:00 to 12:30:

    # crm history timeframe "Aug 23 12:00" "Aug 23 12:30"
- How can I store a “session” for later inspection?
- When you encounter a bug or an event that needs further examination, it is useful to store all the current settings. This file can be sent to support or viewed with bzless. For example:

    crm(live)history# timeframe "Oct 13 15:00" "Oct 13 16:00"
    crm(live)history# session save tux-test
    crm(live)history# session pack
    Report saved in '/root/tux-test.tar.bz2'
A6 Hawk2
- Replacing the self-signed certificate
- To avoid the warning about the self-signed certificate on first Hawk2 start-up, replace the automatically created certificate with your own certificate (or a certificate that was signed by an official Certificate Authority, CA):

  - Replace /etc/hawk/hawk.key with the private key.
  - Replace /etc/hawk/hawk.pem with the certificate that Hawk2 should present.
  - Restart the Hawk2 services to reload the new certificate:

      # systemctl restart hawk-backend hawk

  Change ownership of the files to root:haclient and make the files accessible to the group:

    # chown root:haclient /etc/hawk/hawk.key /etc/hawk/hawk.pem
    # chmod 640 /etc/hawk/hawk.key /etc/hawk/hawk.pem
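To confirm that ownership and permissions are as expected, a check along these lines may help. It is a sketch only: the helper names has_mode and has_owner are hypothetical, and the -c option of stat assumes GNU coreutils.

```shell
# Hypothetical helper: succeed if file $1 has exactly the octal mode $2.
has_mode() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Hypothetical helper: succeed if file $1 is owned by user:group $2.
has_owner() {
    [ "$(stat -c '%U:%G' "$1")" = "$2" ]
}
```

For example, has_mode /etc/hawk/hawk.key 640 && has_owner /etc/hawk/hawk.key root:haclient succeeds only if both settings match. Repeat the check for /etc/hawk/hawk.pem.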
A7 Miscellaneous
- How can I run commands on all cluster nodes?
- Use the command crm cluster run for this task. For example:

    # crm cluster run "ls -l /etc/corosync/*.conf"
    INFO: [alice]
    -rw-r--r-- 1 root root 812 Oct 27 15:42 /etc/corosync/corosync.conf
    INFO: [bob]
    -rw-r--r-- 1 root root 812 Oct 27 15:42 /etc/corosync/corosync.conf
    INFO: [charlie]
    -rw-r--r-- 1 root root 812 Oct 27 15:42 /etc/corosync/corosync.conf

  By default, the specified command runs on all nodes in the cluster. Alternatively, you can run the command on a specific node or group of nodes:

    # crm cluster run "ls -l /etc/corosync/*.conf" alice bob
- What is the state of my cluster?
- To check the current state of your cluster, use one of the programs crm_mon or crm status. This displays the current DC and all the nodes and resources known by the current node.
- Why can several nodes of my cluster not see each other?
- There could be several reasons:

  - Look first in the configuration file /etc/corosync/corosync.conf. Check if the multicast or unicast address is the same for every node in the cluster (look in the interface section with the key mcastaddr).
  - Check your firewall settings.
  - Check if your switch supports multicast or unicast addresses.
  - Check if the connection between your nodes is broken. Most often, this is the result of a badly configured firewall. This also may be the reason for a split-brain condition, where the cluster is partitioned.
- Why can an OCFS2 device not be mounted?
- Check the log messages (sudo journalctl -n) for the following lines:

    Jan 12 09:58:55 alice pacemaker-execd: [3487]: info: RA output: [...]
      ERROR: Could not load ocfs2_stackglue
    Jan 12 16:04:22 alice modprobe: FATAL: Module ocfs2_stackglue not found.

  In this case, the kernel module ocfs2_stackglue.ko is missing. Install the package ocfs2-kmp-default, ocfs2-kmp-pae or ocfs2-kmp-xen, depending on the installed kernel.
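Which of the three packages matches can be derived from the kernel release string. The following is a sketch under an assumption: the helper name ocfs2_kmp_package is hypothetical, and it assumes that the kernel flavor (default, pae or xen) is the part of the release string after the last hyphen.

```shell
# Hypothetical helper: derive the OCFS2 kernel module package name from a
# kernel release string ($1, default: the running kernel from uname -r).
ocfs2_kmp_package() {
    release="${1:-$(uname -r)}"
    # Assumption: the flavor follows the last hyphen of the release string.
    echo "ocfs2-kmp-${release##*-}"
}
```

Compare the printed name with the three packages listed above before installing anything. If it is not one of them, the assumption about the release string does not hold for your kernel.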
A8 Cluster reports
- How can I create a report with an analysis of all my cluster nodes?
- On the CRM Shell, use crm report to create a report. This tool compiles:

  - Cluster-wide log files,
  - Package states,
  - DLM/OCFS2 states,
  - System information,
  - CIB history,
  - Parsing of core dump reports, if a debuginfo package is installed.

  Usually run crm report with the following command:

    # crm report -f 0:00 -n alice -n bob

  The command extracts all information since 0:00 (midnight) on the hosts alice and bob and creates a *.tar.bz2 archive named crm_report-DATE.tar.bz2 in the current directory, for example, crm_report-Wed-03-Mar-2012. If you are only interested in a specific time frame, add the end time with the -t option.

  Warning: Remove sensitive information

  The crm report tool tries to remove any sensitive information from the CIB and the PE input files. However, it cannot know everything. If you have more sensitive information, supply additional patterns with the -p option (see man page). The log files and the crm_mon, ccm_tool, and crm_verify output are not sanitized.

  Before sharing your data in any way, check the archive and remove all information you do not want to expose.

  Customize the command execution with further options. For example, if you have another user with permissions to the cluster (in addition to root and hacluster), use the -u option and specify this user. If you have a non-standard SSH port, use the -X option to add the port. If you need to simplify the arguments, set your default values in the configuration file /etc/crm/crm.conf, section report. For more information, see the man page of crm report.

  Procedure A1: Generating a cluster report using a custom SSH port

  - When using a custom SSH port, use the -X option with crm report to modify the client's SSH port. For example, if your custom SSH port is 5022, use the following command:

      # crm report -X "-p 5022" [...]

  - To set your custom SSH port permanently for crm report, start the interactive CRM Shell:

      # crm options

  - Enter the following:

      crm(live)options# set core.report_tool_options "-X -oPort=5022"
  After crm report has analyzed all the relevant log files and created the directory (or archive), check the log files for an uppercase ERROR string. The most important files in the top level directory of the report are:

  - analysis.txt
    Compares files that should be identical on all nodes.
  - corosync.txt
    Contains a copy of the Corosync configuration file.
  - crm_mon.txt
    Contains the output of the crm_mon command.
  - description.txt
    Contains all cluster package versions on your nodes. There is also the sysinfo.txt file which is node specific. It is linked to the top directory.
    This file can be used as a template to describe the issue you encountered and post it to https://github.com/ClusterLabs/crmsh/issues.
  - members.txt
    A list of all nodes.
  - sysinfo.txt
    Contains a list of all relevant package names and their versions. Additionally, there is also a list of configuration files which are different from the original RPM package.

  Node-specific files are stored in a subdirectory named after the node. It contains a copy of the directory /etc of the respective node.
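The search for the uppercase ERROR string can be sketched as follows. The helper name report_errors is hypothetical, and the sketch assumes that the report has already been unpacked into a directory and that grep supports the -r option.

```shell
# Hypothetical helper: list files below report directory $1 that contain
# the uppercase word ERROR, followed by the number of matching lines.
report_errors() {
    # -r: recurse, -c: count matching lines per file, -w: whole word only.
    # The second grep drops files without any match.
    grep -rcw 'ERROR' "$1" | grep -v ':0$'
}
```

Each output line has the form FILE:COUNT. The files with the highest counts are usually a good place to start reading.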
A9 For more information
For additional information about high availability on Linux, including configuring cluster resources and managing and customizing a High Availability cluster, see https://clusterlabs.org/wiki/Documentation.