
Installing a Basic Two-Node High Availability Cluster
SUSE Linux Enterprise High Availability 16.0

Publication Date: 31 Oct 2025
WHAT?

How to set up a basic two-node High Availability cluster with QDevice, diskless SBD and a software watchdog.

WHY?

This cluster can be used for testing purposes or as a minimal cluster configuration that can be extended later.

EFFORT

Setting up a basic High Availability cluster takes approximately 15 minutes, depending on the speed of your network connection.

GOAL

Get started with SUSE Linux Enterprise High Availability quickly and easily.

1 Usage scenario

This guide describes the setup of a minimal High Availability cluster with the following properties:

  • Two cluster nodes with passwordless SSH access to each other.

  • A floating, virtual IP address that allows clients to connect to the graphical management tool Hawk, no matter which node the service is running on.

  • Diskless SBD (STONITH Block Device) and a software watchdog used as the node fencing mechanism to avoid split-brain scenarios.

  • QDevice working with QNetd to participate in cluster quorum decisions. QDevice and QNetd are required for this setup so that diskless SBD can handle split-brain scenarios for the two-node cluster.

  • Failover of resources from one node to another if the active host breaks down (active/passive setup).

This is a simple cluster setup with minimal external requirements. You can use this cluster for testing purposes or as a basic cluster configuration that you can extend for a production environment later.

2 Installation overview

To install the High Availability cluster described in Section 1, “Usage scenario”, you must perform the following tasks:

  1. Review Section 3, “System requirements” to make sure you have everything you need.

  2. Install SUSE Linux Enterprise High Availability on the cluster nodes as described in Section 4, “Enabling the High Availability extension”.

  3. Install QNetd on a non-cluster server as described in Section 5, “Setting up the QNetd server”.

  4. Initialize the cluster on the first node as described in Section 6, “Setting up the first node”.

  5. Add the second node to the cluster as described in Section 7, “Adding the second node”.

  6. Log in to the Hawk Web interface to monitor the cluster as described in Section 8, “Logging in to Hawk”.

  7. Check the status of QDevice and QNetd as described in Section 9, “Checking the QDevice and QNetd setup”.

  8. Perform basic tests to make sure the cluster works as expected, as described in Section 10, “Testing the cluster”.

  9. Review Section 11, “Next steps” for advice on expanding the cluster for a production environment.

3 System requirements

This section describes the system requirements for a minimal setup of SUSE Linux Enterprise High Availability.

3.1 Hardware requirements

Servers

Three servers: two to act as cluster nodes, and one to run QNetd.

The servers can be bare metal or virtual machines. They do not require identical hardware (memory, disk space, etc.), but they must have the same architecture. Cross-platform clusters are not supported.

See the System Requirements section at https://www.suse.com/download/sle-ha/ for more details about server hardware.

Network Interface Cards (NICs)

At least two NICs per cluster node. This allows you to configure two or more communication channels for the cluster, using one of the following methods:

  • Combine the NICs into a network bond (preferred). In this case, you must set up the bonded device on each node before you initialize the cluster.

  • Create a second communication channel in Corosync. This can be configured by the cluster setup script. In this case, the two NICs must be on different subnets.
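
If you use the bonding option (the preferred method above) and your nodes use NetworkManager, you could create an active-backup bond before initializing the cluster. This is only a sketch: the connection names, interface names (bond0, eth0, eth1), bonding mode and IP address are examples that depend on your environment.

> sudo nmcli connection add type bond con-name bond0 ifname bond0 bond.options "mode=active-backup,miimon=100"
> sudo nmcli connection add type ethernet con-name bond0-port1 ifname eth0 master bond0
> sudo nmcli connection add type ethernet con-name bond0-port2 ifname eth1 master bond0
> sudo nmcli connection modify bond0 ipv4.method manual ipv4.addresses 192.168.1.185/24
> sudo nmcli connection up bond0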

Node fencing

To be supported, all SUSE Linux Enterprise High Availability clusters must have at least one node fencing device configured. For critical workloads, we recommend using two or three fencing devices. A fencing device can be either a physical device (a power switch) or a software mechanism (SBD in combination with a watchdog).

The minimal setup described in this guide uses a software watchdog and diskless SBD, so no additional hardware is required. Before using this cluster in a production environment, replace the software watchdog with a hardware watchdog.
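
To check whether a watchdog is already available on a node, you can look for a watchdog device and loaded watchdog modules. These commands are illustrative checks only:

> ls -l /dev/watchdog*
> lsmod | grep -E '(wd|dog)'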

3.2 Software requirements

Operating system

All nodes and the QNetd server must have SUSE Linux Enterprise Server installed and registered.

High Availability extension

The SUSE Linux Enterprise High Availability extension requires an additional registration code.

This extension can be enabled during the SLES installation, or you can enable it later on a running system. This guide explains how to enable and register the extension on a running system.

3.3 Network requirements

Time synchronization

All systems must synchronize to an NTP server outside the cluster. SUSE Linux Enterprise Server uses chrony for NTP. When you initialize the cluster, you are warned if chrony is not running.

Even if the nodes are synchronized, log files and cluster reports can still become difficult to analyze if the nodes have different time zones configured.
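
As a quick sanity check before you initialize the cluster, you can verify chrony and the configured time zone on each node, for example:

> systemctl status chronyd
> chronyc sources
> timedatectl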

Host name and IP address

All cluster nodes must be able to find each other, and the QNetd server, by name. Use the following methods for reliable name resolution:

  • Use static IP addresses.

  • List all nodes in the /etc/hosts file with their IP address, FQDN and short host name.

Only the primary IP address on each NIC is supported.
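
For example, an /etc/hosts file for the example hosts used in this guide (alice, bob and the QNetd server charlie) might contain entries like the following. The IP addresses and the example.com domain are examples only; ideally, the QNetd server is reachable via a separate network:

192.168.1.185   alice.example.com     alice
192.168.1.168   bob.example.com       bob
192.168.2.10    charlie.example.com   charlie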

SSH

All cluster nodes must be able to access each other, and the QNetd server, via SSH. Certain cluster operations also require passwordless SSH authentication. When you initialize the cluster, the setup script checks for existing SSH keys and generates them if they do not exist.

Important
Important: root SSH access in SUSE Linux Enterprise 16

In SUSE Linux Enterprise 16, root SSH login with a password is disabled by default.

On each node, and the QNetd server, either create a user with sudo privileges or set up passwordless SSH authentication for the root user before you initialize the cluster.

If you initialize the cluster with a sudo user, certain crmsh commands also require passwordless sudo permission.
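
As an illustration only, you could create a dedicated user with passwordless sudo permission on each node and on the QNetd server. The user name cluster-admin is an example, not a requirement:

> sudo useradd -m cluster-admin
> sudo passwd cluster-admin
> echo 'cluster-admin ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/cluster-admin
> sudo chmod 0440 /etc/sudoers.d/cluster-admin

Alternatively, to use the root user, set up passwordless SSH authentication between the machines, for example with ssh-keygen and ssh-copy-id.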

Separate network for QNetd

We recommend having the cluster nodes reach the QNetd server via a different network than the one Corosync uses. Ideally, the QNetd server should be in a separate rack from the cluster, or at least on a separate PSU and not in the same network segment as the Corosync communication channels.

4 Enabling the High Availability extension

This procedure explains how to install SUSE Linux Enterprise High Availability on an existing SUSE Linux Enterprise Server. You can skip this procedure if you already installed the High Availability extension and packages during the SLES installation with Agama.

Requirements
  • SUSE Linux Enterprise Server is installed and registered with the SUSE Customer Center.

  • You have an additional registration code for SUSE Linux Enterprise High Availability.

Perform this procedure on all the machines you intend to use as cluster nodes:

  1. Log in either as the root user or as a user with sudo privileges.

  2. Check whether the High Availability extension is already enabled:

    > sudo SUSEConnect --list-extensions
  3. Check whether the High Availability packages are already installed:

    > zypper search ha_sles
  4. Enable the SUSE Linux Enterprise High Availability extension:

    > sudo SUSEConnect -p sle-ha/16.0/x86_64 -r HA_REGCODE
  5. Install the High Availability packages:

    > sudo zypper install -t pattern ha_sles
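
Optionally, you can confirm that the extension is registered and the pattern is installed:

> sudo SUSEConnect --status-text
> zypper search -i -t pattern ha_sles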

5 Setting up the QNetd server

QNetd is an arbitrator that provides a vote to the QDevice service running on the cluster nodes. The QNetd server runs outside the cluster, so you cannot move cluster resources to this server. QNetd can support multiple clusters if each cluster has a unique name.

By default, QNetd runs the corosync-qnetd daemon as the user coroqnetd in the group coroqnetd. This avoids running the daemon as root.

Requirements
  • SUSE Linux Enterprise Server is installed and registered with the SUSE Customer Center.

  • You have an additional registration code for SUSE Linux Enterprise High Availability.

  • We recommend having the cluster nodes reach the QNetd server via a different network than the one Corosync uses.

Perform this procedure on a server that is not part of the cluster:

  1. Log in either as the root user or as a user with sudo privileges.

  2. Enable the SUSE Linux Enterprise High Availability extension:

    > sudo SUSEConnect -p sle-ha/16.0/x86_64 -r HA_REGCODE
  3. Install the corosync-qnetd package:

    > sudo zypper install corosync-qnetd

    You do not need to manually start the corosync-qnetd service. It starts automatically when you configure QDevice on the cluster.

The QNetd server is ready to accept connections from a QDevice client (corosync-qdevice). Further configuration is handled by crmsh when you connect QDevice clients.
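
To verify the QNetd server later, you can check the service and list the connected clusters. The corosync-qnetd-tool command only shows clients after QDevice has been configured on the cluster nodes:

> sudo systemctl status corosync-qnetd
> sudo corosync-qnetd-tool -l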

6 Setting up the first node

SUSE Linux Enterprise High Availability includes setup scripts to simplify the installation of a cluster. To set up the cluster on the first node, use the crm cluster init script.

6.1 Overview of the crm cluster init script

The crm cluster init command starts a script that defines the basic parameters needed for cluster communication, resulting in a running one-node cluster.

The script checks and configures the following components:

NTP

Checks if chrony is configured to start at boot time. If not, a message appears.

SSH

Detects or generates SSH keys for passwordless login between cluster nodes.

Firewall

Opens the ports in the firewall that are needed for cluster communication.

Csync2

Configures Csync2 to replicate configuration files across all nodes in a cluster.

Corosync

Configures the cluster communication system.

SBD/watchdog

Checks if a watchdog exists and asks whether to configure SBD as the node fencing mechanism.

Hawk cluster administration

Enables the Hawk service and displays the URL for the Hawk Web interface.

Virtual floating IP

Asks whether to configure a virtual IP address for the Hawk Web interface.

QDevice/QNetd

Asks whether to configure QDevice and QNetd to participate in quorum decisions. This is recommended for clusters with an even number of nodes, and especially for two-node clusters.

Note
Note: Pacemaker default settings

The options set by the crm cluster init script might not be the same as the Pacemaker default settings. You can check which settings the script changed in /var/log/crmsh/crmsh.log. Any options set during the bootstrap process can be modified later with crmsh.
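
For example, you can display the cluster properties set during bootstrap and adjust one of them later with crmsh. The property and value shown here are only an illustration:

> sudo crm configure show cib-bootstrap-options
> sudo crm configure property stonith-timeout=90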

Note
Note: Cluster configuration for different platforms

The crm cluster init script detects the system environment (for example, Microsoft Azure) and adjusts certain cluster settings based on the profile for that environment. For more information, see the file /etc/crm/profiles.yml.

6.2 Initializing the cluster on the first node

Configure the cluster on the first node with the crm cluster init script. The script prompts you for basic information about the cluster and configures the required settings and services. For more information, run the crm cluster init --help command.

Requirements
  • SUSE Linux Enterprise High Availability is installed and up to date.

  • All nodes have at least two network interfaces or a network bond, with static IP addresses listed in the /etc/hosts file along with each node's FQDN and short host name.

  • The QNetd server is installed. If you log in to the QNetd server as the root user, passwordless SSH authentication must be enabled.

Perform this procedure on only one node:

  1. Log in to the first node either as the root user or as a user with sudo privileges.

  2. Start the crm cluster init script:

    > sudo crm cluster init

    The script checks whether chrony is running, opens the required firewall ports, configures Csync2, and checks for SSH keys. If no SSH keys are available, the script generates them.

  3. Configure Corosync for cluster communication:

    1. Enter an IP address for the first communication channel (ring0). By default, the script proposes the address of the first available network interface. This could be either an individual interface or a bonded device. Accept this address or enter a different one.

    2. If the script detects multiple network interfaces, it asks whether you want to configure a second communication channel (ring1). If you configured the first channel with a bonded device, you can decline with n. If you need to configure a second channel, confirm with y and enter the IP address of another network interface. The two interfaces must be on different subnets.

    The script configures the default firewall ports for Corosync communication.

  4. Choose whether to set up SBD as the node fencing mechanism:

    1. Confirm with y that you want to use SBD.

    2. When prompted for a path to a block device, enter none to configure diskless SBD.

    The script configures SBD, including the relevant timeout settings. Unlike disk-based SBD, diskless SBD does not require a fence_sbd cluster resource.

    If no hardware watchdog is available, the script configures the software watchdog softdog.

  5. Configure a virtual IP address for cluster administration with the Hawk Web interface:

    1. Confirm with y that you want to configure a virtual IP address.

    2. Enter an unused IP address to use as the administration IP for Hawk.

    Instead of logging in to Hawk on an individual cluster node, you can connect to the virtual IP address.

  6. Choose whether to configure QDevice and QNetd:

    1. Confirm with y that you want to configure QDevice and QNetd.

    2. Enter the IP address or host name of the QNetd server, with or without a user name.

      • If you include a non-root user name, you are prompted for the password, and the script configures passwordless SSH authentication from the node to the QNetd server.

      • If you omit a user name, the script defaults to the root user, so passwordless SSH authentication must already be configured for the node to access the QNetd server.

      For the remaining fields, you can accept the default values or change them:

    3. Accept the proposed port (5403) or enter a different one.

    4. Choose the algorithm that determines how votes are assigned. The default is ffsplit.

    5. Choose the method to use when a tie-breaker is required. The default is lowest.

    6. Choose whether to enable TLS for client certificate checking. The default is on (attempt to connect with TLS, but connect without TLS if it is not available).

    7. Enter heuristics commands to assist in quorum calculation, or leave the field blank to skip this step.

    The script configures QDevice and QNetd, including SSH keys, the CA and server certificates, and the firewall port. It also enables the required services on the cluster nodes and on the QNetd server.

The script starts the cluster services to bring the cluster online and enable Hawk. The URL to use for Hawk is displayed on the screen. You can also check the status of the cluster with the crm status command.
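
The same setup can also be scripted instead of answering the prompts interactively. The following sketch assumes option names for the network interface, the Hawk virtual IP address, diskless SBD and the QNetd host; verify them with crm cluster init --help for your crmsh version before relying on them:

> sudo crm cluster init -i eth0 -A 192.168.1.10 -S --qnetd-hostname=charlie -y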

Important
Important: Secure password for hacluster

The crm cluster init script creates a default cluster user and password. Replace the default password with a secure one as soon as possible:

> sudo passwd hacluster

7 Adding the second node

Add more nodes to the cluster with the crm cluster join script. The script only needs access to an existing cluster node and completes the basic setup on the current machine automatically. For more information, run the crm cluster join --help command.

Requirements
  • SUSE Linux Enterprise High Availability is installed and up to date.

  • An existing cluster is already running on at least one node.

  • All nodes have at least two network interfaces or a network bond, with static IP addresses listed in the /etc/hosts file along with each node's FQDN and short host name.

  • If you log in as a sudo user: The same user must exist on all nodes and the QNetd server. This user must have passwordless sudo permission.

  • If you log in as the root user: Passwordless SSH authentication must be configured on all nodes and the QNetd server.

Perform this procedure on each additional node:

  1. Log in to this node with the same user account that you used to set up the first node.

  2. Start the crm cluster join script:

    • If you set up the first node as root, you can start the script with no additional parameters:

      # crm cluster join
    • If you set up the first node as a sudo user, you must specify that user with the -c option:

      > sudo crm cluster join -c USER@NODE1

    The script checks if chrony is running, opens the required firewall ports, and configures Csync2.

  3. If you did not already specify the first node with -c, you are prompted for its IP address or host name.

  4. If you did not already configure passwordless SSH authentication between the nodes, you are prompted for the password of the first node.

  5. Configure Corosync for cluster communication:

    1. The script proposes an IP address for ring0. This IP address must be on the same subnet as the IP address used for ring0 on the first node. If it is not, enter the correct IP address.

    2. If the cluster has two Corosync communication channels configured, the script prompts you for an IP address for ring1. This IP address must be on the same subnet as the IP address used for ring1 on the first node.

The script copies the cluster configuration from the first node, adjusts the timeout settings to account for the new node, and brings the new node online.

You can check the status of the cluster with the crm status command.
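
With the example node names used in this guide, both nodes should now be reported as online:

> sudo crm status
[...]
Node List:
  * Online: [ alice bob ]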

Important
Important: Secure password for hacluster

The crm cluster join script creates a default cluster user and password. On each node, replace the default password with a secure one as soon as possible:

> sudo passwd hacluster

8 Logging in to Hawk

Hawk allows you to monitor and administer a High Availability cluster using a graphical Web browser. You can also configure a virtual IP address that allows clients to connect to Hawk no matter which node it is running on.

Requirements
  • The client machine must be able to connect to the cluster nodes.

  • The client machine must have a graphical Web browser with JavaScript and cookies enabled.

You can perform this procedure on any machine that can connect to the cluster nodes:

  1. Start a Web browser and enter the following URL:

    https://HAWKSERVER:7630/

    Replace HAWKSERVER with the IP address or host name of a cluster node, or the Hawk virtual IP address if one is configured.

    Note
    Note: Certificate warning

    If a certificate warning appears when you access the URL for the first time, a self-signed certificate is in use. To verify the certificate, ask your cluster operator for the certificate details. To proceed anyway, you can add an exception in the browser to bypass the warning.

  2. On the Hawk login screen, enter the Username and Password of the hacluster user.

  3. Click Log In. The Hawk Web interface shows the Status screen by default.

The Status screen shows one configured resource: the virtual IP address admin-ip, running on a node called alice.
Figure 1: The Hawk Status screen

9 Checking the QDevice and QNetd setup

Use the crm corosync status command to check the cluster's quorum status and the status of QDevice and QNetd. You can run this command from any node in the cluster.

The following examples show a cluster with two nodes (alice and bob) and a QNetd server (charlie).

Example 1: Showing the cluster's quorum status
> sudo crm corosync status quorum
1 alice member
2 bob member

Quorum information
------------------
Date:             [...]
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          1.e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW alice
         2          1    A,V,NMW bob (local)
         0          1            Qdevice

The Membership information section shows the following status codes:

A (alive) or NA (not alive)

Shows the connectivity status between QDevice and Corosync.

V (vote) or NV (non vote)

Shows if the node has a vote. V means that both nodes can communicate with each other. In a split-brain scenario, one node would be set to V and the other node would be set to NV.

MW (master wins) or NMW (not master wins)

Shows if the master_wins flag is set. By default, the flag is not set, so the status is NMW.

NR (not registered)

Shows that the cluster is not using a quorum device.

Example 2: Showing the status of QDevice
> sudo crm corosync status qdevice
1 alice member
2 bob member

Qdevice information
-------------------
Model:                  Net
Node ID:                1
HB interval:            10000ms
Sync HB interval:       30000ms
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Heuristics:             Disabled
Ring ID:                1.e
Membership node list:   1, 2
Quorate:                Yes
Quorum node list:
    0   Node ID = 2, State = member
    1   Node ID = 1, State = member
Expected votes:         3
Last poll call:         [...]

Qdevice-net information
----------------------
Cluster name:           hacluster
QNetd host:             charlie:5403
Connect timeout:        8000ms
HB interval:            8000ms
VQ vote timer interval: 5000ms
TLS:                    Supported
Algorithm:              Fifty-Fifty split
Tie-breaker:            Node with lowest node ID
KAP Tie-breaker:        Enabled
Poll timer running:     Yes (cast vote)
State:                  Connected
TLS active:             Yes (client certificate sent)
Connected since:        [...]
Echo reply received:    [...]
Example 3: Showing the status of QNetd
> sudo crm corosync status qnetd
1 alice member
2 bob member

Cluster "hacluster":
    Algorithm:          Fifty-Fifty split (KAP Tie-breaker)
    Tie-breaker:        Node with lowest node ID
    Node ID 1:
        Client address:         ::ffff:192.168.1.185:45676
        HB interval:            8000ms
        Configured node list:   1, 2
        Ring ID:                1.e
        Membership node list:   1, 2
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         ::ffff:192.168.1.168:55034
        HB interval:            8000ms
        Configured node list:   1, 2
        Ring ID:                1.e
        Membership node list:   1, 2
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   No change (ACK)

10 Testing the cluster

The following tests can help you identify basic issues with the cluster setup. However, realistic tests involve specific use cases and scenarios. Before using the cluster in a production environment, test it thoroughly according to your use cases.

10.1 Testing resource failover

Check whether the cluster moves resources to another node if the current node is set to standby. This procedure uses example nodes called alice and bob, and a virtual IP resource called admin-ip with the example IP address 192.168.1.10.

  1. Open two terminals.

  2. In the first terminal, ping the virtual IP address:

    > ping 192.168.1.10
  3. In the second terminal, log in to one of the cluster nodes.

  4. Check which node the virtual IP address is running on:

    > sudo crm status
    [..]
    Node List:
      * Online: [ alice bob ]
    
    Full List of Resources:
      * admin-ip  (ocf:heartbeat:IPaddr2):    Started alice
  5. Put alice into standby mode:

    > sudo crm node standby alice
  6. Check the cluster status again. The resource admin-ip should have migrated to bob:

    > sudo crm status
    [...]
    Node List:
      * Node alice: standby
      * Online: [ bob ]
    
    Full List of Resources:
      * admin-ip  (ocf:heartbeat:IPaddr2):    Started bob
  7. In the first terminal, you should see an uninterrupted flow of pings to the virtual IP address during the migration. This shows that the cluster setup and the floating IP address work correctly.

  8. Cancel the ping command with Ctrl+C.

  9. In the second terminal, bring alice back online:

    > sudo crm node online alice

10.2 Testing cluster failures

The crm cluster crash_test command simulates cluster failures and reports the results.

The command supports the following checks:

--fence-node NODE

Fences a specific node passed from the command line.

--kill-sbd / --kill-corosync / --kill-pacemakerd

Kills the daemons for SBD, Corosync, or Pacemaker. After running one of these tests, you can find a report in the directory /var/lib/crmsh/crash_test/. The report includes a test case description, action logging, and an explanation of possible results.

--split-brain-iptables

Simulates a split-brain scenario by blocking the Corosync port, and checks whether one node can be fenced as expected. You must install iptables before you can run this test.

For more information, run the crm cluster crash_test --help command.

This example uses nodes called alice and bob, and tests fencing bob. To watch bob change status during the test, you can log in to Hawk and navigate to Status › Nodes, or run crm status from another node.

Example 4: Manually triggering node fencing
admin@alice> sudo crm cluster crash_test --fence-node bob

==============================================
Testcase:          Fence node bob
Fence action:      reboot
Fence timeout:     95

!!! WARNING WARNING WARNING !!!
THIS CASE MAY LEAD TO NODE BE FENCED.
TYPE Yes TO CONTINUE, OTHER INPUTS WILL CANCEL THIS CASE [Yes/No](No): Yes
INFO: Trying to fence node "bob"
INFO: Waiting 95s for node "bob" reboot...
INFO: Node "bob" will be fenced by "alice"!
INFO: Node "bob" was fenced by "alice" at DATE TIME

11 Next steps

This guide describes a basic High Availability cluster that can be used for testing purposes. To expand this cluster for use in a production environment, the following additional steps are recommended:

Adding more nodes

Add more nodes to the cluster using the crm cluster join script.

Enabling a hardware watchdog

Before using the cluster in a production environment, replace softdog with a hardware watchdog.

Adding more fencing devices

For critical workloads, we highly recommend having two or three fencing devices, using either physical devices or disk-based SBD.

HA glossary

active/active, active/passive

How resources run on the nodes. Active/passive means that resources only run on the active node, but can move to the passive node if the active node fails. Active/active means that all nodes are active at once, and resources can run on (and move to) any node in the cluster.

arbitrator

An arbitrator is a machine running outside the cluster to provide an additional instance for cluster calculations. For example, QNetd provides a vote to help QDevice participate in quorum decisions.

CIB (cluster information base)

An XML representation of the whole cluster configuration and status (cluster options, nodes, resources, constraints and the relationships to each other). The CIB manager (pacemaker-based) keeps the CIB synchronized across the cluster and handles requests to modify it.

clone

A clone is an identical copy of an existing node, used to make deploying multiple nodes simpler.

In the context of a cluster resource, a clone is a resource that can be active on multiple nodes. Any resource can be cloned if its resource agent supports it.

cluster

A high-availability cluster is a group of servers (physical or virtual) designed primarily to secure the highest possible availability of data, applications and services. Not to be confused with a high-performance cluster, which shares the application load to achieve faster results.

Cluster logical volume manager (Cluster LVM)

The term Cluster LVM indicates that LVM is being used in a cluster environment. This requires configuration adjustments to protect the LVM metadata on shared storage.

cluster partition

A cluster partition occurs when communication fails between one or more nodes and the rest of the cluster. The nodes are split into partitions but are still active. They can only communicate with nodes in the same partition and are unaware of the separated nodes. This is known as a split brain scenario.

cluster stack

The ensemble of software technologies and components that make up a cluster.

colocation constraint

A type of resource constraint that specifies which resources can or cannot run together on a node.

concurrency violation

A resource that should be running on only one node in the cluster is running on several nodes.

Corosync

Corosync provides reliable messaging, membership and quorum information about the cluster. This is handled by the Corosync Cluster Engine, a group communication system.

CRM (cluster resource manager)

The management entity responsible for coordinating all non-local interactions in a High Availability cluster. SUSE Linux Enterprise High Availability uses Pacemaker as the CRM. It interacts with several components: local executors on its own node and on the other nodes, non-local CRMs, administrative commands, the fencing functionality, and the membership layer.

crmsh (CRM Shell)

The command-line utility crmsh manages the cluster, nodes and resources.

Csync2

A synchronization tool for replicating configuration files across all nodes in the cluster.

DC (designated coordinator)

The pacemaker-controld daemon is the cluster controller, which coordinates all actions. This daemon has an instance on each cluster node, but only one instance is elected to act as the DC. The DC is elected when the cluster services start, or if the current DC fails or leaves the cluster. The DC decides whether a cluster-wide change must be performed, such as fencing a node or moving resources.

disaster

An unexpected interruption of critical infrastructure caused by nature, humans, hardware failure, or software bugs.

disaster recovery

The process by which a function is restored to the normal, steady state after a disaster.

Disaster Recovery Plan

A strategy to recover from a disaster with the minimum impact on IT infrastructure.

DLM (Distributed Lock Manager)

DLM coordinates access to shared resources in a cluster, for example, managing file locking in clustered file systems to increase performance and availability.

DRBD

DRBD® is a block device designed for building High Availability clusters. It replicates data on a primary device to secondary devices in a way that ensures all copies of the data remain identical.

existing cluster

The term existing cluster is used to refer to any cluster that consists of at least one node. An existing cluster has a basic Corosync configuration that defines the communication channels, but does not necessarily have resource configuration yet.

failover

Occurs when a resource or node fails on one machine and the affected resources move to another node.

failover domain

A named subset of cluster nodes that are eligible to run a resource if a node fails.

fencing

Prevents access to a shared resource by isolated or failing cluster members. There are two classes of fencing: resource-level fencing and node-level fencing. Resource-level fencing ensures exclusive access to a resource. Node-level fencing prevents a failed node from accessing shared resources and prevents resources from running on a node with an uncertain status. This is usually done by resetting or powering off the node.

GFS2

Global File System 2 (GFS2) is a shared disk file system for Linux computer clusters. GFS2 allows all nodes to have direct concurrent access to the same shared block storage. GFS2 has no disconnected operating mode, and no client or server roles. All nodes in a GFS2 cluster function as peers. GFS2 supports up to 32 cluster nodes. Using GFS2 in a cluster requires hardware to allow access to the shared storage, and a lock manager to control access to the storage.

group

Resource groups contain multiple resources that need to be located together, started sequentially and stopped in the reverse order.

Hawk (HA Web Konsole)

A user-friendly Web-based interface for monitoring and administering a High Availability cluster from Linux or non-Linux machines. Hawk can be accessed from any machine that can connect to the cluster nodes, using a graphical Web browser.

heuristics

QDevice supports using a set of commands (heuristics) that run locally on start-up of cluster services, cluster membership change, successful connection to the QNetd server, or optionally at regular times. The result is used in calculations to determine which partition should have quorum.

knet (kronosnet)

A network abstraction layer supporting redundancy, security, fault tolerance, and fast fail-over of network links. In SUSE Linux Enterprise High Availability 16, knet is the default transport protocol for the Corosync communication channels.

local cluster

A single cluster in one location (for example, all nodes are located in one data center). Network latency is minimal. Storage is typically accessed synchronously by all nodes.

local executor

The local executor is located between Pacemaker and the resources on each node. Through the pacemaker-execd daemon, Pacemaker can start, stop and monitor resources.

location

In the context of a whole cluster, location can refer to the physical location of nodes (for example, all nodes might be located in the same data center). In the context of a location constraint, location refers to the nodes on which a resource can or cannot run.

location constraint

A type of resource constraint that defines the nodes on which a resource can or cannot run.

meta attributes (resource options)

Parameters that tell the CRM (cluster resource manager) how to treat a specific resource. For example, you might define a resource's priority or target role.

metro cluster

A single cluster that can stretch over multiple buildings or data centers, with all sites connected by Fibre Channel. Network latency is usually low. Storage is frequently replicated using mirroring or synchronous replication.

network device bonding

Network device bonding combines two or more network interfaces into a single bonded device to increase bandwidth and/or provide redundancy. When using Corosync, the bonded device is not managed by the cluster software. Therefore, the bonded device must be configured on every cluster node that might need to access it.

node

Any server (physical or virtual) that is a member of a cluster.

order constraint

A type of resource constraint that defines the sequence of actions.

Pacemaker

Pacemaker is the CRM (cluster resource manager) in SUSE Linux Enterprise High Availability, or the brain that reacts to events occurring in the cluster. Events might be nodes that join or leave the cluster, failure of resources, or scheduled activities such as maintenance, for example. The pacemakerd daemon launches and monitors all other related daemons.

parameters (instance attributes)

Parameters determine which instance of a service the resource controls.

primitive

A primitive resource is the most basic type of cluster resource.

promotable clone

Promotable clones are a special type of clone resource that can be promoted. Active instances of these resources are divided into two states: promoted and unpromoted (also known as active and passive or primary and secondary).

QDevice

QDevice and QNetd participate in quorum decisions. The corosync-qdevice daemon runs on each cluster node and communicates with QNetd to provide a configurable number of votes, allowing a cluster to sustain more node failures than the standard quorum rules allow.

QNetd

QNetd is an arbitrator that runs outside the cluster. The corosync-qnetd daemon provides a vote to the corosync-qdevice daemon on each node to help it participate in quorum decisions.

quorum

A cluster partition is defined to have quorum (be quorate) if it has the majority of nodes (or votes). Quorum distinguishes exactly one partition. This is part of the algorithm to prevent several disconnected partitions or nodes (split brain) from proceeding and causing data and service corruption. Quorum is a prerequisite for fencing, which then ensures that quorum is unique.

RA (resource agent)

A script acting as a proxy to manage a resource (for example, to start, stop or monitor a resource). SUSE Linux Enterprise High Availability supports different kinds of resource agents.

ReaR (Relax and Recover)

An administrator tool set for creating disaster recovery images.

resource

Any type of service or application that is known to Pacemaker, for example, an IP address, a file system, or a database. The term resource is also used for DRBD, where it names a set of block devices that use a common connection for replication.

resource constraint

Resource constraints specify which cluster nodes resources can run on, what order resources load in, and what other resources a specific resource is dependent on.

See also colocation constraint, location constraint and order constraint.
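
As an illustration of the three constraint types in crmsh syntax, the following sketch assumes a hypothetical web server resource (web) in addition to the admin-ip resource used in this guide:

> sudo crm configure location loc-web-on-alice web 100: alice
> sudo crm configure colocation col-web-with-ip inf: web admin-ip
> sudo crm configure order ord-ip-before-web Mandatory: admin-ip web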

resource set

As an alternative format for defining location, colocation or order constraints, you can use resource sets, where primitives are grouped together in one set. When creating a constraint, you can specify multiple resources for the constraint to apply to.

resource template

To help create many resources with similar configurations, you can define a resource template. After being defined, it can be referenced in primitives or in certain types of constraints. If a template is referenced in a primitive, the primitive inherits all operations, instance attributes (parameters), meta attributes and utilization attributes defined in the template.

SBD (STONITH Block Device)

SBD provides a node fencing mechanism through the exchange of messages via shared block storage. Alternatively, it can be used in diskless mode. In either case, it needs a hardware or software watchdog on each node to ensure that misbehaving nodes are really stopped.

scheduler

The scheduler is implemented as pacemaker-schedulerd. When a cluster transition is needed, pacemaker-schedulerd calculates the expected next state of the cluster and determines what actions need to be scheduled to achieve the next state.

split brain

A scenario in which the cluster nodes are divided into two or more groups that do not know about each other (either through a software or hardware failure). STONITH prevents a split-brain scenario from badly affecting the entire cluster. Also known as a partitioned cluster scenario.

The term split brain is also used in DRBD but means that the nodes contain different data.

SPOF (single point of failure)

Any component of a cluster that, if it fails, triggers the failure of the entire cluster.

STONITH

Another term for the fencing mechanism that shuts down a misbehaving node to prevent it from causing trouble in a cluster. In a Pacemaker cluster, node fencing is managed by the fencing subsystem pacemaker-fenced.

switchover

The planned moving of resources to other nodes in a cluster. See also failover.

utilization

Tells the CRM what capacity a certain resource requires from a node.

watchdog

SBD (STONITH Block Device) needs a watchdog on each node to ensure that misbehaving nodes are really stopped. SBD feeds the watchdog by regularly writing a service pulse to it. If SBD stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as becoming stuck on an I/O error.