Introduction to SUSE Linux Enterprise High Availability
- WHAT?
This article explains the capabilities, architecture, core concepts and benefits of SUSE Linux Enterprise High Availability.
- WHY?
Learn about SUSE Linux Enterprise High Availability to help you decide whether to use it, or before you set up a cluster for the first time.
- EFFORT
Approximately 20 minutes of reading time.
1 What is SUSE Linux Enterprise High Availability?
SUSE® Linux Enterprise High Availability is an integrated suite of open source clustering technologies. A high-availability cluster is a group of servers (nodes) that work together to ensure the highest possible availability of data, applications and services. If one node fails, the resources that were running on it move to another node with little or no downtime. You can also manually move resources between nodes for load balancing, or to perform maintenance tasks with minimal downtime. Cluster resources can include Web sites, e-mail servers, databases, file systems, virtual machines and any other server-based applications or services that must be available to users at all times.
1.1 Product availability
SUSE Linux Enterprise High Availability is available with the following products:
- SUSE Linux Enterprise Server
High Availability is available as an extension. This requires an additional registration code.
- SUSE Linux Enterprise Server for SAP Applications
High Availability is included as a module. No additional registration code is required.
1.2 Capabilities and benefits
SUSE Linux Enterprise High Availability eliminates single points of failure to improve the availability and manageability of critical resources. This helps you maintain business continuity, protect data integrity, and reduce unplanned downtime for critical workloads.
- Multiple clustering options
SUSE Linux Enterprise High Availability clusters can be configured in different ways:
Local clusters: a single cluster in one location (for example, all nodes are located in one data center). Network latency is minimal. Storage is typically accessed synchronously by all nodes.
Metro clusters (“stretched” clusters): a single cluster that can stretch over multiple buildings or data centers, with all sites connected by Fibre Channel. Network latency is usually low. Storage is frequently replicated using mirroring or synchronous replication.
Hybrid clusters: virtual servers can be clustered together with physical servers. This improves service availability and resource usage.
Important: No support for mixed architectures. Clusters with mixed architectures are not supported. All nodes in the same cluster must have the same processor platform: AMD64/Intel 64, IBM Z, or POWER.
- Flexible and scalable resource management
SUSE Linux Enterprise High Availability supports clusters of up to 32 nodes. Resources can automatically move to another node if the current node fails, or they can be moved manually to troubleshoot hardware or balance the workload. Resources can also be configured to move back to repaired nodes at a specific time. The cluster can stop and start resources based on configurable rules.
- Wide range of resource agents
The cluster manages resources via resource agents (RAs). SUSE Linux Enterprise High Availability supports many different resource agents that are designed to manage specific types of applications or services, including Apache, IPv4, IPv6, NFS and many more. It also ships with resource agents for third-party applications such as IBM WebSphere Application Server.
- Storage and data replication
SUSE Linux Enterprise High Availability supports Fibre Channel or iSCSI storage area networks (SANs), allowing you to dynamically assign and reassign server storage as needed. It also comes with GFS2 and the cluster Logical Volume Manager (Cluster LVM). For data replication, use DRBD* to mirror a resource's data from the active node to a standby node.
- Support for virtualized environments
SUSE Linux Enterprise High Availability supports the mixed clustering of both physical and virtual Linux servers. The cluster can recognize and manage resources running in virtual servers and in physical servers, and can also manage KVM virtual machines as resources.
- Disaster recovery
SUSE Linux Enterprise High Availability ships with Relax-and-Recover (ReaR), a disaster recovery framework that helps to back up and restore systems.
- User-friendly administration tools
SUSE Linux Enterprise High Availability includes tools for configuration and administration:
The CRM Shell (crmsh) is a command-line interface for installing and setting up High Availability clusters, configuring resources, and performing monitoring and administration tasks (a short crmsh sketch follows this list).
Hawk is a Web-based graphical interface for monitoring and administration of High Availability clusters. It can be accessed using a Web browser from any Linux or non-Linux machine that can connect to the cluster nodes.
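For illustration, the following crmsh sketch configures a floating IP address and an Apache Web server as cluster resources, using the IPaddr2 and apache resource agents mentioned above. The IP address, configuration file path, monitoring intervals and resource names are placeholders and would be adapted to your environment.

  # Configure a virtual IP address using the IPaddr2 resource agent
  crm configure primitive vip-web ocf:heartbeat:IPaddr2 \
      params ip=192.168.1.100 cidr_netmask=24 \
      op monitor interval=10s

  # Configure an Apache Web server using the apache resource agent
  crm configure primitive web-server ocf:heartbeat:apache \
      params configfile=/etc/apache2/httpd.conf \
      op monitor interval=20s

  # Check the current cluster and resource status
  crm status

The same resources can also be created and monitored from Hawk.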
1.3 How does High Availability work?
The following figure shows a three-node cluster. Each of the nodes has a Web server installed and hosts two Web sites. All the data, graphics and Web page content for each Web site are stored on a shared disk subsystem connected to each of the nodes.
During normal cluster operation, each node is in constant communication with the other nodes in the cluster and periodically checks resources to detect failure.
The following figure shows how the cluster moves resources when Web server 1 fails due to hardware or software problems.
Web site A moves to Web server 2 and Web site B moves to Web server 3. The Web sites continue to be available and are evenly distributed between the remaining cluster nodes.
When Web server 1 failed, the High Availability software performed the following tasks:
- Detected a failure and verified that Web server 1 was really down.
- Remounted the shared data directories that had been mounted on Web server 1 onto Web servers 2 and 3.
- Restarted the applications that had been running on Web server 1 on Web servers 2 and 3.
- Transferred certificates and IP addresses to Web servers 2 and 3.
In this example, failover happened quickly and users regained access to the Web sites within seconds, usually without needing to log in again.
When Web server 1 returns to a normal operating state, Web site A and Web site B can move back (“fail back”) to Web server 1 automatically. This incurs some downtime, so alternatively you can configure resources to only fail back at a specified time that causes minimal service interruption.
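Whether resources stay on the surviving nodes or fail back is controlled by placement scores. A common way to avoid automatic failback is a resource-stickiness default, as in this minimal crmsh sketch; the stickiness value and the resource and node IDs (based on the Web site example above) are placeholders.

  # Prefer keeping resources on their current node instead of failing back
  crm configure rsc_defaults resource-stickiness=100

  # During a maintenance window, move Web site A back to the repaired node
  crm resource move web-site-A web-server-1
  # Remove the temporary location constraint created by the move
  crm resource clear web-site-A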
2 Core concepts
This section explains the core concepts of SUSE Linux Enterprise High Availability.
2.1 Clusters and nodes
A high-availability cluster is a group of servers that work together to ensure the availability of applications or services. Servers that are configured as members of the cluster are called nodes. If one node fails, the resources running on it can move to another node in the cluster. SUSE Linux Enterprise High Availability supports clusters with up to 32 nodes. However, clusters usually fall into one of two categories: two-node clusters, or clusters with an odd number of nodes (typically three or five).
2.2 Communication channels
Internal cluster communication is handled by Corosync. The Corosync Cluster Engine is a group communication system that provides messaging, membership and quorum information about the cluster. In SUSE Linux Enterprise High Availability 16, Corosync uses kronosnet (knet) as the default transport protocol.
We highly recommend configuring at least two communication channels for the cluster. The preferred method is to use network device bonding. If you cannot use a network bond, you can set up a redundant communication channel in Corosync (also known as a second “ring”).
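For illustration, a redundant Corosync configuration with two knet links might look like the following excerpt from /etc/corosync/corosync.conf. The cluster name, node names and addresses are placeholders.

  totem {
      version: 2
      cluster_name: hacluster
      transport: knet
      crypto_cipher: aes256
      crypto_hash: sha256
  }

  nodelist {
      node {
          # first and second (redundant) communication channels
          ring0_addr: 192.168.1.1
          ring1_addr: 10.0.0.1
          name: node1
          nodeid: 1
      }
      node {
          ring0_addr: 192.168.1.2
          ring1_addr: 10.0.0.2
          name: node2
          nodeid: 2
      }
  }

  quorum {
      provider: corosync_votequorum
      two_node: 1
  }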
2.3 Resource management
In a High Availability cluster, the applications and services that need to be highly available are called resources. Cluster resources can include Web sites, e-mail servers, databases, file systems, virtual machines and any other server-based applications or services you want to make available to users at all times. You can start, stop, monitor and move resources as needed. You can also specify whether specific resources should run together on the same node, or start and stop in sequential order. If a cluster node fails, the resources running on it fail over (move) to another node instead of being lost.
In SUSE Linux Enterprise High Availability, the cluster resource manager (CRM) is Pacemaker, which manages and coordinates all cluster services. Pacemaker uses resource agents (RAs) to start, stop and monitor resources. A resource agent abstracts the resource it manages and presents its status to the cluster. SUSE Linux Enterprise High Availability supports many different resource agents that are designed to manage specific types of applications or services.
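For example, a database and the file system it depends on can be tied together with constraints so that they always run on the same node and start in the correct order. A minimal crmsh sketch, assuming hypothetical resource IDs:

  # Keep the database on the same node as its file system
  crm configure colocation db-with-fs inf: database-rsc filesystem-rsc
  # Start the file system before the database; stop in the reverse order
  crm configure order fs-before-db Mandatory: filesystem-rsc database-rsc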
2.4 Node fencing
In a split-brain scenario, cluster nodes are divided into two or more groups (or partitions) that do not know about each other. This might be because of a hardware or software failure, or a failed network connection, for example. A split-brain scenario can be resolved by fencing (resetting or powering off) one or more of the nodes. Node fencing prevents a failed node from accessing shared resources and prevents cluster resources from running on a node with an uncertain status. This helps protect the cluster from data corruption.
SUSE Linux Enterprise High Availability uses STONITH as the node fencing mechanism. To be supported, all SUSE Linux Enterprise High Availability clusters must have at least one STONITH device. For critical workloads, we recommend using two or three STONITH devices. A STONITH device can be either a physical device (a power switch) or a software mechanism (SBD in combination with a watchdog).
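As an illustration of the SBD-based approach, the following sketch initializes a shared disk for SBD and registers it as a STONITH resource. The disk path is a placeholder, and the exact agent and options depend on your setup; a hardware or software watchdog must also be available.

  # Initialize the shared disk partition for SBD (overwrites any existing SBD metadata)
  sbd -d /dev/disk/by-id/example-sbd-disk create

  # /etc/sysconfig/sbd (excerpt): point SBD at the device and a watchdog
  SBD_DEVICE="/dev/disk/by-id/example-sbd-disk"
  SBD_WATCHDOG_DEV="/dev/watchdog"

  # Register SBD as a STONITH resource and make sure fencing is enabled
  crm configure primitive stonith-sbd stonith:external/sbd
  crm configure property stonith-enabled=true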
2.5 Quorum calculation
When communication fails between one or more nodes and the rest of the cluster (a split-brain scenario), a cluster partition occurs. The nodes can only communicate with other nodes in the same partition and are unaware of the separated nodes. A cluster partition has quorum (or is “quorate”) if it has the majority of nodes (or “votes”). This is determined by quorum calculation. Quorum must be calculated so the non-quorate nodes can be fenced.
Corosync calculates quorum based on the following formula:
N ≥ C/2 + 1
N = minimum number of operational nodes
C = number of cluster nodes (C/2 is rounded down)
For example, a five-node cluster needs a minimum of three operational nodes to maintain quorum.
Clusters with an even number of nodes, especially two-node clusters, might have equal numbers of nodes in each partition and therefore no majority. To avoid this situation, you can configure the cluster to use QDevice in combination with QNetd. QNetd is an arbitrator running outside the cluster. It communicates with the QDevice daemon running on the cluster nodes to provide a configurable number of votes for quorum calculation. This lets a cluster sustain more node failures than the standard quorum rules allow.
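For illustration, the QDevice configuration is an additional device section in the quorum block of /etc/corosync/corosync.conf, pointing at the host where QNetd runs. The host name is a placeholder; the vote count and algorithm depend on your design.

  quorum {
      provider: corosync_votequorum
      device {
          votes: 1
          model: net
          net {
              tls: on
              host: qnetd-server.example.com
              algorithm: ffsplit
          }
      }
  }

The current membership and vote information can be checked on any cluster node with corosync-quorumtool -s.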
2.6 Storage and data replication
High Availability clusters might include a shared disk subsystem connected via Fibre Channel or iSCSI. If a node fails, another node in the cluster automatically mounts the shared disk directories that were previously mounted on the failed node. This gives users continuous access to the directories on the shared disk subsystem. Content stored on the shared disk might include data, applications and services.
The following figure shows what a typical Fibre Channel cluster configuration might look like. The green lines depict connections to an Ethernet power switch, which can reboot a node if a ping request fails.
The following figure shows what a typical iSCSI cluster configuration might look like.
Although most clusters include a shared disk subsystem, you can also create a cluster without a shared disk subsystem. The following figure shows what a cluster without a shared disk subsystem might look like.
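When a shared disk subsystem is used, the cluster typically manages the mount itself through the Filesystem resource agent, so that the directory follows the resources during failover. A minimal sketch, assuming a hypothetical shared device, mount point and file system type:

  # Mount a shared disk volume under cluster control
  crm configure primitive shared-data ocf:heartbeat:Filesystem \
      params device=/dev/disk/by-id/example-shared-disk \
      directory=/srv/www/shared fstype=xfs \
      op monitor interval=20s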
3 Architecture overview
This section explains the architecture of a SUSE Linux Enterprise High Availability cluster and how the different components interoperate.
3.1 Architecture layers
SUSE Linux Enterprise High Availability has a layered architecture. The following figure shows the different layers and their associated components.
- Membership and messaging (Corosync)
The Corosync Cluster Engine is a group communication system that provides messaging, membership and quorum information about the cluster.
- Cluster resource manager (Pacemaker)
Pacemaker is the cluster resource manager that reacts to events occurring in the cluster. Events might be nodes that join or leave the cluster, failure of resources, or scheduled activities such as maintenance, for example. The pacemakerd daemon launches and monitors all other related daemons. The following components are also part of the Pacemaker layer:
Cluster Information Database (CIB)
On every node, Pacemaker maintains the cluster information database (CIB). This is an XML representation of the cluster configuration (including cluster options, nodes, resources, constraints and the relationships to each other). The CIB also reflects the current cluster status. The CIB manager (pacemaker-based) keeps the CIB synchronized across the cluster and handles reading and writing cluster configuration and status (a query example follows this list).
Designated Coordinator (DC)
The pacemaker-controld daemon is the cluster controller, which coordinates all actions. This daemon has an instance on each cluster node, but only one instance is elected to act as the DC. The DC is elected when the cluster services start, or if the current DC fails or leaves the cluster. The DC decides whether a cluster-wide change must be performed, such as fencing a node or moving resources.
Scheduler
The scheduler runs on every node as the pacemaker-schedulerd daemon, but is only active on the DC. When a cluster transition is needed, the scheduler calculates the expected next state of the cluster and determines what actions need to be scheduled to achieve that state.
- Local executor
The local executor is located between the Pacemaker layer and the resources layer on each node. The pacemaker-execd daemon allows Pacemaker to start, stop and monitor resources.
- Resources and resource agents
In a High Availability cluster, the services that need to be highly available are called resources. Resource agents (RAs) are scripts that start, stop and monitor cluster resources.
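The CIB and the DC can be inspected directly from any node. The following commands are standard Pacemaker and crmsh tools; the exact output depends on your cluster.

  # Show the overall cluster state, including which node is currently the DC
  crm_mon -1
  # Dump the full CIB as XML (configuration and status sections)
  cibadmin --query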
3.2 Process flow
Many actions performed in the cluster cause a cluster-wide change, for example, adding or removing a cluster resource or changing resource constraints. The following example explains what happens in the cluster when you perform such an action:
You use the CRM Shell or Hawk to add a new cluster resource. You can do this from any node in the cluster. Adding the resource modifies the CIB.
The CIB change is replicated to all cluster nodes.
Based on the information in the CIB, pacemaker-schedulerd calculates the ideal state of the cluster and how it should be achieved. It feeds a list of instructions to the DC.
The DC sends commands via Corosync, which are received by the pacemaker-controld instances on the other nodes.
Each node uses its local executor (pacemaker-execd) to perform resource modifications. The pacemaker-execd daemon is not cluster-aware and interacts directly with resource agents.
All nodes report the results of their operations back to the DC.
If fencing is required, pacemaker-fenced calls the fencing agent to fence the node.
After the DC concludes that all necessary operations have been successfully performed, the cluster returns to an idle state and waits for further events.
If any operation was not carried out as planned, pacemaker-schedulerd is invoked again with the new information recorded in the CIB.
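To make this flow concrete, the following sketch adds a test resource from one node and observes the result from another. The Dummy resource agent is a standard test agent shipped with Pacemaker; the resource ID is a placeholder.

  # On any node: add a test resource; this modifies the CIB
  crm configure primitive test-rsc ocf:pacemaker:Dummy op monitor interval=30s

  # On any other node: the replicated CIB and the scheduled actions are visible
  crm_mon -1
  # Show the resulting configuration of the new resource
  crm configure show test-rsc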
4 Installation options
This section describes the different options for installing a SUSE Linux Enterprise High Availability cluster and includes links to the available installation guides.
The following quick start guides explain how to set up a minimal cluster with default settings (a brief sketch of the bootstrap commands follows the list):
- Installing a Basic Two-Node High Availability Cluster
This quick start guide describes how to set up a basic two-node High Availability cluster with QDevice, diskless SBD and a software watchdog. QDevice is required for this setup so that diskless SBD can handle split-brain scenarios for the two-node cluster.
- Installing a Basic Three-Node High Availability Cluster
This quick start guide describes how to set up a basic three-node High Availability cluster with diskless SBD and a software watchdog. Three nodes are required for this setup so that diskless SBD can handle split-brain scenarios without the help of QDevice.
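Whichever guide you follow, the bootstrap process is driven from crmsh. A minimal interactive sketch, assuming a placeholder node name; SBD, QDevice and watchdog options are covered in the guides themselves.

  # On the first node: initialize a new cluster (answers prompts interactively)
  crm cluster init

  # On each additional node: join the existing cluster
  crm cluster join -c node1

  # Verify that all nodes are online
  crm status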
5 For more information
For more information about SUSE Linux Enterprise High Availability, see the following resources:
- https://clusterlabs.org/
The upstream project for many of the components in SUSE Linux Enterprise High Availability.
- https://www.clusterlabs.org/pacemaker/doc/
Documentation for Pacemaker. For SUSE Linux Enterprise High Availability 16, see the documentation for Pacemaker 3.
6 Legal Notice
Copyright © 2006–2025 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.