etcd is a crucial component of Kubernetes: the etcd cluster stores the entire Kubernetes cluster state, which means critical configuration data, specifications, and the statuses of running workloads. It also serves as the backend for service discovery. Chapter 13, Backup and Restore with Velero explains how to use Velero to back up, restore, and migrate data. However, the Kubernetes cluster needs to be accessible for Velero to operate, and a cluster can become inaccessible for many reasons, for example when all of its master nodes are lost. It is therefore important to periodically back up the etcd cluster data.
This chapter describes the backup of etcd cluster data running on master nodes of SUSE CaaS Platform.
Create backup directories on external storage.
BACKUP_DIR=CaaSP_Backup_`date +%Y%m%d%H%M%S`
mkdir /${BACKUP_DIR}
Copy the following files/folders into the backup directory (see the example after this list):
The skuba command-line binary: the version used for the running cluster. It is needed to replace nodes in the cluster.
The cluster definition folder: Directory created during bootstrap holding the cluster certificates and configuration.
The etcd cluster database: Holds all non-persistent cluster data.
Can be used to recover master nodes. Please refer to the next section for steps to create an etcd cluster database backup.
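For example, assuming the skuba binary is installed at /usr/bin/skuba and the cluster definition folder created during bootstrap is ~/my-cluster (both paths are examples; adjust them to your environment):
# Paths below are examples; adjust them to your environment
cp /usr/bin/skuba /${BACKUP_DIR}/
cp -r ~/my-cluster /${BACKUP_DIR}/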
(Optional) Make backup directory into a compressed file, and remove the original backup directory.
tar cfv ${BACKUP_DIR}.tgz /${BACKUP_DIR}
rm -rf /${BACKUP_DIR}
Mount the external storage device to all master nodes. This is only required if the following step uses a local hostPath as volume storage.
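For illustration only, the following sketch assumes an NFS export <NFS_SERVER>:/srv/backups and the mount point /mnt/backup on each master node (the server, export path, and mount point are all assumptions):
# Example: mount an assumed NFS export on a master node
sudo mkdir -p /mnt/backup
sudo mount -t nfs <NFS_SERVER>:/srv/backups /mnt/backup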
Create backup.
Find the size of the database to be backed up:
ls -sh /var/lib/etcd/member/snap/db
The backup size depends on the cluster. Ensure that the backup location has sufficient space available: more than the size of the database snapshot file.
You should also have a rotation method to clean up unneeded snapshots over time (see the example below).
If there is insufficient space during backup, the backup pods will fail to reach the Running state and "no space left on device" errors will appear in the pod logs.
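A minimal rotation sketch, assuming snapshots are written to <STORAGE_MOUNT_POINT>/etcd_snapshot and you only want to keep the last 7 days of snapshots (the path and retention period are assumptions):
# Remove snapshot files older than 7 days; adjust path and retention to your setup
find <STORAGE_MOUNT_POINT>/etcd_snapshot -name 'etcd-snapshot-*.db' -mtime +7 -delete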
The below example manifest shows a binding to a local hostPath.
We strongly recommend using other storage methods instead.
Modify the script example
Replace <STORAGE_MOUNT_POINT> with the directory in which to store the backup.
The directory must exist on every node in the cluster.
Replace <IN_CLUSTER_ETCD_IMAGE> with the etcd image used in the cluster.
This can be retrieved by accessing any one of the nodes in the cluster and running:
grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'
Create a backup deployment
Run the following script:
ETCD_SNAPSHOT="<STORAGE_MOUNT_POINT>/etcd_snapshot"
ETCD_IMAGE="<IN_CLUSTER_ETCD_IMAGE>"
MANIFEST="etcd-backup.yaml"
cat << EOF > ${MANIFEST}
apiVersion: batch/v1
kind: Job
metadata:
  name: etcd-backup
  namespace: kube-system
  labels:
    jobgroup: backup
spec:
  template:
    metadata:
      name: etcd-backup
      labels:
        jobgroup: backup
    spec:
      containers:
      - name: etcd-backup
        image: ${ETCD_IMAGE}
        env:
        - name: ETCDCTL_API
          value: "3"
        command: ["/bin/sh"]
        args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-\$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
        volumeMounts:
        - mountPath: /etc/kubernetes/pki/etcd
          name: etcd-certs
          readOnly: true
        - mountPath: /backup
          name: etcd-backup
      restartPolicy: OnFailure
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        operator: Exists
      hostNetwork: true
      volumes:
      - name: etcd-certs
        hostPath:
          path: /etc/kubernetes/pki/etcd
          type: DirectoryOrCreate
      - name: etcd-backup
        hostPath:
          path: ${ETCD_SNAPSHOT}
          type: Directory
EOF
kubectl create -f ${MANIFEST}
If you are using a local hostPath and not a shared storage device, the etcd backup will be created on any one of the master nodes.
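To confirm the backup Job finished before relying on the snapshot, you can, for example, wait for its completion (the 300 second timeout is an arbitrary value):
kubectl wait --namespace kube-system --for=condition=complete --timeout=300s job/etcd-backup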
To find the node associated with each etcd backup run:
kubectl get pods --namespace kube-system --selector=job-name=etcd-backup -o wide
Mount the external storage device to all master nodes.
This is only required if the following step uses a local hostPath as volume storage.
Create a CronJob.
Find the size of the database to be backed up
The backup size depends on the cluster. Ensure that the backup location has sufficient space available: more than the size of the database snapshot file.
You should also have a rotation method to clean up unneeded snapshots over time.
If there is insufficient space during backup, the backup pods will fail to reach the Running state and "no space left on device" errors will appear in the pod logs.
The below example manifest shows a binding to a local hostPath.
We strongly recommend using other storage methods instead.
ls -sh /var/lib/etcd/member/snap/db
Modify the script example
Replace <STORAGE_MOUNT_POINT> with the directory in which to store the backup. The directory must exist on every node in the cluster.
Replace <IN_CLUSTER_ETCD_IMAGE> with the etcd image used in the cluster.
This can be retrieved by accessing any one of the nodes in the cluster and running:
grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'
Create a backup schedule deployment
Run the following script:
ETCD_SNAPSHOT="<STORAGE_MOUNT_POINT>/etcd_snapshot"
ETCD_IMAGE="<IN_CLUSTER_ETCD_IMAGE>"
# SCHEDULE in Cron format. https://crontab.guru/
SCHEDULE="0 1 * * *"
# *_HISTORY_LIMIT is the maximum number of finished jobs to keep in the cluster.
SUCCESS_HISTORY_LIMIT="3"
FAILED_HISTORY_LIMIT="3"
MANIFEST="etcd-backup.yaml"
cat << EOF > ${MANIFEST}
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  startingDeadlineSeconds: 100
  schedule: "${SCHEDULE}"
  successfulJobsHistoryLimit: ${SUCCESS_HISTORY_LIMIT}
  failedJobsHistoryLimit: ${FAILED_HISTORY_LIMIT}
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcd-backup
            image: ${ETCD_IMAGE}
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-\$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: etcd-backup
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - effect: NoSchedule
            operator: Exists
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: etcd-backup
            # hostPath is only one of the available volume types. We suggest setting up external storage instead.
            hostPath:
              path: ${ETCD_SNAPSHOT}
              type: Directory
EOF
kubectl create -f ${MANIFEST}
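If you do not want to wait for the first scheduled run to verify the CronJob, you can, for example, trigger a one-off Job from it manually (the Job name etcd-backup-test is an arbitrary example):
kubectl create job --namespace kube-system --from=cronjob/etcd-backup etcd-backup-test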
This chapter describes how to recover SUSE CaaS Platform master nodes.
Remove the failed master node with skuba.
Replace <NODE_NAME> with the name of the failed master node.
skuba node remove <NODE_NAME>
Delete the failed master node from known_hosts.
Replace <NODE_IP> with the failed master node's IP address.
sed -i "/<NODE_IP>/d" known_hosts
Prepare a new instance.
Use skuba to join the new master node prepared in step 3.
Replace <NODE_IP> with the new master node IP address.
Replace <NODE_NAME> with the new master node name.
Replace <USER_NAME> with the user name.
skuba node join --role=master --user=<USER_NAME> --sudo --target <NODE_IP> <NODE_NAME>
The cluster versions used for backup and restore must be the same; cross-version restoration is likely to encounter data/API compatibility issues.
You only need to restore the database on one of the master nodes (master-0) to regain control-plane access.
etcd will sync the database to all master nodes in the cluster once restored.
This does not mean, however, that the nodes will automatically be added back to the cluster.
You must join one master node to the cluster, restore the database and then continue adding your remaining master nodes (which then will sync automatically).
Do the following on master-0. Remote restore is not supported.
Install one of the required software packages (etcdctl, Docker or Podman).
Etcdctl:
sudo zypper install etcdctl
Docker:
sudo zypper install docker
sudo systemctl start docker
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
sudo docker pull ${ETCD_IMAGE}
Podman:
sudo zypper install podman
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
sudo podman pull ${ETCD_IMAGE}
Make sure you have access to the etcd snapshot on the backup device.
Stop etcd on all master nodes.
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
You can verify that the etcd container no longer exists with crictl ps | grep etcd
Purge etcd data on all master nodes.
sudo rm -rf /var/lib/etcd
Login to master-0 via SSH.
Restore etcd data.
Replace <SNAPSHOT_DIR> with the directory containing the etcd snapshot,
for example: /share/backup/etcd_snapshot
Replace <SNAPSHOT> with the name of the etcd snapshot,
for example: etcd-snapshot-2019-11-08_05:19:20_GMT.db
Replace <NODE_NAME> with the master-0 cluster node name,
for example: skuba-master-1
Replace <NODE_IP> with the master-0 cluster node IP address.
The <NODE_IP> must be visible from inside the node:
ip addr | grep <NODE_IP>
The <NODE_NAME> and <NODE_IP> must appear after --initial-cluster in /etc/kubernetes/manifests/etcd.yaml (see the check below).
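For example, you can inspect the current --initial-cluster entry with a simple grep (use the copy moved to /tmp/ if you have already stopped etcd in the step above):
grep initial-cluster /tmp/etcd.yaml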
Etcdctl:
SNAPSHOT="<SNAPSHOT_DIR>/<SNAPSHOT>"
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo ETCDCTL_API=3 etcdctl snapshot restore ${SNAPSHOT} \
  --data-dir /var/lib/etcd \
  --name ${NODE_NAME} \
  --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380 \
  --initial-advertise-peer-urls https://${NODE_IP}:2380
Docker:
SNAPSHOT="<SNAPSHOT>"
SNAPSHOT_DIR="<SNAPSHOT_DIR>"
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo docker run \
  -v ${SNAPSHOT_DIR}:/etcd_snapshot \
  -v /var/lib:/var/lib \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl snapshot restore /etcd_snapshot/${SNAPSHOT} \
  --data-dir /var/lib/etcd \
  --name ${NODE_NAME} \
  --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380 \
  --initial-advertise-peer-urls https://${NODE_IP}:2380"
Podman:
SNAPSHOT="<SNAPSHOT>"
SNAPSHOT_DIR="<SNAPSHOT_DIR>"
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo podman run \
  -v ${SNAPSHOT_DIR}:/etcd_snapshot \
  -v /var/lib:/var/lib \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl snapshot restore /etcd_snapshot/${SNAPSHOT} \
  --data-dir /var/lib/etcd \
  --name ${NODE_NAME} \
  --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380 \
  --initial-advertise-peer-urls https://${NODE_IP}:2380"
Start etcd on master-0.
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
You should now see master-0 in the etcd cluster member list.
Replace <ENDPOINT_IP> with the master-0 cluster node IP address.
Etcdctl:
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
Docker:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
sudo docker run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
Podman:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
sudo podman run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
Add another master node to the etcd cluster member list.
Replace <NODE_NAME> with the cluster node name,
for example: skuba-master-1
Replace <ENDPOINT_IP> with the master-0 cluster node IP address.
Replace <NODE_IP> with the cluster node IP address.
The <NODE_IP> must be visible from inside the node:
ip addr | grep <NODE_IP>
The <NODE_NAME> and <NODE_IP> must appear after --initial-cluster in /etc/kubernetes/manifests/etcd.yaml
Nodes must be restored in sequence.
Etcdctl:
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380
Docker:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo docker run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380"
Podman:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo podman run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380"
Login to the node in step 7 via SSH.
Start etcd.
cp /tmp/etcd.yaml /etc/kubernetes/manifests/
Repeat steps 7, 8, and 9 to recover all remaining master nodes.
After restoring, run the commands below to confirm the procedure succeeded. A successful restoration shows all master nodes as started in the etcd member list, and all Kubernetes nodes with STATUS Ready.
Etcdctl:
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list

# EXAMPLE
116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
Docker:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
# Replace <ENDPOINT_IP> with `master-0` cluster node IP address.
ENDPOINT=<ENDPOINT_IP>
sudo docker run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
# EXAMPLE
116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
Podman:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
# Replace <ENDPOINT_IP> with `master-0` cluster node IP address.
ENDPOINT=<ENDPOINT_IP>
sudo podman run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
# EXAMPLE
116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
Kubectl:
kubectl get nodes

# EXAMPLE
NAME                     STATUS   ROLES    AGE      VERSION
caasp-master-cluster-0   Ready    master   28m      v1.16.2
caasp-master-cluster-1   Ready    master   20m      v1.16.2
caasp-master-cluster-2   Ready    master   12m      v1.16.2
caasp-worker-cluster-0   Ready    <none>   36m36s   v1.16.2