etcd is a crucial component of Kubernetes: the etcd cluster stores the entire Kubernetes cluster state, which means critical configuration data, specifications, and the statuses of running workloads. It also serves as the backend for service discovery. Chapter 13, Backup and Restore with Velero explains how to use Velero to back up, restore, and migrate data. However, the Kubernetes cluster needs to be accessible for Velero to operate, and a cluster can become inaccessible for many reasons, for example when all of its master nodes are lost. It is therefore important to periodically back up the etcd cluster data.
This chapter describes the backup of etcd cluster data running on master nodes of SUSE CaaS Platform.
Create backup directories on external storage.
BACKUP_DIR=CaaSP_Backup_`date +%Y%m%d%H%M%S`
mkdir /${BACKUP_DIR}
Copy the following files/folders into the backup directory (see the example after this list):
The skuba command-line binary: the version used for the running cluster. It is needed to replace nodes in the cluster.
The cluster definition folder: Directory created during bootstrap holding the cluster certificates and configuration.
The etcd cluster database: Holds all non-persistent cluster data.
Can be used to recover master nodes. Please refer to the next section for steps to create an etcd cluster database backup.
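For example, assuming the skuba binary is installed at /usr/bin/skuba and the cluster definition folder created during bootstrap is ~/my-cluster (both paths are examples; adjust them to your environment):
# Paths below are examples; adjust them to your environment
cp /usr/bin/skuba /${BACKUP_DIR}/
cp -r ~/my-cluster /${BACKUP_DIR}/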
(Optional) Make backup directory into a compressed file, and remove the original backup directory.
tar cfv ${BACKUP_DIR}.tgz /${BACKUP_DIR}
rm -rf /${BACKUP_DIR}
Mount the external storage device to all master nodes. This is only required if the following step uses a local hostPath as volume storage.
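For illustration only, the following sketch assumes an NFS export <NFS_SERVER>:/srv/backups and the mount point /mnt/backup on each master node (the server, export path, and mount point are all assumptions):
# Example: mount an assumed NFS export on a master node
sudo mkdir -p /mnt/backup
sudo mount -t nfs <NFS_SERVER>:/srv/backups /mnt/backup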
Create backup.
Find the size of the database to be backed up:
ls -sh /var/lib/etcd/member/snap/db
The backup size depends on the cluster. Ensure that the backup location has sufficient space available: more than the size of the database snapshot file.
You should also have a rotation method to clean up unneeded snapshots over time (see the example below).
If there is insufficient space during backup, the backup pods will fail to reach the Running state and "no space left on device" errors will appear in the pod logs.
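A minimal rotation sketch, assuming snapshots are written to <STORAGE_MOUNT_POINT>/etcd_snapshot and you only want to keep the last 7 days of snapshots (the path and retention period are assumptions):
# Remove snapshot files older than 7 days; adjust path and retention to your setup
find <STORAGE_MOUNT_POINT>/etcd_snapshot -name 'etcd-snapshot-*.db' -mtime +7 -delete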
The below example manifest shows a binding to a local hostPath.
We strongly recommend using other storage methods instead.
Modify the script example
Replace <STORAGE_MOUNT_POINT> with the directory in which to store the backup.
The directory must exist on every node in the cluster.
Replace <IN_CLUSTER_ETCD_IMAGE> with the etcd image used in the cluster.
This can be retrieved by accessing any one of the nodes in the cluster and running:
grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'
Create a backup deployment
Run the following script:
ETCD_SNAPSHOT="<STORAGE_MOUNT_POINT>/etcd_snapshot"
ETCD_IMAGE="<IN_CLUSTER_ETCD_IMAGE>"
MANIFEST="etcd-backup.yaml"
cat << EOF > ${MANIFEST}
apiVersion: batch/v1
kind: Job
metadata:
  name: etcd-backup
  namespace: kube-system
  labels:
    jobgroup: backup
spec:
  template:
    metadata:
      name: etcd-backup
      labels:
        jobgroup: backup
    spec:
      containers:
      - name: etcd-backup
        image: ${ETCD_IMAGE}
        env:
        - name: ETCDCTL_API
          value: "3"
        command: ["/bin/sh"]
        args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-\$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
        volumeMounts:
        - mountPath: /etc/kubernetes/pki/etcd
          name: etcd-certs
          readOnly: true
        - mountPath: /backup
          name: etcd-backup
      restartPolicy: OnFailure
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        operator: Exists
      hostNetwork: true
      volumes:
      - name: etcd-certs
        hostPath:
          path: /etc/kubernetes/pki/etcd
          type: DirectoryOrCreate
      - name: etcd-backup
        hostPath:
          path: ${ETCD_SNAPSHOT}
          type: Directory
EOF
kubectl create -f ${MANIFEST}
If you are using a local hostPath and not a shared storage device, the etcd backup will be created on any one of the master nodes.
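To confirm the backup Job finished before relying on the snapshot, you can, for example, wait for its completion (the 300 second timeout is an arbitrary value):
kubectl wait --namespace kube-system --for=condition=complete --timeout=300s job/etcd-backup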
To find the node associated with each etcd backup run:
kubectl get pods --namespace kube-system --selector=job-name=etcd-backup -o wide
Mount the external storage device to all master nodes.
This is only required if the following step uses a local hostPath as volume storage.
Create a CronJob.
Find the size of the database to be backed up
The backup size depends on the cluster. Ensure that the backup location has sufficient space available: more than the size of the database snapshot file.
You should also have a rotation method to clean up unneeded snapshots over time.
If there is insufficient space during backup, the backup pods will fail to reach the Running state and "no space left on device" errors will appear in the pod logs.
The below example manifest shows a binding to a local hostPath.
We strongly recommend using other storage methods instead.
ls -sh /var/lib/etcd/member/snap/db
Modify the script example
Replace <STORAGE_MOUNT_POINT> with the directory in which to store the backup. The directory must exist on every node in the cluster.
Replace <IN_CLUSTER_ETCD_IMAGE> with the etcd image used in the cluster.
This can be retrieved by accessing any one of the nodes in the cluster and running:
grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'
Create a backup schedule deployment
Run the following script:
ETCD_SNAPSHOT="<STORAGE_MOUNT_POINT>/etcd_snapshot"
ETCD_IMAGE="<IN_CLUSTER_ETCD_IMAGE>"
# SCHEDULE in Cron format. https://crontab.guru/
SCHEDULE="0 1 * * *"
# *_HISTORY_LIMIT is the maximum number of finished jobs to keep in the cluster.
SUCCESS_HISTORY_LIMIT="3"
FAILED_HISTORY_LIMIT="3"
MANIFEST="etcd-backup.yaml"
cat << EOF > ${MANIFEST}
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  startingDeadlineSeconds: 100
  schedule: "${SCHEDULE}"
  successfulJobsHistoryLimit: ${SUCCESS_HISTORY_LIMIT}
  failedJobsHistoryLimit: ${FAILED_HISTORY_LIMIT}
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcd-backup
            image: ${ETCD_IMAGE}
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-\$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: etcd-backup
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - effect: NoSchedule
            operator: Exists
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: etcd-backup
            # hostPath is only one of the available volume types. We suggest setting up external storage instead.
            hostPath:
              path: ${ETCD_SNAPSHOT}
              type: Directory
EOF
kubectl create -f ${MANIFEST}
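If you do not want to wait for the first scheduled run to verify the CronJob, you can, for example, trigger a one-off Job from it manually (the Job name etcd-backup-test is an arbitrary example):
kubectl create job --namespace kube-system --from=cronjob/etcd-backup etcd-backup-test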
This chapter describes how to recover SUSE CaaS Platform master nodes.
Remove the failed master node with skuba.
Replace <NODE_NAME> with the name of the failed master node.
skuba node remove <NODE_NAME>
Delete the failed master node from known_hosts.
Replace <NODE_IP> with the failed master node's IP address.
sed -i "/<NODE_IP>/d" known_hosts
Prepare a new instance.
Use skuba to join the new master node prepared in step 3.
Replace <NODE_IP> with the new master node IP address.
Replace <NODE_NAME> with the new master node name.
Replace <USER_NAME> with the user name.
skuba node join --role=master --user=<USER_NAME> --sudo --target <NODE_IP> <NODE_NAME>
The cluster versions used for backup and restore must be the same; cross-version restoration is likely to encounter data/API compatibility issues.
You only need to restore the database on one of the master nodes (master-0) to regain control-plane access.
etcd will sync the database to all master nodes in the cluster once restored.
This does not mean, however, that the nodes will automatically be added back to the cluster.
You must join one master node to the cluster, restore the database and then continue adding your remaining master nodes (which then will sync automatically).
Do the following on master-0. Remote restore is not supported.
Install one of the required software packages (etcdctl, Docker or Podman).
Etcdctl:
sudo zypper install etcdctl
Docker:
sudo zypper install docker
sudo systemctl start docker
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
sudo docker pull ${ETCD_IMAGE}
Podman:
sudo zypper install podman
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
sudo podman pull ${ETCD_IMAGE}
Make sure you have access to the etcd snapshot on the backup device.
Stop etcd on all master nodes.
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
You can verify that the etcd container no longer exists with crictl ps | grep etcd
Purge etcd data on all master nodes.
sudo rm -rf /var/lib/etcd
Login to master-0 via SSH.
Restore etcd data.
Replace <SNAPSHOT_DIR> with the directory containing the etcd snapshot,
for example: /share/backup/etcd_snapshot
Replace <SNAPSHOT> with the name of the etcd snapshot,
for example: etcd-snapshot-2019-11-08_05:19:20_GMT.db
Replace <NODE_NAME> with the master-0 cluster node name,
for example: skuba-master-1
Replace <NODE_IP> with the master-0 cluster node IP address.
The <NODE_IP> must be visible from inside the node:
ip addr | grep <NODE_IP>
The <NODE_NAME> and <NODE_IP> must appear after --initial-cluster in /etc/kubernetes/manifests/etcd.yaml (see the check below).
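For example, you can inspect the current --initial-cluster entry with a simple grep (use the copy moved to /tmp/ if you have already stopped etcd in the step above):
grep initial-cluster /tmp/etcd.yaml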
Etcdctl:
SNAPSHOT="<SNAPSHOT_DIR>/<SNAPSHOT>"
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo ETCDCTL_API=3 etcdctl snapshot restore ${SNAPSHOT} \
  --data-dir /var/lib/etcd \
  --name ${NODE_NAME} \
  --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380 \
  --initial-advertise-peer-urls https://${NODE_IP}:2380
Docker:
SNAPSHOT="<SNAPSHOT>"
SNAPSHOT_DIR="<SNAPSHOT_DIR>"
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo docker run \
  -v ${SNAPSHOT_DIR}:/etcd_snapshot \
  -v /var/lib:/var/lib \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl snapshot restore /etcd_snapshot/${SNAPSHOT} \
  --data-dir /var/lib/etcd \
  --name ${NODE_NAME} \
  --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380 \
  --initial-advertise-peer-urls https://${NODE_IP}:2380"
Podman:
SNAPSHOT="<SNAPSHOT>"
SNAPSHOT_DIR="<SNAPSHOT_DIR>"
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo podman run \
  -v ${SNAPSHOT_DIR}:/etcd_snapshot \
  -v /var/lib:/var/lib \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl snapshot restore /etcd_snapshot/${SNAPSHOT} \
  --data-dir /var/lib/etcd \
  --name ${NODE_NAME} \
  --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380 \
  --initial-advertise-peer-urls https://${NODE_IP}:2380"
Start etcd on master-0.
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
You should now see master-0 in the etcd cluster member list.
Replace <ENDPOINT_IP> with the master-0 cluster node IP address.
Etcdctl:
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
Docker:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
sudo docker run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
Podman:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
sudo podman run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
Add another master node to the etcd cluster member list.
Replace <NODE_NAME> with the cluster node name,
for example: skuba-master-1
Replace <ENDPOINT_IP> with the master-0 cluster node IP address.
Replace <NODE_IP> with the cluster node IP address.
The <NODE_IP> must be visible from inside the node:
ip addr | grep <NODE_IP>
The <NODE_NAME> and <NODE_IP> must appear after --initial-cluster in /etc/kubernetes/manifests/etcd.yaml
Nodes must be restored in sequence.
Etcdctl:
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380
Docker:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo docker run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380"
Podman:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
ENDPOINT=<ENDPOINT_IP>
NODE_NAME="<NODE_NAME>"
NODE_IP="<NODE_IP>"
sudo podman run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380"
Login to the node in step 7 via SSH.
Start etcd.
cp /tmp/etcd.yaml /etc/kubernetes/manifests/
Repeat steps 7, 8, and 9 to recover all remaining master nodes.
After restoring, run the commands below to confirm the procedure succeeded. A successful restoration shows all master nodes as started in the etcd member list, and all Kubernetes nodes with STATUS Ready.
Etcdctl:
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list

# EXAMPLE
116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
Docker:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
# Replace <ENDPOINT_IP> with `master-0` cluster node IP address.
ENDPOINT=<ENDPOINT_IP>
sudo docker run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
# EXAMPLE
116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
Podman:
ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
# Replace <ENDPOINT_IP> with `master-0` cluster node IP address.
ENDPOINT=<ENDPOINT_IP>
sudo podman run \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --network host \
  --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
  ETCDCTL_API=3 etcdctl \
  --endpoints=https://${ENDPOINT}:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
# EXAMPLE
116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
Kubectl:
kubectl get nodes

# EXAMPLE
NAME                     STATUS   ROLES    AGE      VERSION
caasp-master-cluster-0   Ready    master   28m      v1.16.2
caasp-master-cluster-1   Ready    master   20m      v1.16.2
caasp-master-cluster-2   Ready    master   12m      v1.16.2
caasp-worker-cluster-0   Ready    <none>   36m36s   v1.16.2