Cluster management refers to several processes in the life cycle of a cluster and
its individual nodes: bootstrapping, joining and removing nodes.
For maximum automation and ease of use, SUSE CaaS Platform uses the skuba tool,
which simplifies Kubernetes cluster creation and reconfiguration.
To perform many of these steps, you must have set up the proper SSH keys for accessing the nodes
and enabled passwordless sudo on them. If you have followed the standard
deployment procedures, this should already be the case.
Please note: If you are using a different management workstation than the one you used during the initial deployment, you might have to transfer the SSH identities from the original management workstation.
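For example, assuming the default identity file ~/.ssh/id_rsa was used during deployment (adjust paths and host names to your environment), the key can be copied to the new workstation and loaded into its ssh-agent:
# On the original management workstation (example path and host):
scp ~/.ssh/id_rsa ~/.ssh/id_rsa.pub <NEW_WORKSTATION>:~/.ssh/
# On the new management workstation, load the identity into the ssh-agent used by skuba:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa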
Bootstrapping the cluster is the initial process of starting up a minimal viable cluster and joining the first master node. Only the first master node needs to be bootstrapped; later nodes can simply be joined as described in Section 2.3, “Adding Nodes”.
Before bootstrapping any nodes to the cluster,
you need to create an initial cluster definition folder (initialize the cluster).
This is done using skuba cluster init and its --control-plane flag.
For a step-by-step guide on how to initialize the cluster, configure updates using kured,
and subsequently bootstrap nodes to it, refer to the SUSE CaaS Platform Deployment Guide.
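The following is only a brief sketch of that procedure with placeholder values; the Deployment Guide remains the authoritative reference:
# Initialize the cluster definition folder, pointing --control-plane at your load balancer:
skuba cluster init --control-plane <LB_IP/FQDN> my-cluster
cd my-cluster
# Bootstrap the first master node:
skuba node bootstrap --user sles --sudo --target <IP/FQDN> <MASTER_NODE_NAME>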
Once you have added the first master node to the cluster using skuba node bootstrap,
use the skuba node join command to add more nodes. Joining master or worker nodes to
an existing cluster should be done sequentially, meaning the nodes have to be added
one after another rather than in parallel.
skuba node join --role <MASTER/WORKER> --user <USER_NAME> --sudo --target <IP/FQDN> <NODE_NAME>
The mandatory flags for the join command are --role, --user, --sudo, and --target.
--role specifies whether the node is a master or a worker.
--sudo is for running the command with superuser privileges,
which is necessary for all node operations.
<USER_NAME> is the name of the user that exists on your SLES machine (default: sles).
--target <IP/FQDN> is the IP address or FQDN of the relevant machine.
<NODE_NAME> is how you decide to name the node you are adding.
New master nodes that you did not initially include in your Terraform configuration have to be added manually to your load balancer’s configuration.
To add a new worker node, you would run something like:
skuba node join --role worker --user sles --sudo --target 10.86.2.164 worker1
If you are using a virtual machine template for creating new cluster nodes, you must make sure that, before joining the cloned machine to the cluster, it is updated to the same software versions as the other nodes in the cluster.
Refer to Section 4.1, “Update Requirements”.
Nodes with mismatching package or container software versions might not be fully functional.
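For example, to compare the relevant software versions between existing nodes and the cloned machine before joining it, you can inspect the kubelet and container runtime versions reported by Kubernetes and, on the nodes themselves, the installed packages (the search strings below are examples and may vary with your setup):
kubectl get nodes -o wide
# On each node, compare the installed Kubernetes and CRI-O packages:
zypper se -s --installed-only kubernetes cri-o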
If you wish to remove a node temporarily, the recommended approach is to first drain the node.
When you want to bring the node back, you only have to uncordon it.
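A minimal sketch of such a temporary removal, assuming the node runs DaemonSet-managed pods (which require the --ignore-daemonsets flag so they do not block the drain):
kubectl drain <NODE_NAME> --ignore-daemonsets
# ... perform the maintenance on the node ...
kubectl uncordon <NODE_NAME>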
The skuba node remove command serves to permanently remove nodes.
Running this command will work even if the target virtual machine is down,
so it is the safest way to remove the node.
Nodes removed with this method cannot be added back to the cluster or any other skuba-initiated cluster. You must reinstall the entire node and then join it again to the cluster.
skuba node remove <NODE_NAME> [flags]
By default, node removal has an unlimited timeout when waiting for the node to drain.
If the node is unreachable it cannot be drained, and the removal will fail or hang indefinitely.
You can specify a time after which the removal is performed without waiting for the node to
drain by using the flag --drain-timeout <DURATION>.
For example, waiting for the node to drain for 1 minute and 5 seconds:
skuba node remove caasp-worker1 --drain-timeout 1m5s
For a list of supported time formats run skuba node remove -h.
After the removal of a master node, you have to manually delete its entries from your load balancer’s configuration.
To reconfigure a node, for example to change the node’s role from worker to master, you will need to use a combination of commands, as shown in the following steps and the example below.
Run skuba node remove <NODE_NAME>.
Reinstall the node from scratch.
Run skuba node join --role <DESIRED_ROLE> --user <USER_NAME> --sudo --target <IP/FQDN> <NODE_NAME>.
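As an illustrative example with hypothetical values, turning the former worker node at 10.86.2.164 into a master after reinstalling it (remember to also add the new master to your load balancer’s configuration):
skuba node remove worker1
# Reinstall the machine from scratch, then join it with the new role:
skuba node join --role master --user sles --sudo --target 10.86.2.164 master04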
The cordon and uncordon commands respectively mark a node as unschedulable or schedulable, that is, whether the node is allowed to receive any new workloads.
This can be useful when troubleshooting a node.
To mark a node as unschedulable run:
kubectl cordon <NODE_NAME>
To mark a node as schedulable run:
kubectl uncordon <NODE_NAME>
Draining a node consists of evicting all the running pods from the current node in order to perform maintenance.
This is a mandatory step to ensure the proper functioning of the workloads.
This is achieved using kubectl.
To drain a node run:
kubectl drain <NODE_NAME>
This action will also implicitly cordon the node. Therefore, once the maintenance is done, uncordon the node to set it back to schedulable.
Refer to the official Kubernetes documentation for more information: https://v1-18.docs.kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service
In some scenarios, like maintenance windows in your data center or certain disaster scenarios, you will want to shut down the cluster in a controlled fashion and later bring it back up safely. Follow these instructions to safely stop all workloads.
This document is only concerned with shutting down the SUSE CaaS Platform cluster itself.
Any real-time data streaming workloads will lose data if not rerouted to an alternative cluster.
Any workloads that hold data only in memory will lose this data. Please check with the provider of your workload/application about proper data persistence in case of shutdown.
Any external storage services must be stopped/started separately. Please refer to the respective storage solution’s documentation.
Create a backup of your cluster.
Scale all applications down to zero by using either the manifests or deployment names:
Do not scale down cluster services.
kubectl scale --replicas=0 -f deployment.yaml
or
kubectl scale deploy my-deployment --replicas=0
Drain/cordon all worker nodes.
kubectl drain <NODE_NAME>
Run kubectl get nodes and make sure all your worker nodes have the status Ready,SchedulingDisabled.
Proceed to shut down all your worker nodes at the machine level.
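How you power the machines off depends on your infrastructure. One possible approach, assuming SSH access with the deployment user, is to shut each worker down at the operating system level:
# Example only; adjust user and address to your environment:
ssh sles@<WORKER_IP/FQDN> sudo shutdown -h now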
Now you need to find out where the etcd leader is running; this is going to be the last node to shut down.
Find out which pods are running etcd:
$ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
etcd-master-pimp-general-00 1/1 Running 0 23m 10.84.73.114 master-pimp-general-00 <none> <none>
...
Then you need to get the list of active etcd members; this will also show which master node is currently the etcd leader.
Either run a terminal session on one of the etcd pods:
kubectl exec -ti -n kube-system etcd-master01 -- sh
# Now run this command
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
or directly execute the command on the pod:
kubectl exec -ti -n kube-system etcd-master01 -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
The output will be the same. Note the boolean values at the end of each line.
The current etcd leader will have true.
In this case the node master02 is the current etcd leader.
356ebc35f3e8b25, started, master02, https://172.28.0.16:2380, https://172.28.0.16:2379, true
bdef0dced3caa0d4, started, master01, https://172.28.0.15:2380, https://172.28.0.15:2379, false
f9ae57d57b369ede, started, master03, https://172.28.0.21:2380, https://172.28.0.21:2379, false
Shut down all other master nodes, leaving the current etcd leader for last.
Finally, shut down the etcd leader node.
This is the first node that needs to be started back up.
To start up your cluster again, first start your etcd leader and wait until it reports the status Ready, like this:
skuba cluster status
NAME STATUS ROLE OS-IMAGE KERNEL-VERSION KUBELET-VERSION CONTAINER-RUNTIME HAS-UPDATES HAS-DISRUPTIVE-UPDATES CAASP-RELEASE-VERSION
master01 NotReady master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master02 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master03 NotReady master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker01 NotReady,SchedulingDisabled <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker02 NotReady,SchedulingDisabled <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
Start the rest of the master nodes, and wait for them to become Ready:
skuba cluster status
NAME STATUS ROLE OS-IMAGE KERNEL-VERSION KUBELET-VERSION CONTAINER-RUNTIME HAS-UPDATES HAS-DISRUPTIVE-UPDATES CAASP-RELEASE-VERSION
master01 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master02 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master03 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker01 NotReady,SchedulingDisabled <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker02 NotReady,SchedulingDisabled <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
Start all the worker nodes and wait until you see them in status Ready,SchedulingDisabled:
skuba cluster status
NAME STATUS ROLE OS-IMAGE KERNEL-VERSION KUBELET-VERSION CONTAINER-RUNTIME HAS-UPDATES HAS-DISRUPTIVE-UPDATES CAASP-RELEASE-VERSION
master01 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master02 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master03 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker01 Ready,SchedulingDisabled <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker02 Ready,SchedulingDisabled <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
Run the command kubectl uncordon <WORKER-NODE> for each of the worker nodes. Your cluster status should now be completely Ready:
skuba cluster status
NAME STATUS ROLE OS-IMAGE KERNEL-VERSION KUBELET-VERSION CONTAINER-RUNTIME HAS-UPDATES HAS-DISRUPTIVE-UPDATES CAASP-RELEASE-VERSION
master01 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master02 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
master03 Ready master SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker01 Ready <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
worker02 Ready <none> SUSE Linux Enterprise Server 15 SP2 5.3.18-24.15-default v1.18.6 cri-o://1.18.2 yes yes 4.5
Bring back all your processes by scaling them up again:
kubectl scale --replicas=N -f deployment.yaml
or
kubectl scale deploy my-deployment --replicas=N
Replace N with the number of replicas you want running.
Verify that all of your workloads and applications have resumed operation properly.
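For example, you can check that all pods are back in the Running or Completed state and that your deployments report the desired number of replicas:
kubectl get pods --all-namespaces
kubectl get deployments --all-namespaces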