We currently support deployment on the following platforms:
SUSE OpenStack Cloud 8
VMware ESXi 6.7
KVM
Bare Metal x86_64
Amazon Web Services (technology preview)
SUSE CaaS Platform itself is based on SLE 15 SP2.
The steps for obtaining the correct installation image for each platform type are detailed in the respective platform deployment instructions.
SUSE CaaS Platform consists of a number of (virtual) machines that run as a cluster.
You will need at least two machines:
1 master node
1 worker node
SUSE CaaS Platform 4.5.2 supports deployments with a single or multiple master nodes. Production environments must be deployed with multiple master nodes for resilience.
All communication to the cluster is done through a load balancer talking to the respective nodes. For that reason, any fault tolerant environment must provide at least two load balancers for incoming communication.
The minimum viable fault tolerant production environment configuration consists of:
3 master nodes
2 worker nodes
All cluster nodes must be dedicated (virtual) machines reserved for the purpose of running SUSE CaaS Platform.
Fault tolerant load balancing solution
(for example SUSE Linux Enterprise High Availability Extension with pacemaker and haproxy)
1 management workstation
In order to deploy and control a SUSE CaaS Platform cluster you will need at least one machine capable of running skuba. This is typically a regular desktop workstation or laptop running SLE 15 SP2 or later.
The skuba CLI package is available from the SUSE CaaS Platform module.
You will need a valid SUSE Linux Enterprise and SUSE CaaS Platform subscription to install this tool on the workstation.
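As an illustration only, on a registered SLE 15 SP2 machine this typically boils down to activating the SUSE CaaS Platform extension and installing the package; the product identifier shown here is an assumption and may differ in your environment:
SUSEConnect -p caasp/4.5/x86_64 -r <CAASP_REGISTRATION_CODE>
zypper in skuba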
It is vital that the management workstation runs an NTP client and that time synchronization is configured against the same NTP servers which you will later use to synchronize the cluster nodes.
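A minimal sketch of such a setup, assuming chrony (the default time service on SLE 15) and a hypothetical internal time server ntp1.example.com:
zypper in -y chrony
echo "server ntp1.example.com iburst" >> /etc/chrony.conf
systemctl enable --now chronyd
chronyc sources
The last command lets you verify that the configured time servers are actually reachable and selected.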
The storage sizes in the following lists are absolute minimum configurations.
Sizing of the storage for worker nodes depends largely on the expected amount of container images, their size and change rate.
The basic operating system for all nodes might also include snapshots (when using btrfs) that can quickly fill up existing space.
We recommend provisioning a separate storage partition for container images on each (worker) node that can be adjusted in size when needed.
Storage for /var/lib/containers on the worker nodes should be approximately 50GB in addition to the base OS storage.
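A minimal sketch of provisioning such a partition on a worker node, assuming a hypothetical dedicated disk /dev/vdb:
mkfs.xfs /dev/vdb
mkdir -p /var/lib/containers
echo "/dev/vdb /var/lib/containers xfs defaults 0 0" >> /etc/fstab
mount /var/lib/containers
Using a dedicated volume like this allows the container image storage to be grown later without touching the root filesystem.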
Up to 5 worker nodes (minimum):
Storage: 50 GB+
(v)CPU: 2
RAM: 4 GB
Network: Minimum 1Gb/s (faster is preferred)
Up to 10 worker nodes:
Storage: 50 GB+
(v)CPU: 2
RAM: 8 GB
Network: Minimum 1Gb/s (faster is preferred)
Up to 100 worker nodes:
Storage: 50 GB+
(v)CPU: 4
RAM: 16 GB
Network: Minimum 1Gb/s (faster is preferred)
Up to 250 worker nodes:
Storage: 50 GB+
(v)CPU: 8
RAM: 16 GB
Network: Minimum 1Gb/s (faster is preferred)
Using a minimum of 2 (v)CPUs is a hard requirement; deploying a cluster with fewer processing units is not possible.
The worker nodes must have sufficient memory, CPU and disk space for the Pods/containers/applications that are planned to be hosted on these workers.
A worker node requires the following resources:
CPU cores: 1.250
RAM: 1.2 GB
Based on these values, the minimal configuration of a worker node is:
Storage: Depending on workloads, minimum 20-30 GB to hold the base OS and required packages. Mount additional storage volumes as needed.
(v)CPU: 2
RAM: 2 GB
Network: Minimum 1Gb/s (faster is preferred)
Calculate the number of required (v)CPUs by adding up the base requirements, the estimated additional essential cluster components (logging agent, monitoring agent, configuration management, etc.) and the estimated CPU workloads:
1.250 (base requirements) + 0.250 (estimated additional cluster components) + estimated workload CPU requirements
Calculate the size of the RAM using a similar formula:
1.2 GB (base requirements) + 500 MB (estimated additional cluster components) + estimated workload RAM requirements
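For example, for a worker node that is expected to run workloads needing 2 (v)CPUs and 6 GB of RAM (hypothetical figures), the calculation would be:
1.250 + 0.250 + 2 = 3.5 (v)CPUs, rounded up to 4
1.2 GB + 500 MB + 6 GB = 7.7 GB RAM, rounded up to 8 GB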
These values are provided as a guide to work in most cases. They may vary based on the type of the running workloads.
For master nodes you must ensure storage performance of at least 50 sequential IOPS, rising to 500 sequential IOPS for heavily loaded clusters, with disk bandwidth depending on your cluster size. It is highly recommended to use SSDs.
"Typically 50 sequential IOPS (for example, a 7200 RPM disk) is required. For heavily loaded clusters, 500 sequential IOPS (for example, a typical local SSD or a high performance virtualized block device) is recommended."
"Typically 10MB/s will recover 100MB data within 15 seconds. For large clusters, 100MB/s or higher is suggested for recovering 1GB data within 15 seconds."
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#disks
This is extremely important to ensure proper functioning of etcd, a critical component.
You can validate these requirements beforehand by using fio. This tool simulates etcd I/O (input/output) and lets you determine from the output statistics whether or not the storage is suitable.
Install the tool:
zypper in -y fio
Run the test:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-etcd-dir --size=22m --bs=2300 --name=test-etcd-io
Replace test-etcd-dir with a directory located on the same disk that will hold the incoming etcd data under /var/lib/etcd.
From the outputs, the interesting part is fsync/fdatasync/sync_file_range where the values are expressed in microseconds (usec). A disk is considered sufficient when the value of the 99.00th percentile is below 10000usec (10ms).
Be careful though: this benchmark is for etcd only and does not take other disk usage into consideration. This means that a value only slightly under 10ms should be treated with caution, as other workloads will have an impact on the disks.
If the storage is very slow, the values may be reported directly in milliseconds.
Let’s see two different examples:
[...]
fsync/fdatasync/sync_file_range:
sync (usec): min=251, max=1894, avg=377.78, stdev=69.89
sync percentiles (usec):
| 1.00th=[ 273], 5.00th=[ 285], 10.00th=[ 297], 20.00th=[ 330],
| 30.00th=[ 343], 40.00th=[ 355], 50.00th=[ 367], 60.00th=[ 379],
| 70.00th=[ 400], 80.00th=[ 424], 90.00th=[ 465], 95.00th=[ 506],
| 99.00th=[ 594], 99.50th=[ 635], 99.90th=[ 725], 99.95th=[ 742],
| 99.99th=[ 1188]
[...]
Here we get a 99.00th percentile value of 594usec (about 0.6ms), so the storage meets the requirements.
[...]
fsync/fdatasync/sync_file_range:
sync (msec): min=10, max=124, avg=17.62, stdev= 3.38
sync percentiles (usec):
| 1.00th=[11731], 5.00th=[11994], 10.00th=[12911], 20.00th=[16712],
| 30.00th=[17695], 40.00th=[17695], 50.00th=[17695], 60.00th=[17957],
| 70.00th=[17957], 80.00th=[17957], 90.00th=[19530], 95.00th=[22676],
| 99.00th=[28705], 99.50th=[30016], 99.90th=[41681], 99.95th=[59507],
| 99.99th=[89654]
[...]
Here we get a 99.00th percentile value of 28705usec (about 29ms), so the storage clearly does not meet the requirements.
The management workstation needs at least the following networking permissions:
SSH access to all machines in the cluster
Access to the Kubernetes API server on port 6443 (exposed by the load balancer), which will in turn talk to any master in the cluster
Access to Dex on the configured NodePort (port 32000, exposed by the load balancer) so that kubectl can request a new token using the refresh token when the OIDC token has expired
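A quick way to sanity-check these permissions from the management workstation is sketched below; the user name, node address and load balancer address are hypothetical placeholders:
ssh sles@192.168.0.10 true
curl -k https://lb.example.com:6443/healthz
curl -k https://lb.example.com:32000/
Any TLS response from the last two commands (even an authentication error) confirms that the ports are reachable through the load balancer.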
It is good security practice not to expose the Kubernetes API server on the public internet. Use network firewalls that only allow access from trusted subnets.
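As a sketch only, assuming firewalld and a hypothetical trusted administration subnet 10.1.0.0/24, limiting API server access could look like this on the load balancer host:
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.1.0.0/24" port port="6443" protocol="tcp" accept'
firewall-cmd --reload
Adapt the zone and subnet to your own network layout, or implement equivalent rules in whatever firewall solution you use.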
The service subnet and pod subnet must not overlap.
Please plan the size of the networks generously for your expected workloads before bootstrapping.
The default pod subnet is 10.244.0.0/16. It allows for 65536 IP addresses overall.
By default, each node is assigned a /24 CIDR (254 usable IP addresses per node).
The default node allocation of /24 means a hard cluster node limit of 256 since this is the number of /24 ranges that fit in a /16 range.
Depending on the size of the nodes you are planning to use (in terms of resources), or on the number of nodes you are planning to have, the per-node CIDR can be adjusted to be bigger, but the cluster will then accommodate fewer nodes overall. For example, assigning a /23 per node doubles the usable pod addresses per node to 510 but halves the maximum number of nodes to 128.
If you are planning to use more or less pods per node or have a higher number of nodes, you can adjust these settings to match your requirements. Please make sure that the networks are suitably sized to adjust to future changes in the cluster.
You can also adjust the service subnet size; it must not overlap with the pod CIDR and it should be big enough to accommodate all services.
For more advanced network requirements please refer to: https://docs.cilium.io/en/v1.6/concepts/ipam/#address-management
| Node | Port | Protocol | Accessibility | Description |
|---|---|---|---|---|
| All nodes | 22 | TCP | Internal | SSH (required in public clouds) |
| | 4240 | TCP | Internal | Cilium health check |
| | 8472 | UDP | Internal | Cilium VXLAN |
| | 10250 | TCP | Internal | Kubelet (API server → kubelet communication) |
| | 10256 | TCP | Internal | kube-proxy health check |
| | 30000 - 32767 | TCP + UDP | Internal | Range of ports used by Kubernetes when allocating services of type NodePort |
| | 32000 | TCP | External | Dex (OpenID Connect) |
| | 32001 | TCP | External | Gangway (RBAC Authenticate) |
| Masters | 2379 | TCP | Internal | etcd (client communication) |
| | 2380 | TCP | Internal | etcd (server-to-server traffic) |
| | 6443 | TCP | Internal / External | Kubernetes API server |
Using IPv6 addresses is currently not supported.
All nodes must be assigned static IPv4 addresses, which must not be changed manually afterwards.
Plan carefully for required IP ranges and future scenarios as it is not possible to reconfigure the IP ranges once the deployment is complete.
The Kubernetes networking model requires that your nodes have IP forwarding enabled in the kernel.
skuba checks this value when installing your cluster and installs a rule in /etc/sysctl.d/90-skuba-net-ipv4-ip-forward.conf to make it persistent.
Other software can potentially install rules with higher priority that override this value, causing machines to not behave as expected after rebooting.
You can manually check if this is enabled using the following command:
# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
net.ipv4.ip_forward must be set to 1. Additionally, you can check in what order persisted rules are processed by running sysctl --system -a.
Besides the SUSE-provided packages and containers, SUSE CaaS Platform is typically used with third-party containers and charts.
The following SUSE provided resources must be available:
| URL | Name | Purpose |
|---|---|---|
| scc.suse.com | SUSE Customer Center | Allow registration and license activation |
| registry.suse.com | SUSE container registry | Provide container images |
| *.cloudfront.net | Cloudfront | CDN/distribution backend for registry.suse.com |
| kubernetes-charts.suse.com | SUSE helm charts repository | Provide helm charts |
| updates.suse.com | SUSE package update channel | Provide package updates |
If you wish to use upstream / third-party resources, please also allow the following:
| URL | Name | Purpose |
|---|---|---|
| k8s.gcr.io | Google Container Registry | Provide container images |
| kubernetes-charts.storage.googleapis.com | Google Helm charts repository | Provide helm charts |
| docker.io | Docker Container Registry | Provide container images |
| quay.io | Red Hat Container Registry | Provide container images |
Please note that not all installation scenarios will need all of these resources.
If you are deploying into an air gap scenario, you must ensure that the resources required from these locations are present and available on your internal mirror server.
Please make sure that all your Kubernetes components can communicate with each other. This might require the configuration of routing when using multiple network adapters per node.
Refer to: https://v1-18.docs.kubernetes.io/docs/setup/independent/install-kubeadm/#check-network-adapters.
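As a hedged example, if cluster-internal traffic is supposed to use a second adapter eth1 towards a hypothetical subnet 192.168.100.0/24, a node may need an explicit route such as:
ip route add 192.168.100.0/24 dev eth1
Make sure such routes are also configured persistently in the network configuration of the node so they survive reboots.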
Configure firewall and other network security to allow communication on the default ports required by Kubernetes: https://v1-18.docs.kubernetes.io/docs/setup/independent/install-kubeadm/#check-required-ports
All master nodes of the cluster must have a minimum 1Gb/s network connection to fulfill the requirements for etcd.
"1GbE is sufficient for common etcd deployments. For large etcd clusters, a 10GbE network will reduce mean time to recovery."
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#network
Do not grant unauthorized personnel access to the kubeconfig file or to any workstation configured with it. In the current state, it grants full administrative access to the cluster.
Authentication is done via the kubeconfig file generated during deployment. This file will grant full access to the cluster and all workloads. Apply best practices for access control to workstations configured to administer the SUSE CaaS Platform cluster.
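At the very least, keep the file readable only by the administrative user on the workstation; the path below is just an example location:
chmod 600 ~/.kube/config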
SUSE CaaS Platform leverages Kubernetes role-based access control (RBAC) and needs an external authentication server such as LDAP, Active Directory, or similar to validate the user's identity and to grant different user roles or cluster role permissions.
Some add-on services need to be highly available. These services require enough cluster nodes to be available to run replicas of their services.
When the cluster is deployed with enough nodes to satisfy the replica sizes, the replicas will be distributed evenly across the cluster.
For clusters deployed with fewer nodes than the default replica sizes, services will still try to find a suitable node to run on. However, it is likely that several replicas will end up running on the same nodes, defeating the purpose of high availability.
You can check the deployment replica size after node bootstrap. The number of cluster nodes should be equal to or greater than the DESIRED replica size.
kubectl get rs -n kube-system
After deployment, if the number of healthy nodes falls below the number required for fulfilling the replica sizing, service replicas will remain in the Pending state until either the unhealthy node recovers or a new node is joined to the cluster.
The following describes two methods for replica management if you wish to work with a cluster below the default replica size requirement.
One method is to update the overall number of replicas created by a service. Please consult the documentation of the respective service for the replica limits required for proper high availability. If the replica number is too high for the cluster, you must increase the cluster size to provide more resources.
Update the deployment replica size before joining nodes.
You can use the same steps to increase the replica size again if more resources become available later on.
kubectl -n kube-system scale --replicas=<DESIRED_REPLICAS> deployment <NAME>
Join new nodes.
When multiple replicas are running on the same node, you will want to redistribute them manually to ensure proper high availability.
Find the pod for re-distribution.
Check the NAME and NODE columns for duplicated pods.
kubectl -n kube-system get pod -o wide
Delete the duplicated pod. This will trigger the creation of a replacement pod.
kubectl -n kube-system delete pod <POD_NAME>