Applies to SUSE CaaS Platform 4.5.2

15 Troubleshooting #

15.1 The supportconfig Tool
15.2 Cluster definition directory
15.3 Log collection
15.4 Debugging SLES Nodes provision
15.5 Debugging Cluster Deployment
15.6 Error x509: certificate signed by unknown authority
15.7 Error Invalid client credentials
15.8 Replacing a Lost Node
15.9 Rebooting an Undrained Node with RBD Volumes Mapped
15.10 ETCD Troubleshooting
15.11 Kubernetes debugging tips
15.12 AWS Deployment fails with cannot attach profile error

This chapter summarizes frequent problems that can occur while using SUSE CaaS Platform and their solutions.

Additionally, SUSE support collects problems and their solutions online at https://www.suse.com/support/kb/?id=SUSE_CaaS_Platform .

15.1 The `supportconfig` Tool #

As a first step for any troubleshooting/debugging effort, you need to find out the location of the cause of the problem. For this purpose we ship the supportconfig tool and plugin with SUSE CaaS Platform. With a simple command you can collect and compile a variety of details about your cluster to enable SUSE support to pinpoint the potential cause of an issue.

In case of problems, a detailed system report can be created with the supportconfig command line tool. It will collect information about the system, such as:

Current Kernel version
Hardware information
Installed packages
Partition setup
Cluster and node status

Tip

A full list of of the data collected by supportconfig can be found under https://github.com/SUSE/supportutils-plugin-suse-caasp/blob/master/README.md.

Important

To collect all the relevant logs, run the supportconfig command on all the master and worker nodes individually.

sudo supportconfig
sudo tar -xvJf /var/log/nts_*.txz
cd /var/log/nts*
sudo cat kubernetes.txt crio.txt

The result is a TAR archive of files. Each of the *.txz files should be given a name that can be used to identify which cluster node it was created on.

After opening a Service Request (SR), you can upload the TAR archives to SUSE Global Technical Support.

The data will help to debug the issue you reported and assist you in solving the problem. For details, see https://documentation.suse.com/sles/15-SP2/single-html/SLES-admin/#cha-adm-support.

15.2 Cluster definition directory #

Apart from the logs provided by running the supportconfig tool, an additional set of data might be required for debugging purposes. This information is located at the Management node, under your cluster definition directory. This folder contains important and sensitive information about your SUSE CaaS Platform cluster and it’s the one from where you issue skuba commands.

Warning

If the problem you are facing is related to your production environment, do not upload the admin.conf as this would expose access to your cluster to anyone in possession of the collected information! The same precautions apply for the pki directory, since this also contains sensitive information (CA cert and key).

In this case add --exclude='./<CLUSTER_NAME>/admin.conf' --exclude='./<CLUSTER_NAME>/pki/' to the command in the following example. Make sure to replace ./<CLUSTER_NAME> with the actual path of your cluster definition folder.

If you need to debug issues with your private certificates, a separate call with SUSE support must be scheduled to help you.

Create a TAR archive by compressing the cluster definition directory.

# Read the TIP above
# Move the admin.conf and pki directory to another safe location or exclude from packaging
tar -czvf cluster.tar.gz /home/user/<CLUSTER_NAME>/
# If the error is related to Terraform, please copy the terraform configuration files as well
tar -czvf cluster.tar.gz /home/user/my-terraform-configuration/

After opening a Service Request (SR), you can upload the TAR archive to SUSE Global Technical Support.

15.3 Log collection #

Some of these information are required for debugging certain cases. The data collected via supportconfig in such cases are following:

etcd.txt (master nodes)

curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/health
curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/members
curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/leader
curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/self
curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/store
curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/metrics

etcdcontainer=$(crictl ps --label io.kubernetes.container.name=etcd --quiet)

crictl exec $etcdcontainer sh -c \"ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt' ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt' ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key' ETCDCTL_API=3 etcdctl check perf\"

crictl logs -t $etcdcontainer

crictl stats --id $etcdcontainer

etcdpod=$(crictl ps | grep etcd | awk -F ' ' '{ print $9 }')

crictl inspectp $etcdpod

Note

For more information about etcd, refer to Section 15.10, “ETCD Troubleshooting”.

kubernetes.txt (all nodes)

export KUBECONFIG=/etc/kubernetes/admin.conf

kubectl version

kubectl api-versions

kubectl config view

kubectl -n kube-system get pods

kubectl get events --sort-by=.metadata.creationTimestamp

kubectl get nodes

kubectl get all -A

kubectl get nodes -o yaml

kubernetes-cluster-info.txt (all nodes)

export KUBECONFIG=/etc/kubernetes/admin.conf

# a copy of kubernetes logs /var/log/kubernetes
kubectl cluster-info dump --output-directory="/var/log/kubernetes"

kubelet.txt (all nodes)

systemctl status --full kubelet

journalctl -u kubelet

# a copy of kubernetes manifests /etc/kubernetes/manifests"
cat /var/lib/kubelet/config.yaml

oidc-gangway.txt (all nodes)

container=$(crictl ps --label io.kubernetes.container.name="oidc-gangway" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "oidc-gangway" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

oidc-dex.txt (worker nodes)

container=$(crictl ps --label io.kubernetes.container.name="oidc-dex" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "oidc-dex" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

cilium-agent.txt (all nodes)

container=$(crictl ps --label io.kubernetes.container.name="cilium-agent" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "cilium-agent" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

cilium-operator.txt (only from the worker node is runs)

container=$(crictl ps --label io.kubernetes.container.name="cilium-operator" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "cilium-operator" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

kured.txt (all nodes)

container=$(crictl ps --label io.kubernetes.container.name="kured" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "kured" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

coredns.txt (_worker nodes)

container=$(crictl ps --label io.kubernetes.container.name="coredns" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "coredns" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

kube-apiserver.txt (master nodes)

container=$(crictl ps --label io.kubernetes.container.name="kube-apiserver" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "kube-apiserver" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

kube-proxy.txt (all nodes)

container=$(crictl ps --label io.kubernetes.container.name="kube-proxy" --quiet)

crictl logs -t $container

crictl inspect $container
After skuba 4.2.2
pod=$(crictl ps | grep "kube-proxy" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

kube-scheduler.txt (master nodes)

container=$(crictl ps --label io.kubernetes.container.name="kube-scheduler" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "kube-scheduler" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

kube-controller-manager.txt (master nodes)

container=$(crictl ps --label io.kubernetes.container.name="kube-controller-manager" --quiet)

crictl logs -t $container

crictl inspect $container

pod=$(crictl ps | grep "kube-controller-manager" | awk -F ' ' '{ print $9 }')

crictl inspectp $pod

kube-system.txt (all nodes)

export KUBECONFIG=/etc/kubernetes/admin.conf

kubectl get all -n kube-system -o yaml

crio.txt (all_nodes)

crictl version

systemctl status --full crio.service

crictl info

crictl images

crictl ps --all

crictl stats --all

journalctl -u crio

# a copy of /etc/crictl.yaml

# a copy of /etc/sysconfig/crio

# a copy of every file under /etc/crio/

# Run the following three commands for every container using this loop:
for i in $(crictl  ps -a 2>/dev/null | grep -v "CONTAINER" | awk '{print $1}');
do
    crictl stats --id $i
    crictl logs $i
    crictl inspect $i
done

15.4 Debugging SLES Nodes provision #

If Terraform fails to setup the required SLES infrastructure for your cluster, please provide the configuration you applied in a form of a TAR archive.

Create a TAR archive by compressing the Terraform.

tar -czvf terraform.tar.gz /path/to/terraform/configuration

After opening a Service Request (SR), you can upload the TAR archive to Global Technical Support.

15.5 Debugging Cluster Deployment #

If the cluster deployment fails, please re-run the command again with setting verbosity level to 5 -v=5.

For example, if bootstraps the first master node of the cluster fails, re-run the command like

skuba node bootstrap --user sles --sudo --target <IP/FQDN> <NODE_NAME> -v=5

However, if the join procedure fails at the last final steps, re-running it might not help. To verify this, please list the current member nodes of your cluster and look for the one who failed.

kubectl get nodes

If the node that failed to join is nevertheless listed in the output as part of your cluster, then this is a bad indicator. This node cannot be reset back to a clean state anymore and it’s not safe to keep it online in this unknown state. As a result, instead of trying to fix its existing configuration either by hand or re-running the join/bootstrap command, we would highly recommend you to remove this node completely from your cluster and then replace it with a new one.

skuba node remove <NODE_NAME> --drain-timeout 5s

15.6 Error `x509: certificate signed by unknown authority` #

When interacting with Kubernetes, you might run into the situation where your existing configuration for the authentication has changed (cluster has been rebuild, certificates have been switched.) In such a case you might see an error message in the output of your CLI or Web browser.

x509: certificate signed by unknown authority

This message indicates that your current system does not know the Certificate Authority (CA) that signed the SSL certificates used for encrypting the communication to the cluster. You then need to add or update the Root CA certificate in your local trust store.

Obtain the root CA certificate from on of the Kubernetes cluster node, at the location /etc/kubernetes/pki/ca.crt
Copy the root CA certificate into your local machine directory /etc/pki/trust/anchors/
Update the cache for know CA certificates
```
sudo update-ca-certificates
```

15.7 Error `Invalid client credentials` #

When using Dex & Gangway for authentication, you might see the following error message in the Web browser output:

oauth2: cannot fetch token: 401 Unauthorized
Response: {"error":"invalid_client","error_description":"Invalid client credentials."}

This message indicates that your Kubernetes cluster Dex & Gangway client secret is out of sync.

15.7.1 Versions before SUSE CaaS Platform 4.2.2 #

Note

These steps apply to skuba ≤ 1.3.5

Please update the Dex & Gangway ConfigMap to use the same client secret.

kubectl -n kube-system get configmap oidc-dex-config -o yaml > oidc-dex-config.yaml
kubectl -n kube-system get configmap oidc-gangway-config -o yaml > oidc-gangway-config.yaml

Make sure the oidc’s secret in oidc-dex-config.yaml is the same as the clientSecret in oidc-gangway-config.yaml. Then, apply the updated ConfigMap.

kubectl replace -f oidc-dex-config.yaml
kubectl replace -f oidc-gangway-config.yaml

15.7.2 Versions after SUSE CaaS Platform 4.2.2 #

Note

These steps apply to skuba ≥ 1.4.1

If you have configured Dex via a kustomize patch, please update your patch to use secretEnv: OIDC_GANGWAY_CLIENT_SECRET. Change your patch as follows, from:

- id: oidc
  ...
  name: 'OIDC'
  secret: <client-secret>
  trustedPeers:
  - oidc-cli

- id: oidc
  ...
  name: 'OIDC'
  secretEnv: OIDC_GANGWAY_CLIENT_SECRET
  trustedPeers:
  - oidc-cli

Dex & Gangway will then use the same client secret.

15.8 Replacing a Lost Node #

If your cluster loses a node, for example due to failed hardware, remove the node as explained in Section 2.4, “Removing Nodes”. Then add a new node as described in Section 2.3, “Adding Nodes”.

15.9 Rebooting an Undrained Node with RBD Volumes Mapped #

Rebooting a cluster node always requires a preceding drain. In some cases, draining the nodes first might not be possible and some problem can occur during reboot if some RBD volumes are mapped to the nodes.

In this situation, apply the following steps.

Make sure kubelet and CRI-O are stopped:
```
systemctl stop kubelet crio
```
Unmount every RBD device /dev/rbd* before rebooting. For example:
```
umount -vAf /dev/rbd0
```

If there are several device mounted, this little script can be used to avoid manual unmounting:

#!/usr/bin/env bash

while grep "rbd" /proc/mounts > /dev/null 2>&1; do
  for dev in $(lsblk -p -o NAME | grep "rbd"); do
    if $(mountpoint -x $dev > /dev/null 2>&1); then
      echo ">>> umounting $dev"
      umount -vAf "$dev"
    fi
  done
done

15.10 ETCD Troubleshooting #

15.10.1 Introduction #

This document aims to describe debugging an etcd cluster.

The required etcd logs are part of the supportconfig, a utility that collects all the required information for debugging a problem. The rest of the document provides information on how you can obtain these information manually.

15.10.2 ETCD container #

ETCD is a distributed reliable key-value store for the most critical data of a distributed system. It is running only on the master nodes in a form a container application. For instance, in a cluster with 3 master nodes, it is expected to have 3 etcd instances as well:

kubectl get pods -n kube-system -l component=etcd
NAME                            READY   STATUS    RESTARTS   AGE
etcd-vm072044.qa.prv.suse.net   1/1     Running   1          7d
etcd-vm072050.qa.prv.suse.net   1/1     Running   1          7d
etcd-vm073033.qa.prv.suse.net   1/1     Running   1          7d

The specific configuration which etcd is using to start, is the following:

etcd \
      --advertise-client-urls=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2379 \
      --cert-file=/etc/kubernetes/pki/etcd/server.crt  \
      --client-cert-auth=true --data-dir=/var/lib/etcd \
      --initial-advertise-peer-urls=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2380 \
      --initial-cluster=vm072050.qa.prv.suse.net=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2380 \
      --key-file=/etc/kubernetes/pki/etcd/server.key \
      --listen-client-urls=https://127.0.0.1:2379,https://<YOUR_MASTER_NODE_IP_ADDRESS>:2379 \
      --listen-peer-urls=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2380 \
      --name=vm072050.qa.prv.suse.net \
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
      --peer-client-cert-auth=true \
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
      --snapshot-count=10000 --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Note

For more information related to ETCD, we highly recommend you to read ETCD FAQ page.

15.10.3 logging #

Since etcd is running in a container, that means it is not controlled by systemd, thus any commands related to that (e.g. journalctl) will fail, therefore you need to use container debugging approach instead.

Note

To use the following commands, you need to connect (e.g. via SSH) to the master node where the etcd pod is running.

To see the etcd logs, connect to a Kubernetes master node and then run as root:

ssh sles@<MASTER_NODE>
sudo bash # connect as root
etcdcontainer=$(crictl ps --label io.kubernetes.container.name=etcd --quiet)
crictl logs -f $etcdcontainer

15.10.4 etcdctl #

etcdctl is a command line client for etcd. The new version of SUSE CaaS Platform is using the v3 API. For that, you need to make sure to set environment variable ETCDCTL_API=3 before using it. Apart from that, you need to provide the required keys and certificates for authentication and authorization, via ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY environment variables. Last but not least, you need to also specify the endpoint via ETCDCTL_ENDPOINTS environment variable.

Example

To find out if your network and disk latency are fast enough, you can benchmark your node using the etcdctl check perf command. To do this, frist connect to a Kubernetes master node:

ssh sles@<MASTER_NODE>
sudo bash # login as root

and then run as root:

etcdcontainer=$(crictl ps --label io.kubernetes.container.name=etcd --quiet)
crictl exec $etcdcontainer sh -c \
"ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt' \
ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt' \
ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key' \
ETCDCTL_API=3 \
etcdctl check perf"

15.10.5 curl as an alternative #

For most of the etcdctl commands, there is an alternative way to fetch the same information via curl. First you need to connect to the master node and then issue a curl command against the ETCD endpoint. Here’s an example of the information which supportconfig is collecting:

Health check:

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/health

Member list

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/members

Leader information

# available only from the master node where ETCD **leader** runs
sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/leader

Current member information

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/self

Statistics

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/store

Metrics

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/metrics

15.11 Kubernetes debugging tips #

General guidelines and instructions: https://v1-18.docs.kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/
Troubleshooting applications: https://v1-18.docs.kubernetes.io/docs/tasks/debug-application-cluster/debug-application
Troubleshooting clusters: https://v1-18.docs.kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster
Debugging pods: https://v1-18.docs.kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller
Debugging services: https://v1-18.docs.kubernetes.io/docs/tasks/debug-application-cluster/debug-service

15.12 AWS Deployment fails with `cannot attach profile` error #

For SUSE CaaS Platform to be properly deployed, you need to have proper IAM role, role policy and instance profile set up in AWS. Under normal circumstances Terraform will be invoked by a user with suitable permissions during deployment and automatically create these profiles. If your access permissions on the AWS account forbid Terraform from creating the profiles automatically, they must be created before attempting deployment.

15.12.1 Create IAM Role, Role Policy, and Instance Profile through AWS CLI #

Users who do not have permission to create IAM role, role policy, and instance profile using Terraform, devops should create them for you, using the instructions below:

STACK_NAME: Cluster Stack Name

Install AWS CLI:

sudo zypper --gpg-auto-import-keys install -y aws-cli

Setup AWS credentials:
```
aws configure
```

Prepare role policy:

cat <<*EOF* >"./<STACK_NAME>-trust-policy.json"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
*EOF*

Prepare master instance policy:

cat <<*EOF* >"./<STACK_NAME>-master-role-trust-policy.json"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstances",
        "ec2:DescribeRegions",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:ModifyInstanceAttribute",
        "ec2:ModifyVolume",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CreateRoute",
        "ec2:DeleteRoute",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteVolume",
        "ec2:DetachVolume",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:DescribeVpcs",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:AttachLoadBalancerToSubnets",
        "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:CreateLoadBalancerPolicy",
        "elasticloadbalancing:CreateLoadBalancerListeners",
        "elasticloadbalancing:ConfigureHealthCheck",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:DeleteLoadBalancerListeners",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeLoadBalancerAttributes",
        "elasticloadbalancing:DetachLoadBalancerFromSubnets",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:CreateListener",
        "elasticloadbalancing:CreateTargetGroup",
        "elasticloadbalancing:DeleteListener",
        "elasticloadbalancing:DeleteTargetGroup",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancerPolicies",
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticloadbalancing:ModifyListener",
        "elasticloadbalancing:ModifyTargetGroup",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:SetLoadBalancerPoliciesOfListener",
        "iam:CreateServiceLinkedRole",
        "kms:DescribeKey"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
*EOF*

Prepare worker instance policy:

cat <<*EOF* >"./<STACK_NAME>-worker-role-trust-policy.json"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeRegions",
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}
*EOF*

Create roles:

aws iam create-role --role-name <STACK_NAME>_cpi_master --assume-role-policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-trust-policy.json
aws iam create-role --role-name <STACK_NAME>_cpi_worker --assume-role-policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-trust-policy.json

Create instance role policies:

aws iam put-role-policy --role-name <STACK_NAME>_cpi_master --policy-name <STACK_NAME>_cpi_master --policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-master-role-trust-policy.json
aws iam put-role-policy --role-name <STACK_NAME>_cpi_worker --policy-name <STACK_NAME>_cpi_worker --policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-worker-role-trust-policy.json

Create instance profiles:

aws iam create-instance-profile --instance-profile-name <STACK_NAME>_cpi_master
aws iam create-instance-profile --instance-profile-name <STACK_NAME>_cpi_worker

Add role to instance profiles:

aws iam add-role-to-instance-profile --role-name <STACK_NAME>_cpi_master --instance-profile-name <STACK_NAME>_cpi_master
aws iam add-role-to-instance-profile --role-name <STACK_NAME>_cpi_worker --instance-profile-name <STACK_NAME>_cpi_worker

Print this page

Administration Guide

15 Troubleshooting #

15.1 The supportconfig Tool #

Tip

Important

15.2 Cluster definition directory #

Warning

15.3 Log collection #

Note

15.4 Debugging SLES Nodes provision #

15.5 Debugging Cluster Deployment #

15.6 Error x509: certificate signed by unknown authority #

15.7 Error Invalid client credentials #

15.7.1 Versions before SUSE CaaS Platform 4.2.2 #

Note

15.7.2 Versions after SUSE CaaS Platform 4.2.2 #

Note

15.8 Replacing a Lost Node #

15.9 Rebooting an Undrained Node with RBD Volumes Mapped #

15.10 ETCD Troubleshooting #

15.10.1 Introduction #

15.10.2 ETCD container #

Note

15.10.3 logging #

Note

15.10.4 etcdctl #

15.10.5 curl as an alternative #

15.11 Kubernetes debugging tips #

15.12 AWS Deployment fails with cannot attach profile error #

15.12.1 Create IAM Role, Role Policy, and Instance Profile through AWS CLI #

15.1 The `supportconfig` Tool #

15.6 Error `x509: certificate signed by unknown authority` #

15.7 Error `Invalid client credentials` #

15.12 AWS Deployment fails with `cannot attach profile` error #