Cloud stacks are complex, and debugging deployment issues often requires digging through multiple layers to find the information you need. Remember that the KubeCF releases must be deployed in the correct order, and that each release must deploy successfully, with no failed pods, before deploying the next release.
Before proceeding with in-depth troubleshooting, ensure the following requirements, as defined in the Support Statement at Section 5.2, “Platform Support”, have been met.
The Kubernetes cluster satisfies the requirements listed at https://documentation.suse.com/suse-cap/2.1.1/html/cap-guides/cha-cap-depl-kube-requirements.html#sec-cap-changes-kube-reqs.
The kube-ready-state-check.sh script has been run on
the target Kubernetes cluster and does not show any configuration problems.
A SUSE Services or Sales Engineer has verified that SUSE Cloud Application Platform works correctly on the target Kubernetes cluster.
There are two types of logs in a deployment of SUSE Cloud Application Platform: application logs and component logs. The following provides a brief overview of each log type and how to retrieve them for monitoring and debugging use.
Application logs provide information specific to a given application that has been deployed to your Cloud Application Platform cluster and can be accessed through:
The cf CLI, using the cf logs command (see the example after this list)
The application's log stream within the Stratos console
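For example, to stream the log output of an application with the cf CLI, or to dump its most recent logs (the application name my-app is only a placeholder):
tux > cf logs my-app
tux > cf logs my-app --recent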
Access to logs for a given component of your Cloud Application Platform deployment can be obtained by:
The kubectl logs command
The following example retrieves the logs of the router container of the router-0 pod in the kubecf namespace:
tux > kubectl logs --namespace kubecf router-0 router
Direct access to the log files, using the following steps:
Open a shell to the container of the component using the
kubectl exec command
Navigate to the log directory at
/var/vcap/sys/log; it contains
subdirectories with the log files.
tux > kubectl exec --stdin --tty --namespace kubecf router-0 -- /bin/bash
router/0:/# cd /var/vcap/sys/log
router/0:/var/vcap/sys/log# ls -R
.:
gorouter loggregator_agent
./gorouter:
access.log gorouter.err.log gorouter.log post-start.err.log post-start.log
./loggregator_agent:
agent.log
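A single log file can also be followed without opening an interactive shell. A minimal sketch, using the gorouter log shown above (you may need to select the container explicitly with --container router):
tux > kubectl exec --namespace kubecf router-0 -- tail -f /var/vcap/sys/log/gorouter/gorouter.log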
If you ever need to request support, or just want to generate detailed
system information and logs, use the supportconfig
utility. Run it with no options to collect basic system information, and
also cluster logs including Docker, etcd, flannel, and Velum.
supportconfig may give you all the information you need.
supportconfig -h prints the options. Read the "Gathering
System Information for Support" chapter in any SUSE Linux Enterprise Administration Guide to
learn more.
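For example, a basic run as root might look like the following; by default the collected data is written to an archive, typically under /var/log:
tux > sudo supportconfig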
A deployment step seems to take too long, or you see that some pods are not in a ready state hours after all the others are ready, or a pod shows a lot of restarts. This example shows not-ready pods many hours after the others have become ready:
tux > kubectl get pods --namespace kubecf
NAME READY STATUS RESTARTS AGE
router-3137013061-wlhxb 0/1 Running 0 16h
routing-api-0 0/1 Running 0 16h
The Running status means the pod is bound to a node and
all of its containers have been created. However, it is not
Ready, which means it is not ready to service requests.
Use kubectl to print a detailed description of pod events
and status:
tux > kubectl describe pod --namespace kubecf router-0
This prints a lot of information, including IP addresses, routine events, warnings, and errors. You should find the reason for the failure in this output.
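Recent cluster events can also be listed directly, which is sometimes quicker than scanning the full pod description; for example, sorted by creation time:
tux > kubectl get events --namespace kubecf --sort-by=.metadata.creationTimestamp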
During deployment, pods are spawned over time, starting with a single
pod whose name starts with ig-. This pod will eventually
disappear and will be replaced by other pods whose progress
then can be followed as usual.
The whole process can take around 20 to 30 minutes to finish.
The initial stage may look like this:
tux > kubectl get pods --namespace kubecf
NAME                               READY   STATUS    RESTARTS   AGE
ig-kubecf-f9085246244fbe70-jvg4z   1/21    Running   0          8m28s
Later the progress may look like this:
NAME                        READY   STATUS       RESTARTS   AGE
adapter-0                   4/4     Running      0          6m45s
api-0                       0/15    Init:30/63   0          6m38s
bits-0                      0/6     Init:8/15    0          6m34s
bosh-dns-7787b4bb88-2wg9s   1/1     Running      0          7m7s
bosh-dns-7787b4bb88-t42mh   1/1     Running      0          7m7s
cc-worker-0                 0/4     Init:5/9     0          6m36s
credhub-0                   0/5     Init:6/11    0          6m33s
database-0                  2/2     Running      0          6m36s
diego-api-0                 6/6     Running      2          6m38s
doppler-0                   0/9     Init:7/16    0          6m40s
eirini-0                    9/9     Running      0          6m37s
log-api-0                   0/7     Init:6/13    0          6m35s
nats-0                      4/4     Running      0          6m39s
router-0                    0/5     Init:5/11    0          6m33s
routing-api-0               0/4     Init:5/10    0          6m42s
scheduler-0                 0/8     Init:8/17    0          6m35s
singleton-blobstore-0       0/6     Init:6/11    0          6m46s
tcp-router-0                0/5     Init:5/11    0          6m37s
uaa-0                       0/6     Init:8/13    0          6m36s
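To follow the rollout without re-running the command, the pod list can also be watched; for example:
tux > kubectl get pods --namespace kubecf --watch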
There may be times when you want to delete and rebuild a deployment, for
example when there are errors in your kubecf-config-values.yaml file, you wish to
test configuration changes, or a deployment fails and you want to try it again.
Remove the kubecf release. All resources associated with
the release of the suse/kubecf chart will be removed.
Replace the example release name with the one used during your installation.
tux > helm uninstall kubecf
Remove the kubecf namespace. Replace with the namespace
where the suse/kubecf chart was installed.
tux > kubectl delete namespace kubecf
Remove the cf-operator release. All resources associated
with the release of the suse/cf-operator chart will be
removed. Replace the example release name with the one used during your
installation.
tux > helm uninstall cf-operator
Remove the cf-operator namespace. Replace with the namespace
where the suse/cf-operator chart was installed.
tux > kubectl delete namespace cf-operator
Verify all of the releases are removed.
tux > helm list --all-namespaces
Verify all of the namespaces are removed.
tux > kubectl get namespaces
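Namespace deletion can take a while to complete. Before redeploying, it can help to confirm that the old namespaces are fully gone; for example, the following command is expected to report that both namespaces are not found once removal has finished:
tux > kubectl get namespace kubecf cf-operator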
You can safely query with kubectl to get information
about resources inside your Kubernetes cluster. kubectl cluster-info
dump | tee clusterinfo.txt outputs a large amount of information
about the Kubernetes master and cluster services to a text file.
The following commands give more targeted information about your cluster.
List all cluster resources:
tux > kubectl get all --all-namespaces
List all of your running pods:
tux > kubectl get pods --all-namespaces
List all of your running pods, their internal IP addresses, and which Kubernetes nodes they are running on:
tux > kubectl get pods --all-namespaces --output wide
See all pods, including those with Completed or Failed statuses:
tux > kubectl get pods --show-all --all-namespaces
List pods in one namespace:
tux > kubectl get pods --namespace kubecf
Get detailed information about one pod:
tux > kubectl describe --namespace kubecf po/diego-cell-0
Read the log file of a pod:
tux > kubectl logs --namespace kubecf po/diego-cell-0
List all Kubernetes nodes, then print detailed information about a single node:
tux > kubectl get nodes
tux > kubectl describe node 6a2752b6fab54bb889029f60de6fa4d5.infra.caasp.local
List all containers in all namespaces, formatted for readability:
tux > kubectl get pods --all-namespaces --output jsonpath="{..image}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c
These two commands check node capacities, to verify that there are enough resources for the pods:
tux > kubectl get nodes --output yaml | grep '\sname\|cpu\|memory'
tux > kubectl get nodes --output json | \
jq '.items[] | {name: .metadata.name, cap: .status.capacity}'
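To compare node capacity with what scheduled pods have already requested, the allocation summary in the node descriptions can be inspected; a minimal example:
tux > kubectl describe nodes | grep --after-context=8 'Allocated resources'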
When switching back to Diego from Eirini, the error below can occur:
tux > helm install kubecf suse/kubecf --namespace kubecf --values kubecf-config-values.yaml
Error: admission webhook "validate-boshdeployment.quarks.cloudfoundry.org" denied the request: Failed to resolve manifest: Failed to interpolate ops 'kubecf-user-provided-properties' for manifest 'kubecf': Applying ops on manifest obj failed in interpolator: Expected to find exactly one matching array item for path '/instance_groups/name=eirini' but found 0
To avoid this error, remove the eirini-persi-broker configuration
before running the command.
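The eirini-persi-broker block is typically found in kubecf-config-values.yaml and looks similar to the following sketch. The keys and values shown here are only illustrative and depend on how Eirini persistence was configured in your deployment; remove whatever eirini-persi-broker block exists in your file.
eirini-persi-broker:
  service_plans:
  - id: default
    name: default
    description: Existing default storage class
    kube_storage_class: persistent
    free: true
    default_size: "1Gi"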
When running a Helm command, an error occurs stating that a namespace does not
exist. To avoid this error, create the namespace manually with kubectl before
running the command:
tux > kubectl create namespace name
The log-cache component currently has a memory allocation issue: the memory available on the node is reported instead of the memory assigned to the container by cgroups. As a result, log-cache allocates memory based on the node values, which can cause a range of issues (OOMKills, performance degradation, and so on). To address this, use node affinity to tie log-cache to nodes of a uniform size, and then declare the cache percentage based on that size. A limit of 3% has been identified as sufficient.
Add the following to your kubecf-config-values.yaml. In the node affinity
configuration, the values for key and
values may need to be changed depending on how the nodes in
your cluster are labeled. For more information on labels, see
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels.
properties:
  log-cache:
    log-cache:
      memory_limit_percent: 3

operations:
  inline:
  - type: replace
    path: /instance_groups/name=log-cache/env?/bosh/agent/settings/affinity
    value:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - LABEL_VALUE_OF_NODE
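After adding this to kubecf-config-values.yaml, apply the change by upgrading (or freshly installing) the kubecf release; for example, assuming the release and namespace names used earlier in this chapter:
tux > helm upgrade kubecf suse/kubecf --namespace kubecf --values kubecf-config-values.yaml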