Cloud stacks are complex, and debugging deployment issues often requires digging through multiple layers to find the information you need. Remember that the KubeCF releases must be deployed in the correct order, and that each release must deploy successfully, with no failed pods, before deploying the next release.
Before proceeding with in-depth troubleshooting, ensure the following requirements, as defined in the Support Statement at Section 5.2, “Platform Support”, have been met.
The Kubernetes cluster satisfies the requirements listed at https://documentation.suse.com/suse-cap/2.1.1/html/cap-guides/cha-cap-depl-kube-requirements.html#sec-cap-changes-kube-reqs.
The kube-ready-state-check.sh script has been run on
the target Kubernetes cluster and does not show any configuration problems.
A SUSE Services or Sales Engineer has verified that SUSE Cloud Application Platform works correctly on the target Kubernetes cluster.
There are two types of logs in a deployment of SUSE Cloud Application Platform: application logs and component logs. The following provides a brief overview of each log type and how to retrieve them for monitoring and debugging use.
Application logs provide information specific to a given application that has been deployed to your Cloud Application Platform cluster and can be accessed through:
The cf CLI, using the cf logs command (see the example after this list)
The application's log stream within the Stratos console
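For example, to stream the log output of an application with the cf CLI, or to dump its most recent logs (the application name my-app is only a placeholder):
tux > cf logs my-app
tux > cf logs my-app --recent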
Access to logs for a given component of your Cloud Application Platform deployment can be obtained by:
The kubectl logs command
The following example retrieves the logs of the router container of the router-0 pod in the kubecf namespace:
tux > kubectl logs --namespace kubecf router-0 router
Direct access to the log files, using the following steps:
Open a shell to the container of the component using the
kubectl exec command
Navigate to the log directory at
/var/vcap/sys/log; it contains
subdirectories with the log files.
tux > kubectl exec --stdin --tty --namespace kubecf router-0 -- /bin/bash
router/0:/# cd /var/vcap/sys/log
router/0:/var/vcap/sys/log# ls -R
.:
gorouter loggregator_agent
./gorouter:
access.log gorouter.err.log gorouter.log post-start.err.log post-start.log
./loggregator_agent:
agent.log
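A single log file can also be followed without opening an interactive shell. A minimal sketch, using the gorouter log shown above (you may need to select the container explicitly with --container router):
tux > kubectl exec --namespace kubecf router-0 -- tail -f /var/vcap/sys/log/gorouter/gorouter.log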
If you ever need to request support, or just want to generate detailed
system information and logs, use the supportconfig
utility. Run it with no options to collect basic system information, and
also cluster logs including Docker, etcd, flannel, and Velum.
supportconfig may give you all the information you need.
supportconfig -h prints the options. Read the "Gathering
System Information for Support" chapter in any SUSE Linux Enterprise Administration Guide to
learn more.
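For example, a basic run as root might look like the following; by default the collected data is written to an archive, typically under /var/log:
tux > sudo supportconfig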
A deployment step seems to take too long, or you see that some pods are not in a ready state hours after all the others are ready, or a pod shows a lot of restarts. This example shows not-ready pods many hours after the others have become ready:
tux > kubectl get pods --namespace kubecf
NAME READY STATUS RESTARTS AGE
router-3137013061-wlhxb 0/1 Running 0 16h
routing-api-0 0/1 Running 0 16h
The Running status means the pod is bound to a node and
all of its containers have been created. However, it is not
Ready, which means it is not ready to service requests.
Use kubectl to print a detailed description of pod events
and status:
tux > kubectl describe pod --namespace kubecf router-0
This prints a lot of information, including IP addresses, routine events, warnings, and errors. You should find the reason for the failure in this output.
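Recent cluster events can also be listed directly, which is sometimes quicker than scanning the full pod description; for example, sorted by creation time:
tux > kubectl get events --namespace kubecf --sort-by=.metadata.creationTimestamp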
During deployment, pods are spawned over time, starting with a single
pod whose name starts with ig-. This pod will eventually
disappear and will be replaced by other pods whose progress
then can be followed as usual.
The whole process can take around 20 to 30 minutes to finish.
The initial stage may look like this:
tux > kubectl get pods --namespace kubecf
NAME                               READY   STATUS    RESTARTS   AGE
ig-kubecf-f9085246244fbe70-jvg4z   1/21    Running   0          8m28s
Later the progress may look like this:
NAME                        READY   STATUS       RESTARTS   AGE
adapter-0                   4/4     Running      0          6m45s
api-0                       0/15    Init:30/63   0          6m38s
bits-0                      0/6     Init:8/15    0          6m34s
bosh-dns-7787b4bb88-2wg9s   1/1     Running      0          7m7s
bosh-dns-7787b4bb88-t42mh   1/1     Running      0          7m7s
cc-worker-0                 0/4     Init:5/9     0          6m36s
credhub-0                   0/5     Init:6/11    0          6m33s
database-0                  2/2     Running      0          6m36s
diego-api-0                 6/6     Running      2          6m38s
doppler-0                   0/9     Init:7/16    0          6m40s
eirini-0                    9/9     Running      0          6m37s
log-api-0                   0/7     Init:6/13    0          6m35s
nats-0                      4/4     Running      0          6m39s
router-0                    0/5     Init:5/11    0          6m33s
routing-api-0               0/4     Init:5/10    0          6m42s
scheduler-0                 0/8     Init:8/17    0          6m35s
singleton-blobstore-0       0/6     Init:6/11    0          6m46s
tcp-router-0                0/5     Init:5/11    0          6m37s
uaa-0                       0/6     Init:8/13    0          6m36s
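To follow the rollout without re-running the command, the pod list can also be watched; for example:
tux > kubectl get pods --namespace kubecf --watch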
There may be times when you want to delete and rebuild a deployment, for
example when there are errors in your kubecf-config-values.yaml file, you wish to
test configuration changes, or a deployment fails and you want to try it again.
Remove the kubecf release. All resources associated with
the release of the suse/kubecf chart will be removed.
Replace the example release name with the one used during your installation.
tux > helm uninstall kubecf
Remove the kubecf namespace. Replace with the namespace
where the suse/kubecf chart was installed.
tux > kubectl delete namespace kubecf
Remove the cf-operator release. All resources associated
with the release of the suse/cf-operator chart will be
removed. Replace the example release name with the one used during your
installation.
tux > helm uninstall cf-operator
Remove the cf-operator namespace. Replace with the namespace
where the suse/cf-operator chart was installed.
tux > kubectl delete namespace cf-operator
Verify all of the releases are removed.
tux > helm list --all-namespaces
Verify all of the namespaces are removed.
tux > kubectl get namespaces
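Namespace deletion can take a while to complete. Before redeploying, it can help to confirm that the old namespaces are fully gone; for example, the following command is expected to report that both namespaces are not found once removal has finished:
tux > kubectl get namespace kubecf cf-operator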
You can safely query with kubectl to get information
about resources inside your Kubernetes cluster. kubectl cluster-info
dump | tee clusterinfo.txt outputs a large amount of information
about the Kubernetes master and cluster services to a text file.
The following commands give more targeted information about your cluster.
List all cluster resources:
tux > kubectl get all --all-namespaces
List all of your running pods:
tux > kubectl get pods --all-namespaces
List all of your running pods, their internal IP addresses, and which Kubernetes nodes they are running on:
tux > kubectl get pods --all-namespaces --output wide
See all pods, including those with Completed or Failed statuses:
tux > kubectl get pods --show-all --all-namespaces
List pods in one namespace:
tux > kubectl get pods --namespace kubecf
Get detailed information about one pod:
tux > kubectl describe --namespace kubecf po/diego-cell-0
Read the log file of a pod:
tux > kubectl logs --namespace kubecf po/diego-cell-0
List all Kubernetes nodes, then print detailed information about a single node:
tux > kubectl get nodes
tux > kubectl describe node 6a2752b6fab54bb889029f60de6fa4d5.infra.caasp.local
List all containers in all namespaces, formatted for readability:
tux > kubectl get pods --all-namespaces --output jsonpath="{..image}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c
These two commands check node capacities, to verify that there are enough resources for the pods:
tux > kubectl get nodes --output yaml | grep '\sname\|cpu\|memory'
tux > kubectl get nodes --output json | \
jq '.items[] | {name: .metadata.name, cap: .status.capacity}'
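To compare node capacity with what scheduled pods have already requested, the allocation summary in the node descriptions can be inspected; a minimal example:
tux > kubectl describe nodes | grep --after-context=8 'Allocated resources'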
When switching back to Diego from Eirini, the error below can occur:
tux > helm install kubecf suse/kubecf --namespace kubecf --values kubecf-config-values.yaml
Error: admission webhook "validate-boshdeployment.quarks.cloudfoundry.org" denied the request: Failed to resolve manifest: Failed to interpolate ops 'kubecf-user-provided-properties' for manifest 'kubecf': Applying ops on manifest obj failed in interpolator: Expected to find exactly one matching array item for path '/instance_groups/name=eirini' but found 0
To avoid this error, remove the eirini-persi-broker configuration
before running the command.
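The eirini-persi-broker block is typically found in kubecf-config-values.yaml and looks similar to the following sketch. The keys and values shown here are only illustrative and depend on how Eirini persistence was configured in your deployment; remove whatever eirini-persi-broker block exists in your file.
eirini-persi-broker:
  service_plans:
  - id: default
    name: default
    description: Existing default storage class
    kube_storage_class: persistent
    free: true
    default_size: "1Gi"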
When running a Helm command, an error occurs stating that a namespace does not
exist. To avoid this error, create the namespace manually with kubectl before
running the command:
tux > kubectl create namespace name
The log-cache component currently has a memory allocation issue: the memory available on the node is reported instead of the memory assigned to the container by cgroups. As a result, log-cache allocates memory based on the node values, which can cause a range of issues (OOMKills, performance degradation, and so on). To address this, use node affinity to tie log-cache to nodes of a uniform size, and then declare the cache percentage based on that size. A limit of 3% has been identified as sufficient.
Add the following to your kubecf-config-values.yaml. In the node affinity
configuration, the values for key and
values may need to be changed depending on how the nodes in
your cluster are labeled. For more information on labels, see
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels.
properties:
  log-cache:
    log-cache:
      memory_limit_percent: 3

operations:
  inline:
  - type: replace
    path: /instance_groups/name=log-cache/env?/bosh/agent/settings/affinity
    value:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - LABEL_VALUE_OF_NODE
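After adding this to kubecf-config-values.yaml, apply the change by upgrading (or freshly installing) the kubecf release; for example, assuming the release and namespace names used earlier in this chapter:
tux > helm upgrade kubecf suse/kubecf --namespace kubecf --values kubecf-config-values.yaml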