This feature is offered as a "tech preview".
We release this as a tech preview in order to get early feedback from our customers. Tech previews are largely untested and unsupported, and thus not ready for production use.
That said, we believe that releasing this technology at this stage is valuable: it allows us to make the right improvements based on your feedback. A fully supported, production-ready release is planned for a later date.
Graphics Processing Units (GPUs) provide a powerful way to run compute-intensive workloads such as machine learning pipelines. SUSE’s CaaS Platform supports scheduling GPU-dependent workloads on NVIDIA GPUs as a technical preview. This section illustrates how to prepare your host machine to expose GPU devices to your containers, and how to configure Kubernetes to schedule GPU-dependent workloads.
Not every worker node in the cluster needs to have a GPU device present. On the nodes that do have one or more NVIDIA GPUs, install the drivers from NVIDIA’s repository.
# zypper addrepo --refresh https://download.nvidia.com/suse/sle15sp2/ nvidia
# zypper refresh
# zypper install nvidia-glG05 nvidia-computeG05
For most modern NVIDIA GPUs, the G05 driver will support your device. Check NVIDIA’s documentation for your GPU device model.
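If you are unsure which model is installed, you can list the NVIDIA devices detected on the host first (a quick check; the exact output depends on your hardware):
# lspci | grep -i nvidia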
OCI hooks are a way for vendors or projects to inject executable actions into the lifecycle of a container managed by the container runtime (runc). SUSE provides an OCI hook for NVIDIA GPUs that enables the container runtime, and therefore the kubelet and the Kubernetes scheduler, to query the host system for the presence of a GPU device and access it directly. Install the hook on the worker nodes with GPUs:
# zypper install nvidia-container-toolkit
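To confirm the hook is in place before running any workloads, you can list the OCI hook definitions on the node. The path below is the usual location used by the nvidia-container-toolkit package; it is given here as an assumption and may differ on your system:
# ls /usr/share/containers/oci/hooks.d/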
At this point, you should be able to run a container image that requires a GPU and access the device directly from the running container, for example using Podman:
# podman run docker.io/nvidia/cuda nvidia-smi
If that is not working, check the following:
Ensure your GPU is visible from the host system:
# lspci | grep -i nvidia
# nvidia-smi
Ensure the kernel modules are loaded:
# lsmod | grep nvidia
If they are not, try loading them explicitly and check dmesg for an error indicating why they are missing:
# nvidia-modprobe
# dmesg | tail
The Kubernetes device plugin framework allows the kubelet to advertise system hardware resources that the Kubernetes scheduler can then use as hints to schedule workloads that require such devices. The Kubernetes device plugin from NVIDIA allows the kubelet to advertise NVIDIA GPUs it finds present on the worker node. Install the device plugin using kubectl:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
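Once the device plugin is running, each worker node with a GPU should advertise the nvidia.com/gpu resource in its capacity and allocatable fields. As a quick check (worker0 is just an example node name), run:
$ kubectl describe node worker0 | grep nvidia.com/gpu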
In a heterogeneous cluster, it may be preferable to prevent scheduling pods that do not require a GPU on nodes with a GPU, in order to ensure that GPU workloads are not competing for time on the hardware they need to run. To accomplish this, add a taint to the worker nodes that have GPUs:
$ kubectl taint nodes worker0 nvidia.com/gpu=:PreferNoSchedule
or
$ kubectl taint nodes worker0 nvidia.com/gpu=:NoSchedule
See the Kubernetes documentation on taints and tolerations for a discussion of the considerations for using the NoSchedule or PreferNoSchedule effects. If you use the NoSchedule effect, you must also add the appropriate toleration to infrastructure-critical DaemonSets that must run on all nodes, such as the kured, kube-proxy, and cilium DaemonSets.
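For reference, a toleration matching the NoSchedule taint above would look like the following snippet in the pod template of the DaemonSet you are editing (shown here as a sketch, not taken from the upstream manifests):
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule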
The ExtendedResourceToleration admission controller is enabled on SUSE CaaS Platform v5 by default. This is a mutating admission controller that reviews all pod requests and adds tolerations to any pod that requests an extended resource advertised by a device plugin. For the NVIDIA GPU device plugin, it will automatically add the nvidia.com/gpu toleration to pods that request the nvidia.com/gpu resource, so you will not need to add this toleration explicitly for every GPU workload.
To test your installation you can create a pod that requests GPU devices:
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  - name: digits-container
    image: nvidia/digits:6.0
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
EOF
This example requests a total of two GPUs for two containers. If two GPUs are available on a worker in your cluster, this pod will be scheduled to that worker.
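Once the pod has been created, you can optionally confirm that the ExtendedResourceToleration admission controller added the nvidia.com/gpu toleration automatically (a simple check using the pod name from the example above):
$ kubectl get pod gpu-pod -o jsonpath='{.spec.tolerations}'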
After a few moments, your pod should transition to the "Running" state. If it does not, check the following:
Examine the pod events for an indication of why it is not being scheduled:
$ kubectl describe pod gpu-pod
Examine the events for the device plugin DaemonSet for any issues:
$ kubectl describe daemonset nvidia-device-plugin-daemonset --namespace kube-system
Check the logs of each pod in the DaemonSet running on a worker that has a GPU:
$ kubectl logs -l name=nvidia-device-plugin-ds --namespace kube-system
Check the kubelet log on the worker node that has a GPU. This may reveal errors the container runtime encountered while executing the OCI hook command:
# journalctl -u kubelet
If you have configured Monitoring for your cluster, you may want to use NVIDIA’s Data Center GPU Manager (DCGM) to monitor your GPUs. DCGM integrates with the Prometheus and Grafana services configured for your cluster. Follow the steps below to configure the Prometheus exporter and Grafana dashboard for your NVIDIA GPUs.
The DCGM exporter requires use of the hostPath volume type to access the kubelet socket on the host worker node. Create an appropriate Pod Security Policy and RBAC configuration to allow this:
$ kubectl apply -f - <<EOF
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: nvidia.dcgm
spec:
  privileged: false
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  allowedHostPaths:
  - pathPrefix: /var/lib/kubelet/pod-resources
  volumes:
  - hostPath
  - configMap
  - secret
  - emptyDir
  - downwardAPI
  - projected
  - persistentVolumeClaim
  - nfs
  - rbd
  - cephFS
  - glusterfs
  - fc
  - iscsi
  - cinder
  - gcePersistentDisk
  - awsElasticBlockStore
  - azureDisk
  - azureFile
  - vsphereVolume
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nvidia:dcgm
rules:
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  verbs:
  - use
  resourceNames:
  - nvidia.dcgm
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nvidia:dcgm
roleRef:
  kind: ClusterRole
  name: nvidia:dcgm
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  name: system:serviceaccounts:dcgm
  apiGroup: rbac.authorization.k8s.io
EOF
The DCGM exporter monitors GPUs on each worker node and exposes metrics that can be queried.
$ kubectl create namespace dcgm
$ kubectl create --namespace dcgm -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/dcgm-exporter.yaml
Check that the metrics are being collected:
$ NAME=$(kubectl get pods --namespace dcgm -l "app.kubernetes.io/name=dcgm-exporter" -o "jsonpath={ .items[0].metadata.name}")
$ kubectl port-forward --namespace dcgm $NAME 8080:9400
$ # in another terminal
$ curl http://127.0.0.1:8080/metrics
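The output should include the DCGM metrics exposed by the exporter. For example, assuming the exporter’s default metric set, you can filter for GPU utilization:
$ curl -s http://127.0.0.1:8080/metrics | grep DCGM_FI_DEV_GPU_UTIL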
After deploying Prometheus as explained in Monitoring, configure Prometheus to monitor the DCGM pods. Gather the cluster IPs of the pods to monitor:
$ kubectl get pods --namespace dcgm -l "app.kubernetes.io/name=dcgm-exporter" -o "jsonpath={ .items[*].status.podIP}"
10.244.1.10 10.244.2.68
Add the DCGM pods to Prometheus’s scrape configuration. Edit the Prometheus configmap:
$ kubectl edit --namespace monitoring configmap prometheus-server
Under the scrape_configs section, add a new job using the pod IPs found above:
scrape_configs:
...
- job_name: dcgm
  static_configs:
  - targets: ['10.244.1.10:9400', '10.244.2.68:9400']
...
Prometheus will automatically reload the new configuration.
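To verify that the new targets are being scraped, you can run a query such as the following in the Prometheus expression browser; each DCGM exporter target should report the value 1 (the job label matches the job_name configured above):
up{job="dcgm"}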
Import the DCGM Exporter dashboard into Grafana.
In the Grafana web interface, open the dashboard import page. In the Import via grafana.com field, enter the dashboard ID 12219, and click Load.
Alternatively, download the dashboard JSON definition and upload it using the JSON upload button.
On the next page, select your Prometheus data source from the dropdown menu and click Import.
The new dashboard will appear in the Grafana web interface.