Monitoring GPUs in Kubernetes with DCGM

Originally published at: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/

Monitoring GPUs is critical for infrastructure or site reliability engineering (SRE) teams who manage large-scale GPU clusters for AI or HPC workloads. GPU metrics allow teams to understand workload behavior and thus optimize resource allocation and utilization, diagnose anomalies, and increase overall data center efficiency. Apart from infrastructure teams, you might also be interested in…

Hi everyone,

We look forward to hearing about your current monitoring solutions for GPUs in Kubernetes, and to your feedback on using DCGM!

Thanks for the nice post :)
I tried to use dcgm-exporter in a Kubernetes cluster (version 1.15) to find out the total GPU resource requests from pods and which pod is using the GPU.
However, our cluster uses GTX and RTX series GPUs, so dcgm-exporter reported 'Profiling is not supported for this group of GPUs or GPU'.
Do you have any idea how to use DCGM with GTX or RTX series GPUs, or any plan to support them?

Thanks for reading

Hi ydh0924,

Thanks for reading the blog. Unfortunately, the profiling metrics are limited to data center (previously "Tesla") branded GPUs such as A100, V100, and T4. However, you can still use dcgm-exporter to access the rest of the GPU telemetry on GTX and RTX series cards. To do so, override the 'arguments' variable and set it to null in the Helm chart during installation, for example on the command line:

helm install \
  --generate-name \
  gpu-helm-charts/dcgm-exporter \
  --set arguments=null
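
Alternatively, you can modify values.yaml directly in the Helm chart. Here is a minimal sketch of that approach, assuming the key is named arguments (as in the chart's default values) and using a hypothetical dcgm-values.yaml file:

# Write an override file that clears the chart's default arguments,
# so dcgm-exporter starts without the profiling metrics configuration
cat <<'EOF' > dcgm-values.yaml
arguments: null
EOF

helm install \
  --generate-name \
  gpu-helm-charts/dcgm-exporter \
  -f dcgm-values.yaml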

Doing so allows dcgm-exporter to expose all of the GPU telemetry except the profiling metrics, and it will no longer error during startup.
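
On your original question about which pod is using the GPU: once Prometheus is scraping dcgm-exporter, a query along the following lines can break GPU utilization down per pod. This is only a sketch; the Prometheus address is hypothetical, and the exact pod/namespace label names depend on your dcgm-exporter version and Prometheus relabeling configuration.

# Per-pod GPU utilization from dcgm-exporter metrics
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL)'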
Hope that helps!


Thanks for the reply!!
It works in my Prometheus now 😁