Originally published at: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/
Monitoring GPUs is critical for infrastructure or site reliability engineering (SRE) teams who manage large-scale GPU clusters for AI or HPC workloads. GPU metrics allow teams to understand workload behavior and thus optimize resource allocation and utilization, diagnose anomalies, and increase overall data center efficiency. Apart from infrastructure teams, you might also be interested in…
Hi everyone,
We look forward to hearing about your current monitoring solutions for GPUs in Kubernetes and to your feedback on using DCGM!
Thanks for the nice post :)
I tried to use dcgm-exporter in a Kubernetes cluster (version 1.15) to find out the total GPU resources requested by pods and which pod is using a GPU.
However, our cluster uses GTX and RTX series cards, so dcgm-exporter reported 'Profiling is not supported for this group of GPUs or GPU'.
Do you have any ideas for using DCGM with GTX or RTX series cards, or are there plans to support these GPUs?
Thanks for reading
Hi ydh0924,
Thanks for reading the blog. Unfortunately, the profiling metrics are limited to data center (previously “Tesla”) branded GPUs such as the A100, V100, and T4. However, you can still use dcgm-exporter to access the other GPU telemetry on GTX and RTX series cards. For this, you will have to override the ‘arguments’ variable and set it to null in the Helm chart during installation, either on the command line as shown below or by modifying values.yaml directly in the Helm chart.
helm install \
--generate-name \
gpu-helm-charts/dcgm-exporter \
--set arguments=null
Doing so will allow dcgm-exporter to give you all of the GPU telemetry (just not the profiling metrics) without raising an error during startup.
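If you would rather edit the chart than pass --set, a minimal sketch of the equivalent change in values.yaml is shown below; the exact field layout depends on the dcgm-exporter chart version you deploy, so treat it as illustrative:
# values.yaml for the dcgm-exporter Helm chart (illustrative sketch)
# Setting 'arguments' to null (or an empty list) drops the profiling-metrics
# flags so that dcgm-exporter starts cleanly on GTX/RTX cards.
arguments: null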
Hope that helps!
Thanks for the reply!!
It works in my Prometheus now 😁
Hi,
I just installed DCGM on two Linux GPU servers, but I do not see an instance drop-down in Grafana to select GPU server 1 or GPU server 2. How can I do that? I do see data for both servers being scraped.
Thanks,
Anil
Hello. This post explains how to use DCGM to show the usage of the GPUs allocated to each node.
But is there a way to know how much each GPU is used by the pod it is assigned to?
For example, Node A is using 4 GPUs, and Pod A and Pod B each use 2 of Node A’s GPUs.
In this case, I want to know the GPU usage of Pod A. If Pod A has 100% GPU usage and Pod B has 0% usage, then the total GPU usage on Node A will be 50%.
Can I know the GPU usage of each pod to which a GPU is assigned?
Hi @user109130
dcgm-exporter takes advantage of the pod resources API provided by the kubelet to associate pods with the specific devices (or resources) assigned to them. You can read more about this API, which graduated to GA in v1.20, here:
https://kubernetes.io/blog/2020/12/16/third-party-device-metrics-reaches-ga/
If you use the default daemonset provided for dcgm-exporter, you will observe that we volume-mount the kubelet’s pod-resources directory to access this API:
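The snippet below is a rough sketch of the relevant part of the DaemonSet pod spec; the exact volume name and paths come from whichever version of the dcgm-exporter manifests or Helm chart you deploy, so check those rather than copying this verbatim.
# Illustrative excerpt of the dcgm-exporter DaemonSet pod spec
spec:
  containers:
    - name: dcgm-exporter
      volumeMounts:
        # Mount the kubelet pod-resources socket directory so the exporter
        # can map each GPU to the pod it is assigned to.
        - name: pod-gpu-resources
          readOnly: true
          mountPath: /var/lib/kubelet/pod-resources
  volumes:
    - name: pod-gpu-resources
      hostPath:
        path: /var/lib/kubelet/pod-resources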
Once you deploy dcgm-exporter, you should be able to see exactly which GPU is assigned to each pod, and thus the metrics for that specific GPU (even if a node has multiple GPUs). We also emit the node name as part of the dcgm-exporter output, making it easy for scrapers to determine exactly which GPUs within that node were assigned to pods, along with their metrics. An example dcgm-exporter output is shown below, where we print the GPU UUID, the hostname, and the power usage of the GPU assigned to the pod:
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-f2767511-aa82-fbeb-7f4b-9a1bec879e78",device="nvidia0",modelName="Tesla V100-SXM2-16GB",Hostname="dcgm-exporter-1641419650-jxqn2",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 24.523000
Please use the latest release of dcgm-exporter (2.3.1-2.6.1), as it contains the most recent features and bug fixes.
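If you install via the Helm chart and want to pin that release explicitly, a sketch of the image section of values.yaml might look like the following; the repository path and tag string are assumptions based on typical dcgm-exporter releases, so confirm them against the chart’s own values.yaml:
# Illustrative values.yaml excerpt for pinning the dcgm-exporter image
image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter   # assumed default registry path
  tag: 2.3.1-2.6.1-ubuntu20.04                   # assumed tag for release 2.3.1-2.6.1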
Thanks for reading our blog post!
Why do two GPUs appear in the Grafana display in an environment with only one GPU while a fine-tuning job is loading, and why does the GPU utilization shown range from 0% to 100%?