Monitoring GPUs in Kubernetes with DCGM

Originally published at: https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/

Monitoring GPUs is critical for infrastructure or site reliability engineering (SRE) teams who manage large-scale GPU clusters for AI or HPC workloads. GPU metrics allow teams to understand workload behavior and thus optimize resource allocation and utilization, diagnose anomalies, and increase overall data center efficiency. Apart from infrastructure teams, you might also be interested in…

Hi everyone,

We look forward to hearing about your current monitoring solutions for GPUs in Kubernetes, and to your feedback on using DCGM!

Thanks for the nice post :)
I tried to use dcgm-exporter in a Kubernetes cluster (version 1.15) to find out the total GPU resource requests from pods and which pod is using the GPU.
However, our cluster uses GTX and RTX series GPUs, so dcgm-exporter reported 'Profiling is not supported for this group of GPUs or GPU'.
Do you have any idea how to use DCGM with GTX or RTX series GPUs, or any plan to support them?

Thanks for reading

Hi ydh0924,

Thanks for reading the blog. Unfortunately, the profiling metrics are limited to data center (previously "Tesla") branded GPUs such as A100, V100, and T4. However, you can still use dcgm-exporter to access the rest of the GPU telemetry on GTX and RTX series cards. To do so, override the 'arguments' variable and set it to null in the Helm chart during installation, for example on the command line:

helm install \
  --generate-name \
  gpu-helm-charts/dcgm-exporter \
  --set arguments=null
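
Alternatively, you can modify values.yaml directly in the Helm chart. Here is a minimal sketch of that approach, assuming the key is named arguments (as in the chart's default values) and using a hypothetical dcgm-values.yaml file:

# Write an override file that clears the chart's default arguments,
# so dcgm-exporter starts without the profiling metrics configuration
cat <<'EOF' > dcgm-values.yaml
arguments: null
EOF

helm install \
  --generate-name \
  gpu-helm-charts/dcgm-exporter \
  -f dcgm-values.yaml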

Doing so allows dcgm-exporter to expose all of the GPU telemetry except the profiling metrics, and it will no longer error during startup.
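
On your original question about which pod is using the GPU: once Prometheus is scraping dcgm-exporter, a query along the following lines can break GPU utilization down per pod. This is only a sketch; the Prometheus address is hypothetical, and the exact pod/namespace label names depend on your dcgm-exporter version and Prometheus relabeling configuration.

# Per-pod GPU utilization from dcgm-exporter metrics
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL)'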
Hope that helps!


Thanks for the reply!!
It works in my Prometheus now 😁