NVIDIA Developer Forums

Configuring DCGM in production

rplsbo August 26, 2024, 11:42pm 1

Hi, I am interested in setting up DCGM on Kubernetes to monitor GPU health. I was hoping to get answers to these questions:

How much overlap is there between nvidia-smi and dcgmi diag / dcgmi health?
How does dcgmi health work? Does it run any diagnostics, or does it passively monitor system logs?
Which of the errors listed on the "DCGM Diagnostics" page of the NVIDIA DCGM documentation are recoverable without any external action?
