Setting Up GPU Telemetry with NVIDIA Data Center GPU Manager

Originally published at: Setting Up GPU Telemetry with NVIDIA Data Center GPU Manager | NVIDIA Technical Blog

Understanding GPU usage provides important insights for IT administrators managing a data center. Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation, diagnose anomalies, and increase overall data center efficiency. NVIDIA Data Center GPU Manager (DCGM) offers a comprehensive tool suite to simplify administration and monitoring of NVIDIA Tesla-accelerated data…

Is it possible to install it on a cloud GPU instance?

Yes

I'm trying to run the proof-of-concept Python files from the blog. All the relevant prerequisites, DCGM and nv-hostengine along with Prometheus, are running on a VM.

The Python files fail to run and complain about a missing module. I can't find this module, and it doesn't appear to be installable via pip either.

What are the ways to get DCGM metrics so that I can display GPU metrics in Grafana via Prometheus?

$ python inheritance_reader_example.py
Traceback (most recent call last):
  File "inheritance_reader_example.py", line 2, in <module>
    from DcgmReader import DcgmReader
ImportError: No module named 'DcgmReader'

$ python dictionary_reader_example.py
Traceback (most recent call last):
  File "dictionary_reader_example.py", line 2, in <module>
    from DcgmReader import DcgmReader
ImportError: No module named 'DcgmReader'

You likely need to add the directory containing DcgmReader.py to your PYTHONPATH environment variable.

For example, on my system, DcgmReader.py is in /usr/src/dcgm/bindings, so I would run:
PYTHONPATH=/usr/src/dcgm/bindings python dictionary_reader_example.py
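
If you would rather not set PYTHONPATH on every invocation, roughly the same effect can be had from inside a script by adding the bindings directory to sys.path before the import. The following is only a sketch: the helper names and the DCGM_BINDINGS_DIR override variable are made up for illustration, and /usr/src/dcgm/bindings is just the location on my system, so yours may differ.

```python
import os
import sys

# Places to look for DcgmReader.py. Only /usr/src/dcgm/bindings comes from
# this thread; DCGM_BINDINGS_DIR is a hypothetical override, not a DCGM setting.
CANDIDATE_DIRS = [
    os.environ.get("DCGM_BINDINGS_DIR", ""),
    "/usr/src/dcgm/bindings",
]


def find_bindings_dir(candidates=CANDIDATE_DIRS, module_file="DcgmReader.py"):
    """Return the first candidate directory containing module_file, or None."""
    for directory in candidates:
        if directory and os.path.isfile(os.path.join(directory, module_file)):
            return directory
    return None


def add_bindings_to_path(candidates=CANDIDATE_DIRS):
    """Prepend the bindings directory to sys.path; return it, or None if not found."""
    directory = find_bindings_dir(candidates)
    if directory is not None and directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


if __name__ == "__main__":
    found = add_bindings_to_path()
    if found is None:
        print("DcgmReader.py not found; set PYTHONPATH manually")
    else:
        print("Using DCGM bindings from", found)
```

Call add_bindings_to_path() at the top of the example script, before the `from DcgmReader import DcgmReader` line. If it returns None, the bindings are somewhere else on your machine and you will need to locate DcgmReader.py yourself.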