How to monitor GPU performance (usage, memory and Tx/Rx) in real time?

I have an NVIDIA multi-GPU environment and want to monitor performance metrics in real time:

  • GPU usage
  • GPU memory
  • NVLink and PCIe transmit/receive traffic (summed per interval)

In the end, the output should be machine-readable (Prometheus, for example), but I think I can handle that part myself.

Several different applications are running on the machine and I cannot modify their code, so instrumenting them at the CUDA level is not an option.

In my understanding, the following are also not options:

  • DCGM, since it does not expose all Tx/Rx metrics in detail
  • nvprof, since it does not operate in real time and is apparently being deprecated
  • Nsight Compute CLI, since it does not operate in real time and does not show transmit/receive traffic in detail
  • nvidia-smi dmon, since it only shows PCIe, not NVLink

How do I monitor performance?

You can use C (or Python bindings such as pyNVML) with the NVML API - see https://docs.nvidia.com/deploy/nvml-api/index.html (nvml.h is part of the CUDA package).
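
A minimal polling sketch with pyNVML (the nvidia-ml-py package) could look like the code below. GPU utilization, memory and PCIe throughput use standard NVML calls; the NVLink utilization counters are deprecated on newer drivers in favour of field values / DCGM, so treat that part as best-effort:

```python
# Minimal NVML polling sketch using pyNVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)   # percent
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes
            # PCIe throughput over a short sampling window, reported in KB/s
            pcie_tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            pcie_rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            print(f"GPU{i} util={util.gpu}% "
                  f"mem={mem.used // 2**20}/{mem.total // 2**20} MiB "
                  f"pcie_tx={pcie_tx} KB/s pcie_rx={pcie_rx} KB/s")

            # NVLink: cumulative data counters per active link (counter unit 0).
            # May be unsupported/deprecated depending on GPU and driver version.
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    if pynvml.nvmlDeviceGetNvLinkState(h, link) != pynvml.NVML_FEATURE_ENABLED:
                        continue
                    rx, tx = pynvml.nvmlDeviceGetNvLinkUtilizationCounter(h, link, 0)
                    print(f"  link{link} nvlink_rx={rx} nvlink_tx={tx}")
                except pynvml.NVMLError:
                    break  # link not present or counters not supported here
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

To get the "sum of interval" you are after, keep the previous counter values and report the difference between two polls.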

Thank you for this hint!

I now use DCGM for the basic performance metrics. For metrics about NVLink and PCIe activity, I use this exporter:
https://github.com/Beuth-Erdelt/prometheus_nvlink_exporter
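
In case it helps others: the rough idea of such an exporter (not the linked project's actual code, just a sketch with pynvml and prometheus_client, and the metric names are made up) is to poll NVML and publish the readings as Prometheus gauges:

```python
# Sketch: expose NVML readings as Prometheus gauges on http://localhost:8000/metrics.
import time
import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu"])
PCIE_TX = Gauge("gpu_pcie_tx_kb_per_s", "PCIe transmit throughput", ["gpu"])
PCIE_RX = Gauge("gpu_pcie_rx_kb_per_s", "PCIe receive throughput", ["gpu"])

def collect():
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        gpu = str(i)
        GPU_UTIL.labels(gpu=gpu).set(pynvml.nvmlDeviceGetUtilizationRates(h).gpu)
        GPU_MEM_USED.labels(gpu=gpu).set(pynvml.nvmlDeviceGetMemoryInfo(h).used)
        PCIE_TX.labels(gpu=gpu).set(
            pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES))
        PCIE_RX.labels(gpu=gpu).set(
            pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES))

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(8000)   # endpoint for Prometheus to scrape
    try:
        while True:
            collect()
            time.sleep(5)
    finally:
        pynvml.nvmlShutdown()
```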