How to monitor GPU performance (usage, memory and Tx/Rx) in real time?

I have an NVIDIA multi-GPU environment and want to monitor performance metrics in real time:

  • GPU usage
  • GPU memory
  • NVLink and PCIe transmit/receive traffic (summed per interval)

In the end, the output should be machine-readable (Prometheus, for example), but I think I can handle that part myself.

Several different applications are running on the machine and I cannot modify their code, so instrumenting them at the CUDA level is not an option.

In my understanding, the following are also not options:

  • DCGM, since it does not expose all Tx/Rx metrics in detail
  • nvprof, since it does not operate in real time and is apparently being deprecated
  • Nsight Compute CLI, since it does not operate in real time and does not show transmit/receive traffic in detail
  • nvidia-smi dmon, since it only shows PCIe, not NVLink

How do I monitor performance?

You can use C (or Python bindings such as pyNVML) with the NVML API - see https://docs.nvidia.com/deploy/nvml-api/index.html (nvml.h is part of the CUDA package).
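
A minimal polling sketch with pyNVML (the nvidia-ml-py package) could look like the code below. GPU utilization, memory and PCIe throughput use standard NVML calls; the NVLink utilization counters are deprecated on newer drivers in favour of field values / DCGM, so treat that part as best-effort:

```python
# Minimal NVML polling sketch using pyNVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)   # percent
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes
            # PCIe throughput over a short sampling window, reported in KB/s
            pcie_tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            pcie_rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            print(f"GPU{i} util={util.gpu}% "
                  f"mem={mem.used // 2**20}/{mem.total // 2**20} MiB "
                  f"pcie_tx={pcie_tx} KB/s pcie_rx={pcie_rx} KB/s")

            # NVLink: cumulative data counters per active link (counter unit 0).
            # May be unsupported/deprecated depending on GPU and driver version.
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    if pynvml.nvmlDeviceGetNvLinkState(h, link) != pynvml.NVML_FEATURE_ENABLED:
                        continue
                    rx, tx = pynvml.nvmlDeviceGetNvLinkUtilizationCounter(h, link, 0)
                    print(f"  link{link} nvlink_rx={rx} nvlink_tx={tx}")
                except pynvml.NVMLError:
                    break  # link not present or counters not supported here
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

To get the "sum of interval" you are after, keep the previous counter values and report the difference between two polls.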

Thank you for this hint!

I now use DCGM for the basic performance metrics. For metrics about NVLink and PCIe activity, I use this exporter:
https://github.com/Beuth-Erdelt/prometheus_nvlink_exporter
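
In case it helps others: the rough idea of such an exporter (not the linked project's actual code, just a sketch with pynvml and prometheus_client, and the metric names are made up) is to poll NVML and publish the readings as Prometheus gauges:

```python
# Sketch: expose NVML readings as Prometheus gauges on http://localhost:8000/metrics.
import time
import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu"])
PCIE_TX = Gauge("gpu_pcie_tx_kb_per_s", "PCIe transmit throughput", ["gpu"])
PCIE_RX = Gauge("gpu_pcie_rx_kb_per_s", "PCIe receive throughput", ["gpu"])

def collect():
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        gpu = str(i)
        GPU_UTIL.labels(gpu=gpu).set(pynvml.nvmlDeviceGetUtilizationRates(h).gpu)
        GPU_MEM_USED.labels(gpu=gpu).set(pynvml.nvmlDeviceGetMemoryInfo(h).used)
        PCIE_TX.labels(gpu=gpu).set(
            pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES))
        PCIE_RX.labels(gpu=gpu).set(
            pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES))

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(8000)   # endpoint for Prometheus to scrape
    try:
        while True:
            collect()
            time.sleep(5)
    finally:
        pynvml.nvmlShutdown()
```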