I have an NVIDIA multi-GPU environment running and want to monitor performance metrics in real time:
- GPU usage
- GPU memory
- NVLink and PCIe traffic (Tx/Rx, summed per sampling interval)
Eventually the output should be machine-readable (for Prometheus, for example), but I think I can handle that part myself.
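For reference, here is a minimal sketch of the output I am aiming for. It assumes the Tx/Rx values can be read as cumulative byte counters, which may not hold for every source. The metric names and the `interval_delta` and `render_prometheus` helpers are my own placeholders, not part of any NVIDIA tool; where the numbers come from is exactly the part I am missing.

```python
from typing import Dict

# One sample per GPU: metric name -> value.
# The metric names used with this type are hypothetical placeholders.
Sample = Dict[str, float]


def interval_delta(prev: int, curr: int) -> int:
    """Bytes moved during one interval, from two readings of a cumulative counter."""
    # A reading lower than the previous one is treated as a counter reset,
    # so the current value is taken as the traffic since the reset.
    return curr - prev if curr >= prev else curr


def render_prometheus(samples: Dict[int, Sample]) -> str:
    """Render per-GPU samples in the Prometheus text exposition format."""
    lines = []
    # Group by metric name so each metric gets a single TYPE line.
    names = sorted({name for sample in samples.values() for name in sample})
    for name in names:
        lines.append(f"# TYPE {name} gauge")
        for gpu in sorted(samples):
            if name in samples[gpu]:
                lines.append(f'{name}{{gpu="{gpu}"}} {samples[gpu][name]}')
    return "\n".join(lines) + "\n"
```

Calling `render_prometheus({0: {"gpu_utilization_percent": 87.0}})` would give one `# TYPE` line followed by one labelled sample line per GPU, which a Prometheus scrape endpoint could serve as-is.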
Several different applications run on the machine, and I cannot modify their code, so instrumenting them at the CUDA level is not an option.
As far as I understand, the following are not options either:
- DCGM, since it does not report all Tx/Rx traffic in detail
- nvprof, since it does not operate in real time and will apparently be deprecated
- Nsight Compute CLI, since it does not operate in real time and does not report Tx/Rx traffic in detail
- nvidia-smi dmon, since it only shows PCIe traffic, not NVLink
How can I monitor these metrics in real time?