Hello,
We are considering a setup that runs nvidia-smi to monitor GPU temperature, utilization, power consumption, and similar metrics (OS: RHEL 9).
To inform our operational design, we would like to know of any common limitations, precautions, or operational considerations we should be aware of.
Specifically, the following:
- Functional or temporary limitations of nvidia-smi (version and environment dependencies, etc.)
- Precautions for use (execution permissions, accuracy of the reported values, resource load, etc.)
- Points to note in system construction and operational design (behavior when an error occurs, how to handle it, etc.)
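
For concreteness, the kind of polling we have in mind could look roughly like the sketch below. It is an illustration only, not a final design: the selected fields, the 10-second timeout, and the helper names (`parse_gpu_csv`, `to_float`, `poll_once`) are all assumptions made for this example. It calls nvidia-smi with `--query-gpu` and `--format=csv,noheader,nounits`, treats any failure or timeout as "no data", and maps non-numeric values to `None`.

```python
import csv
import io
import subprocess

# Fields requested through nvidia-smi's --query-gpu option (illustrative choice).
FIELDS = ["index", "temperature.gpu", "utilization.gpu", "power.draw"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts, one per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def to_float(value):
    """Return a float, or None for non-numeric values such as '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def poll_once(timeout_s=10):
    """Run nvidia-smi once; return parsed rows, or None on failure or timeout."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing binary, a non-zero exit code, and a timeout.
        return None
    return parse_gpu_csv(result.stdout)
```

In this sketch a scheduler (cron, a systemd timer, or a monitoring agent) would call `poll_once()` at a fixed interval and raise an alert when it returns `None` several times in a row. Whether that is a sensible way to handle errors is part of what we are asking.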