Restrictions, precautions, and operational notes for using nvidia-smi

Hello,

We are considering a configuration in which nvidia-smi is run to monitor GPU metrics such as temperature, utilization, and power consumption. (OS: RHEL 9)
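
To make the question concrete, the collection we have in mind looks roughly like the sketch below. It is only an illustration under our own assumptions, not a finished design: the helper names `parse_line` and `sample`, the choice of fields, and the 10-second timeout are ours, and values that nvidia-smi cannot report as a number (for example `[N/A]`) are simply mapped to `None`.

```python
import subprocess

# Properties passed to --query-gpu; the selection here is our own assumption.
FIELDS = ["index", "temperature.gpu", "utilization.gpu", "power.draw"]


def parse_line(line):
    """Parse one CSV line of nvidia-smi output into a dict.

    Values that are not numeric (e.g. "[N/A]") become None.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"unexpected field count: {line!r}")
    result = {}
    for name, raw in zip(FIELDS, values):
        try:
            result[name] = float(raw)
        except ValueError:
            result[name] = None
    return result


def sample(timeout_s=10):
    """Run nvidia-smi once and return one dict per GPU.

    Raises subprocess.TimeoutExpired if the command hangs and
    subprocess.CalledProcessError if it exits with a non-zero status.
    The timeout value is a hypothetical placeholder.
    """
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    proc = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return [parse_line(l) for l in proc.stdout.splitlines() if l.strip()]


if __name__ == "__main__":
    for gpu in sample():
        print(gpu)
```

Our monitoring agent would call something like `sample()` at a fixed interval and treat a timeout or non-zero exit status as a monitoring failure, which is why the error behavior asked about below matters to us.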

As input for our operational design, we would like to know whether there are any common restrictions, precautions, or operational considerations that we should be aware of.

Specifically, we would like to know about the following:

  1. Functional and temporary restrictions of nvidia-smi (for example, version and environment dependencies)

  2. Precautions for use (for example, required execution permissions, accuracy of the reported values, and resource overhead)

  3. Points to note in system construction and operational design (for example, behavior when an error occurs and how to handle it)