I am working with a Tesla K20 on a particular project and must monitor GPU performance as a percentage of usage. Fan speed and temperature would also be nice but are not required. I contacted NVIDIA support about this and they said that there are no tools for monitoring the performance. Surely the engineers who built this must have a monitoring utility to measure performance, so I know there has to be something out there. Does anyone know of any utilities that I can use? From what my boss has told me, the Quadro monitoring tools such as MSI Afterburner will not work. Thanks!
I would advise looking into the GPU Deployment Kit, which includes the NVIDIA Management Library (NVML). NVML is C-based and is used by nvidia-smi to obtain detailed information about the GPUs. nvidia-smi is the key component here and can give you the needed information via the command line (see the attached PDF of commands). I have listed links for each of the above terms below. With nvidia-smi you should be able to acquire the following information about your K20s: ECC error counts, GPU utilization, active compute processes, clocks, PState, temperature, fan speed, power management, and board identification.
If you are using Windows with the latest drivers, nvidia-smi should be located in C:\Program Files\NVIDIA Corporation\NVSMI.
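As a rough sketch of how the nvidia-smi output could be polled from a script, the Python below runs nvidia-smi with the `--query-gpu` and `--format=csv,noheader,nounits` options and parses one line per GPU. The helper names `parse_gpu_csv` and `query_gpus` are made up for illustration, and it assumes your driver's nvidia-smi supports `--query-gpu` (check `nvidia-smi --help-query-gpu` for the fields available on your system). Values are kept as strings, because a field the board does not report (fan speed on a passively cooled card, for example) comes back as a placeholder rather than a number.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; see `nvidia-smi --help-query-gpu`
# for the list your driver version actually supports.
FIELDS = ["utilization.gpu", "temperature.gpu", "fan.speed"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        values = [value.strip() for value in record]
        if len(values) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus(nvidia_smi="nvidia-smi"):
    """Run nvidia-smi once and return the parsed readings (hypothetical helper)."""
    cmd = [
        nvidia_smi,
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On Windows, if nvidia-smi is not on the PATH, you can pass the full path to the executable in the folder mentioned above as the `nvidia_smi` argument. Calling `query_gpus()` in a loop with a short sleep gives a simple utilization log.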
GPU Deployment Kit = https://developer.nvidia.com/gpu-deployment-kit
Nvidia Management Library “NVML” = https://developer.nvidia.com/nvidia-management-library-nvml
Nvidia-SMI = https://developer.nvidia.com/nvidia-system-management-interface
Alternatively, depending on the server manufacturer, you may be able to use the OEM tools to gather basic information such as temperature and fan speed from the GPUs. For example, with some of the newer SMC servers it is possible to view the current status of the GPU from the out-of-band IPMI management portal.
nvidia-smi.331.38.pdf (49.4 KB)