0% volatile GPU-util

Dear All,

I am running some jobs on GPUs, but they are too slow. I checked with nvidia-smi and found that the volatile GPU-util is zero and the usage is also very low. The full output is below. Could anyone help me out? Thank you!

Fri Sep  3 12:02:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 27%   33C    P8     6W / 180W |   1040MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
| 27%   33C    P8     6W / 180W |   1040MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:82:00.0 Off |                  N/A |
| 27%   34C    P8     6W / 180W |   1040MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:83:00.0 Off |                  N/A |
| 27%   31C    P8     6W / 180W |   1040MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1586      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    172887      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    172888      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    197411      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    197412      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    207301      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    207302      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    207542      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    207543      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    207839      C   lmp_mpi_gpu_12Aug                 103MiB |
|    0   N/A  N/A    207840      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A      1586      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    172889      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    172890      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    197413      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    197414      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    207303      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    207304      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    207544      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    207545      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    207841      C   lmp_mpi_gpu_12Aug                 103MiB |
|    1   N/A  N/A    207842      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A      1586      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    172891      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    172892      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    197415      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    197416      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    207305      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    207306      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    207546      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    207547      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    207843      C   lmp_mpi_gpu_12Aug                 103MiB |
|    2   N/A  N/A    207844      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A      1586      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    172893      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    172894      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    197417      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    197418      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    207307      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    207308      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    207548      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    207550      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    207845      C   lmp_mpi_gpu_12Aug                 103MiB |
|    3   N/A  N/A    207846      C   lmp_mpi_gpu_12Aug                 103MiB |
+-----------------------------------------------------------------------------+

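A GPU-Util of 0%, as shown here, means that no kernel was executing on the GPU during the sampling period, i.e. the GPU is not in use. The P8 performance state and 6 W power draw on all four GPUs point the same way. That would explain why your jobs are running slowly, assuming they have GPU-accelerated components.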
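A single nvidia-smi snapshot can miss short bursts of activity, so it may be worth sampling utilization repeatedly while the job runs. The following is only a minimal sketch in Python, assuming nvidia-smi is on the PATH; the helper names (`parse_gpu_csv`, `idle_gpus`, `sample`, `watch`) and the 5% threshold are made up for illustration and are not part of any NVIDIA tool.

```python
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
QUERY = "index,utilization.gpu,memory.used,pstate"


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem, pstate = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": int(util),   # percent of the sample period with a kernel running
            "mem_mib": int(mem),     # memory in use, MiB
            "pstate": pstate,        # P0 (max performance) .. P12 (idle)
        })
    return rows


def idle_gpus(rows, threshold_pct=5):
    """Return indices of GPUs below the (arbitrary, illustrative) threshold."""
    return [row["index"] for row in rows if row["util_pct"] < threshold_pct]


def sample():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def watch(samples=10, interval_s=1.0):
    """Print a few consecutive samples so that brief activity is not missed."""
    for _ in range(samples):
        rows = sample()
        print(rows, "idle:", idle_gpus(rows))
        time.sleep(interval_s)
```

Calling `watch()` while the job is running prints one line per sample. If every GPU stays in the idle list for the whole run, the application is most likely not offloading any work to the GPUs.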

FWIW, the nvidia-smi summary is somewhat confusingly formatted. In the right-hand box, the first line holds a single item, Volatile Uncorr. ECC, while the second line holds two items, GPU-Util and Compute M. So "volatile" belongs to the ECC error count, not to GPU-Util. Here, the ECC error count is N/A because GeForce GPUs do not support ECC, and the compute mode is Default.

The volatile ECC error count is the error count since the driver was last loaded; the GPU can also keep a persistent count (reported by nvidia-smi as the aggregate count) that accumulates until the errors are explicitly cleared. The latter is useful when one wants to track ECC error rates long-term (days, months).
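For anyone who wants to read both counters programmatically, here is a rough sketch along the same lines. It assumes the `ecc.errors.uncorrected.volatile.total` and `ecc.errors.uncorrected.aggregate.total` query fields listed by `nvidia-smi --help-query-gpu`; the function names are hypothetical, and any non-numeric value (such as the `[N/A]` printed for GPUs without ECC) is simply mapped to `None`.

```python
import subprocess

# Volatile = since the driver was last loaded; aggregate = persistent until cleared.
ECC_QUERY = (
    "index,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


def parse_count(field):
    """Return the counter as an int, or None when the GPU reports no number."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_ecc_csv(text):
    """Map GPU index -> {'volatile': ..., 'aggregate': ...} from CSV output."""
    counts = {}
    for line in text.strip().splitlines():
        index, volatile, aggregate = line.split(",")
        counts[int(index)] = {
            "volatile": parse_count(volatile),
            "aggregate": parse_count(aggregate),
        }
    return counts


def read_ecc_counts():
    """Query nvidia-smi once (assumes it is on the PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={ECC_QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ecc_csv(out)
```

On the GeForce cards in this thread, both values would be expected to come back as `None`, matching the N/A shown in the summary table.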