Nvidia-smi really slow to execute

The nvidia-smi command has recently become really slow to execute. In the example below, it takes 1m45s to run:

# time nvidia-smi
Thu Jan  7 17:48:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   68C    P0   167W / 300W |  30578MiB / 32510MiB |     59%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   50C    P0   108W / 300W |  30576MiB / 32510MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    65W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   59C    P0    89W / 300W |  22868MiB / 32510MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12808      C   python3                                    30565MiB |
|    1     12023      C   python3                                    30563MiB |
|    3     32012      C   python3                                    22855MiB |
+-----------------------------------------------------------------------------+

real    1m45.791s
user    0m0.000s
sys     1m45.736s

Since this environment is relatively new to us, I'd appreciate a pointer on how to start debugging this issue.

I've attached the output of nvidia-bug-report.sh.

nvidia-bug-report.log.gz (3.5 MB)

Thanks in advance for any help.

Emmanuel

You'll need to have the persistence daemon (nvidia-persistenced) started on boot and kept running continuously. Otherwise the driver gets unloaded and the Teslas are deinitialized, so every time nvidia-smi is run it has to perform a full reinitialization. Failing to run nvidia-persistenced may also lead to more serious issues, like GPUs crashing, depending on the workload.
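For example, on a systemd-based distro whose driver packaging provides the nvidia-persistenced unit (an assumption; the unit name and availability vary by distro and install method), something along these lines should keep the daemon running across reboots:

# Assumes the NVIDIA driver packaging ships an nvidia-persistenced systemd unit;
# names may differ on your system.
systemctl enable --now nvidia-persistenced    # start now and at every boot
systemctl status nvidia-persistenced          # verify the daemon is active
nvidia-smi -q | grep -i persistence           # should now report persistence mode as enabled

On systems where the daemon isn't packaged, the legacy per-GPU persistence mode (nvidia-smi -pm 1) has a similar effect, but it does not survive a reboot on its own and NVIDIA considers it deprecated in favor of the daemon.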

Addendum: if nvidia-smi is still slow despite nvidia-persistenced running, try the forum search; there was a bug with certain driver versions that caused this, and there's a thread about it.