Nvidia nvmlInit() blocks simultaneous calls

This issue appears when multiple GPU applications are running and they call nvmlInit() from the NVIDIA NVML library simultaneously.

The symptom is that GPU applications hang in the nvmlInit() call for a while.

How to reproduce?
The delay can be seen by running a few hundred “time nvidia-smi &” commands simultaneously on one GPU node:
time nvidia-smi & time nvidia-smi & time nvidia-smi & …

Example test with 200 simultaneous runs.
Result: the first run takes 1.646 s, but the last takes over 12 seconds.

The outputs of the simultaneous runs interleave on the console; one run’s output, cleaned up:

[root@supp20 ~]#
Mon Nov  1 09:40:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   29C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

real 0m1.646s
user 0m0.003s
sys 0m0.986s

//The last “time nvidia-smi” takes 12 seconds
real 0m12.057s
user 0m0.006s
sys 0m0.568s
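
The same stall can be reproduced without nvidia-smi by timing nvmlInit() directly from a small C program. A minimal sketch, assuming the NVML development headers and libnvidia-ml are installed (file name and build line are illustrative):

/* time_nvmlinit.c - measure how long nvmlInit() takes under contention.
 * Build (adjust include/library paths for your installation):
 *   gcc -O2 -o time_nvmlinit time_nvmlinit.c -lnvidia-ml
 */
#include <stdio.h>
#include <time.h>
#include <nvml.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    nvmlReturn_t rc = nvmlInit();          /* this is the call that blocks */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("nvmlInit() returned %s after %.3f s\n", nvmlErrorString(rc), secs);

    if (rc == NVML_SUCCESS)
        nvmlShutdown();
    return rc == NVML_SUCCESS ? 0 : 1;
}

Running a few hundred copies of this binary concurrently, the same way as the nvidia-smi test above, should show the later instances reporting progressively longer nvmlInit() times.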

This issue prevents IBM Spectrum LSF from using NVIDIA GPUs properly and has become very urgent for our customers.
Any advice or solutions from NVIDIA would be appreciated.

You need to enable persistence mode, so that the driver remains loaded, by running “sudo nvidia-smi -pm 1” at system start-up. That should make each subsequent nvidia-smi run faster.

Thanks Brent.

The real fix I’m looking for is for my application. I see my GPU application block in nvmlInit() for over 50 seconds when I run multiple nvidia-smi instances at the same time as my single application.

How can I enable persistence mode for my application so that outside apps do not interfere with its nvmlInit() call?

Do we have to run all GPU apps in persistence mode to stop them interfering with each other, and if so, how?

Did you try enabling persistence mode from another console before running your applications? Persistence mode is a global system setting. Please try it out.

Thanks Brent. I’m not sure if you know about IBM Spectrum LSF: LSF dispatches and starts user jobs on GPU nodes, and LSF daemons also check the GPUs frequently. The current approach, such as “sudo nvidia-smi -pm 1”, seems to require a wrapper around every GPU app. I hope there is an API that lets apps run in persistence mode. Even if there is such an API, users may not care about it, since they don’t know their apps can block other people’s.

I will feed back your idea about persistence mode to our dev team.
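
For reference, NVML does expose a call for setting persistence mode programmatically, nvmlDeviceSetPersistenceMode(). It requires root privileges and, as noted above, applies to the GPU globally rather than per application. A minimal sketch, assuming the NVML development headers are installed (file name and build line are illustrative):

/* set_pm.c - enable persistence mode on every GPU via NVML.
 * Requires root; equivalent to "nvidia-smi -pm 1".
 * Build: gcc -O2 -o set_pm set_pm.c -lnvidia-ml
 */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;
        /* Persistence mode is a per-device (system-wide) setting, not per-process. */
        nvmlReturn_t rc = nvmlDeviceSetPersistenceMode(dev, NVML_FEATURE_ENABLED);
        printf("GPU %u: %s\n", i, nvmlErrorString(rc));
    }

    nvmlShutdown();
    return 0;
}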

What do you mean by “persistence mode”?

My program (IBM Spectrum LSF) has a thread that collects GPU information every 5 seconds, and the library is opened and closed in every round.
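
A minimal sketch of that pattern (the specific NVML queries shown are illustrative, not LSF’s actual code):

/* Each round initializes and shuts down NVML, so every round pays the
 * full nvmlInit() cost and can block behind other initializing processes.
 * Build: gcc -O2 -o gpu_poll gpu_poll.c -lnvidia-ml
 */
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

static void collect_gpu_info(void)
{
    if (nvmlInit() != NVML_SUCCESS)      /* may stall here under contention */
        return;

    unsigned int count = 0;
    if (nvmlDeviceGetCount(&count) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlUtilization_t util;
            if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
                nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                printf("GPU %u: %u%% busy\n", i, util.gpu);
        }
    }

    nvmlShutdown();                      /* library is closed again each round */
}

int main(void)
{
    for (;;) {
        collect_gpu_info();
        sleep(5);                        /* collection interval */
    }
}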

To work around this issue, we increased the interval from 5 to 20 seconds.