Nvidia nvmlInit() blocks simultaneous calls

This issue appears when multiple GPU applications are running and they call nvmlInit() from the NVIDIA NVML library simultaneously.

The symptom is that GPU applications hang in the nvmlInit() call for a while.

How to reproduce?
The delay can be seen by running a few hundred “time nvidia-smi &” commands simultaneously on one GPU node:
time nvidia-smi & time nvidia-smi & time nvidia-smi & …

Example test with 200 simultaneous runs.
Result: the first run takes 1.646 s, but the last takes over 12 seconds.

The outputs of the simultaneous runs interleave on the console; one run’s output, cleaned up:

[root@supp20 ~]#
Mon Nov  1 09:40:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   29C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

real 0m1.646s
user 0m0.003s
sys 0m0.986s

//The last “time nvidia-smi” takes 12 seconds
real 0m12.057s
user 0m0.006s
sys 0m0.568s
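
The same stall can be reproduced without nvidia-smi by timing nvmlInit() directly from a small C program. A minimal sketch, assuming the NVML development headers and libnvidia-ml are installed (file name and build line are illustrative):

/* time_nvmlinit.c - measure how long nvmlInit() takes under contention.
 * Build (adjust include/library paths for your installation):
 *   gcc -O2 -o time_nvmlinit time_nvmlinit.c -lnvidia-ml
 */
#include <stdio.h>
#include <time.h>
#include <nvml.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    nvmlReturn_t rc = nvmlInit();          /* this is the call that blocks */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("nvmlInit() returned %s after %.3f s\n", nvmlErrorString(rc), secs);

    if (rc == NVML_SUCCESS)
        nvmlShutdown();
    return rc == NVML_SUCCESS ? 0 : 1;
}

Running a few hundred copies of this binary concurrently, the same way as the nvidia-smi test above, should show the later instances reporting progressively longer nvmlInit() times.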

This issue prevents IBM Spectrum LSF from using NVIDIA GPUs properly and has become very urgent for our customers.
Any advice or solutions from NVIDIA would be appreciated.

You need to enable persistence mode, so that the driver remains loaded, by running “sudo nvidia-smi -pm 1” at system start-up. That should make each subsequent nvidia-smi run faster.

Thanks Brent.

The real fix I’m looking for is for my application. I see my GPU application block in nvmlInit() for over 50 seconds when I run multiple nvidia-smi instances at the same time as my single application.

How can I enable persistence mode for my application so that outside apps do not interfere with its nvmlInit() call?

Do we have to run all GPU apps in persistence mode to stop them interfering with each other, and if so, how?

Did you try enabling persistence mode from another console before running your applications? Persistence mode is a global system setting. Please try it out.

Thanks Brent. I’m not sure if you know about IBM Spectrum LSF: LSF dispatches and starts user jobs on GPU nodes, and LSF daemons also check the GPUs frequently. The current approach, such as “sudo nvidia-smi -pm 1”, seems to require a wrapper around every GPU app. I hope there is an API that lets apps run in persistence mode. Even if there is such an API, users may not care about it, since they don’t know their apps can block other people’s.

I will feed back your idea about persistence mode to our dev team.
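
For reference, NVML does expose a call for setting persistence mode programmatically, nvmlDeviceSetPersistenceMode(). It requires root privileges and, as noted above, applies to the GPU globally rather than per application. A minimal sketch, assuming the NVML development headers are installed (file name and build line are illustrative):

/* set_pm.c - enable persistence mode on every GPU via NVML.
 * Requires root; equivalent to "nvidia-smi -pm 1".
 * Build: gcc -O2 -o set_pm set_pm.c -lnvidia-ml
 */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;
        /* Persistence mode is a per-device (system-wide) setting, not per-process. */
        nvmlReturn_t rc = nvmlDeviceSetPersistenceMode(dev, NVML_FEATURE_ENABLED);
        printf("GPU %u: %s\n", i, nvmlErrorString(rc));
    }

    nvmlShutdown();
    return 0;
}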

What do you mean by “persistence mode”?

My program (IBM Spectrum LSF) has a thread that collects GPU information every 5 seconds, and the library is opened and closed in every round.
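
A minimal sketch of that pattern (the specific NVML queries shown are illustrative, not LSF’s actual code):

/* Each round initializes and shuts down NVML, so every round pays the
 * full nvmlInit() cost and can block behind other initializing processes.
 * Build: gcc -O2 -o gpu_poll gpu_poll.c -lnvidia-ml
 */
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

static void collect_gpu_info(void)
{
    if (nvmlInit() != NVML_SUCCESS)      /* may stall here under contention */
        return;

    unsigned int count = 0;
    if (nvmlDeviceGetCount(&count) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlUtilization_t util;
            if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
                nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                printf("GPU %u: %u%% busy\n", i, util.gpu);
        }
    }

    nvmlShutdown();                      /* library is closed again each round */
}

int main(void)
{
    for (;;) {
        collect_gpu_info();
        sleep(5);                        /* collection interval */
    }
}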

To work around this issue, we increased the interval from 5 to 20 seconds.