Persistent Kernel does not work properly on some GPUs

Robert_Crovella · August 25, 2023, 8:11pm

This appears to be related to CUDA lazy module loading. The topic is covered in a few places, including here and here. Note that the changes were initially introduced in the CUDA 11.7 timeframe as an opt-in and then became default in the CUDA 12.2 timeframe. So one possible source of differences in observations may be the version of CUDA in each case.

Briefly, for the purpose of this discussion, a module can be thought of as the code for a kernel. In some cases, loading that kernel code may require a synchronization operation on the GPU. Synchronization basically means that all code execution activity must stop before the sync op can complete. Familiar operations that may introduce such synchronization include cudaMalloc and related calls.

Prior to the changes introduced in the CUDA 11.7 timeframe, and with the CUDA runtime API in view, module loading would typically be accomplished all at once at the point of CUDA initialization, which is typically the first CUDA runtime API call in your program/process. With all the modules already loaded, there would be no need for a synchronization to load any module later.

After CUDA 11.7 (opt-in) or CUDA 12.2 (default), module loading would not necessarily all be performed at once, at the beginning. Instead, some of it could be done in a “lazy” fashion, which we can interpret for this discussion as meaning “on-demand”. So the module for a kernel might get loaded the first time you call that kernel.

If that module load requires a synchronization, then all GPU execution activity in that context/process must stop, in order for that kernel to load, and subsequently run.

So in the case we have here, the first kernel (persistent) starts to run. The second kernel must begin running to avoid a hang, because we have a stream sync operation later on the stream that kernel is launched into. With lazy loading, this particular 2nd kernel seems to require a sync to load the module. But the sync waits forever because the first launched kernel “never stops”. As a result, we get a hang. The first kernel never stops, the sync is required at the point of the second kernel launch due to lazy loading, and the sync never completes. Since the sync never completes, the second kernel never starts running, so it never completes, so we hang at the stream sync point.

I won’t be able to argue the merits of this. There are certainly some possible benefits to lazy module loading. Other viewpoints are probably valid also.

At the CUDA 11.7 point, you had to opt-in to this behavior. At the CUDA 12.2 point, you have to opt-out of this behavior.

You can opt-out of this behavior by using a CUDA environment variable with your application launch:

CUDA_MODULE_LOADING=EAGER ./my_app

According to my testing, this fixes the issue for this case.
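
If you want to confirm which mode a process actually ended up in, the driver API has a query for it. Here is a minimal sketch, assuming CUDA 11.7 or newer and linking against the driver library (-lcuda); the helper name query_module_loading_mode is my own, not something from the toolkit:

```cuda
#include <cuda.h>  // CUDA driver API; link with -lcuda

// Hypothetical helper: reports the module loading mode of the calling process.
// Returns "lazy", "eager", or "unknown" if the driver query fails.
const char* query_module_loading_mode() {
    // The driver must be initialized before the mode can be queried.
    if (cuInit(0) != CUDA_SUCCESS) return "unknown";

    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) != CUDA_SUCCESS) return "unknown";

    return (mode == CU_MODULE_LAZY_LOADING) ? "lazy" : "eager";
}
```

Calling this from main and printing the result, once with and once without CUDA_MODULE_LOADING=EAGER set, should show whether the env var took effect.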

The workaround that I initially pointed out works because:

- Reversing the order of the kernel launch forces the “OneTimeKernel” to module-load. Thereafter it is loaded. This kernel completes in a short amount of time, normally.
- The “PersistentKernel” may also need a module load, but this is OK because we can complete the sync process: the “OneTimeKernel” will finish.
- Once modules for both kernels are loaded, there is no longer any interaction with the module loading system, and so subsequent persistent/concurrent activity works “as expected”.

You can emulate eager loading, without use of the env var, by using cudaFuncGetAttributes() on all needed kernels, prior to entering any concurrent execution areas.
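
As a rough sketch of that preloading idea: the code below is not the original poster's code. The kernel bodies, the stop flag in mapped pinned memory, and the run_demo function name are all assumptions made for illustration; only the kernel names and the use of cudaFuncGetAttributes() come from the discussion above.

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for the persistent kernel: spins until the host sets the flag.
__global__ void PersistentKernel(volatile int* stop_flag) {
    while (*stop_flag == 0) { }
}

// Illustrative stand-in for the short kernel launched while the first one is running.
__global__ void OneTimeKernel(volatile int* out) {
    *out = 42;
}

// Hypothetical demo function: returns 0 on success, 1 on any failure.
int run_demo() {
    // Touch both kernels up front so their modules are loaded
    // before any persistent kernel is occupying the GPU.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, PersistentKernel) != cudaSuccess) return 1;
    if (cudaFuncGetAttributes(&attr, OneTimeKernel) != cudaSuccess) return 1;

    // Do all allocation and stream creation before the persistent launch too,
    // since these calls may also synchronize.
    volatile int *h_flag = nullptr, *h_out = nullptr;
    if (cudaHostAlloc((void**)&h_flag, sizeof(int), cudaHostAllocMapped) != cudaSuccess) return 1;
    if (cudaHostAlloc((void**)&h_out, sizeof(int), cudaHostAllocMapped) != cudaSuccess) return 1;
    *h_flag = 0;
    *h_out = 0;

    int *d_flag = nullptr, *d_out = nullptr;
    cudaHostGetDevicePointer((void**)&d_flag, (void*)h_flag, 0);
    cudaHostGetDevicePointer((void**)&d_out, (void*)h_out, 0);

    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    PersistentKernel<<<1, 1, 0, s1>>>(d_flag);
    OneTimeKernel<<<1, 1, 0, s2>>>(d_out);

    // Without the preload, this is the stream sync that could wait forever.
    cudaError_t err = cudaStreamSynchronize(s2);

    // Release the persistent kernel and wait for it to exit.
    *h_flag = 1;
    if (cudaStreamSynchronize(s1) != cudaSuccess) err = cudaErrorUnknown;

    int ok = (err == cudaSuccess && *h_out == 42) ? 0 : 1;

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFreeHost((void*)h_flag);
    cudaFreeHost((void*)h_out);
    return ok;
}
```

Removing the two cudaFuncGetAttributes() calls should give something close to the hanging pattern described earlier when lazy loading is in effect, although whether it actually hangs may depend on the GPU, driver, and CUDA version.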

Lazy loading now has its own full section in the programming guide.
