Persistent Kernel does not work properly on some GPUs

This appears to be related to CUDA lazy module loading. The topic is covered in a few places, including here and here. Note that lazy loading was introduced as an opt-in feature in the CUDA 11.7 timeframe and became the default in the CUDA 12.2 timeframe, so one possible source of differing observations is the CUDA version used in each case.

Briefly, for the purpose of this discussion, a module can be thought of as the code for a kernel. In some cases, loading that kernel code may require a synchronization operation on the GPU. Synchronization basically means that all code execution activity must stop before the sync op can complete. Familiar examples of operations that may introduce such a synchronization include cudaMalloc and related calls.

Prior to the changes introduced in the CUDA 11.7 timeframe, and with the CUDA runtime API in view, module loading would typically be accomplished all at once at the point of CUDA initialization, which is typically the first CUDA runtime API call in your program/process. With all the modules already loaded, no synchronization would be needed later to load a module.

After CUDA 11.7 (opt-in) or CUDA 12.2 (default), module loading is not necessarily all performed at once, at the beginning. Instead, some of it can be done in a “lazy” fashion, which we can interpret for this discussion as meaning “on-demand”. So the module for a kernel might get loaded the first time you call that kernel.
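To check which loading mode a given process is actually running under, something like the following minimal sketch may help. It assumes CUDA 11.7 or later and uses the driver API call cuModuleGetLoadingMode(), so it needs to be linked against the driver library (for example with -lcuda).

```cuda
#include <cstdio>
#include <cuda.h>

int main() {
    // The driver API must be initialized before the mode can be queried.
    if (cuInit(0) != CUDA_SUCCESS) {
        std::fprintf(stderr, "cuInit failed\n");
        return 1;
    }

    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) != CUDA_SUCCESS) {
        std::fprintf(stderr, "cuModuleGetLoadingMode failed\n");
        return 1;
    }

    // Report whether kernels are loaded on demand or up front.
    std::printf("CUDA module loading mode: %s\n",
                mode == CU_MODULE_LAZY_LOADING ? "lazy" : "eager");
    return 0;
}
```

Running this with and without the module-loading environment variable set should show whether the setting took effect for that process.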

If that module load requires a synchronization, then all GPU execution activity in that context/process must stop, in order for that kernel to load, and subsequently run.

So in the case we have here, the first kernel (persistent) starts to run. The second kernel must also begin running to avoid a hang, because later on we have a stream sync operation on the stream that kernel is launched into. With lazy loading, this particular second kernel seems to require a sync in order to load its module, and that sync is needed at the point of the second kernel launch. But the sync waits forever, because the first kernel “never stops”. Since the sync never completes, the second kernel never starts running, so it never completes, and we hang at the stream sync point.

I won’t be able to argue the merits of this. There are certainly some possible benefits to lazy module loading. Other viewpoints are probably valid also.

At the CUDA 11.7 point, you had to opt in to this behavior. At the CUDA 12.2 point, you have to opt out of this behavior.

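You can opt out of this behavior by setting the CUDA_MODULE_LOADING environment variable to EAGER when you launch your application (./my_app below is only a placeholder for your executable):

    CUDA_MODULE_LOADING=EAGER ./my_app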

According to my testing, this fixes the issue for this case.

The workaround that I initially pointed out works because:

  1. Reversing the order of the kernel launches forces the “OneTimeKernel” module to load first. Thereafter it stays loaded. This kernel normally completes in a short amount of time.
  2. The “PersistentKernel” may also need a module load, but this is OK because the sync process can complete: the “OneTimeKernel” will finish.
  3. Once the modules for both kernels are loaded, there is no longer any interaction with the module loading system, and so subsequent persistent/concurrent activity works “as expected”.
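The launch ordering described in the steps above can be sketched roughly as follows. This is only an illustration: the kernel names match the discussion, but the kernel bodies, the stop flag in mapped pinned memory, and the CHECK macro are assumptions made for the example, not the original poster's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(e_));                       \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

// Hypothetical persistent kernel: spins until the host sets *stop.
__global__ void PersistentKernel(volatile int *stop) {
    while (*stop == 0) { }
}

// Hypothetical short-running kernel.
__global__ void OneTimeKernel(int *out) { *out = 42; }

int main() {
    // Stop flag in mapped pinned memory so the host can signal the
    // running kernel (assumes a system with unified addressing).
    volatile int *stopHost = nullptr;
    int *stopDev = nullptr;
    CHECK(cudaHostAlloc((void **)&stopHost, sizeof(int), cudaHostAllocMapped));
    *stopHost = 0;
    CHECK(cudaHostGetDevicePointer((void **)&stopDev, (void *)stopHost, 0));

    int *out = nullptr;
    CHECK(cudaMalloc((void **)&out, sizeof(int)));

    cudaStream_t sPersistent, sOneTime;
    CHECK(cudaStreamCreate(&sPersistent));
    CHECK(cudaStreamCreate(&sOneTime));

    // Step 1: launch the short kernel first, so any module load it
    // needs happens while nothing persistent is running.
    OneTimeKernel<<<1, 1, 0, sOneTime>>>(out);
    CHECK(cudaStreamSynchronize(sOneTime));

    // Step 2: the persistent kernel may also need a load, but no other
    // kernel is running at this point to block it.
    PersistentKernel<<<1, 1, 0, sPersistent>>>(stopDev);
    CHECK(cudaGetLastError());

    // Step 3: both kernels have been launched once, so this launch
    // should not need any further module loading.
    OneTimeKernel<<<1, 1, 0, sOneTime>>>(out);
    CHECK(cudaStreamSynchronize(sOneTime));

    // Tell the persistent kernel to exit, then wait for it.
    *stopHost = 1;
    CHECK(cudaStreamSynchronize(sPersistent));

    CHECK(cudaFree(out));
    CHECK(cudaFreeHost((void *)stopHost));
    std::printf("done\n");
    return 0;
}
```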

You can emulate eager loading, without using the env var, by calling cudaFuncGetAttributes() on every kernel you need, prior to entering any region of concurrent execution.
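A minimal sketch of that approach might look like the following. The preloadKernel() helper, the kernel bodies, and the stop flag are hypothetical names introduced for illustration; only cudaFuncGetAttributes() itself comes from the suggestion above. The persistent kernel is deliberately launched first here, which is the ordering that was reported to hang.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical persistent kernel: spins until the host sets *stop.
__global__ void PersistentKernel(volatile int *stop) {
    while (*stop == 0) { }
}

// Hypothetical short-running kernel.
__global__ void OneTimeKernel(int *out) { *out = 42; }

// Hypothetical helper: querying a kernel's attributes is used here as a
// way to get its code loaded before any concurrent work starts.
template <typename KernelT>
void preloadKernel(KernelT kernel, const char *name) {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "preload of %s failed: %s\n", name,
                     cudaGetErrorString(err));
        std::exit(1);
    }
}

int main() {
    // Touch every kernel up front, before anything persistent runs.
    preloadKernel(PersistentKernel, "PersistentKernel");
    preloadKernel(OneTimeKernel, "OneTimeKernel");

    // Stop flag in mapped pinned memory (assumes unified addressing).
    volatile int *stopHost = nullptr;
    int *stopDev = nullptr;
    cudaHostAlloc((void **)&stopHost, sizeof(int), cudaHostAllocMapped);
    *stopHost = 0;
    cudaHostGetDevicePointer((void **)&stopDev, (void *)stopHost, 0);

    int *out = nullptr;
    cudaMalloc((void **)&out, sizeof(int));

    cudaStream_t sPersistent, sOneTime;
    cudaStreamCreate(&sPersistent);
    cudaStreamCreate(&sOneTime);

    // Persistent kernel first, then the short kernel in another stream.
    PersistentKernel<<<1, 1, 0, sPersistent>>>(stopDev);
    OneTimeKernel<<<1, 1, 0, sOneTime>>>(out);
    cudaStreamSynchronize(sOneTime);

    // Signal the persistent kernel to exit and wait for it.
    *stopHost = 1;
    cudaStreamSynchronize(sPersistent);

    cudaFree(out);
    cudaFreeHost((void *)stopHost);
    std::printf("done\n");
    return 0;
}
```

Compared with the environment variable, this keeps the choice inside the application, at the cost of having to list every kernel that may run concurrently with the persistent one.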

Lazy loading now has its own full section in the programming guide.