Issue with CUDA Kernel Parallel Scheduling

I’ve encountered an issue while testing kernel parallel scheduling. When I organize all kernel implementations in the same module, load the entire module at the beginning of execution, and then launch all kernels one after another, kernels that should run in parallel end up executing serially.

However, if I organize all kernel implementations in the same module but only resolve each kernel right before it needs to execute, everything works as expected: kernels meant to run in parallel actually run in parallel, and kernels with dependencies execute in order.
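To make the two patterns concrete, here is a minimal driver-API sketch of the second pattern, resolving each kernel just before its launch. Everything here is hypothetical (the module file `kernels.cubin`, the kernel names, the launch configuration); error checking is omitted for brevity.

```cpp
// Sketch only: all names and launch parameters are hypothetical.
#include <cuda.h>
#include <string>
#include <vector>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the module once, up front.
    CUmodule mod;
    cuModuleLoad(&mod, "kernels.cubin");  // hypothetical module file

    std::vector<CUstream> streams(4);
    for (auto& s : streams)
        cuStreamCreate(&s, CU_STREAM_NON_BLOCKING);

    // Resolve each kernel immediately before launching it. Under lazy
    // module loading, the first use of a function is what triggers the
    // actual load of its code, so when that first use happens matters.
    for (int i = 0; i < 4; ++i) {
        CUfunction fn;
        std::string name = "kernel_" + std::to_string(i);  // hypothetical name
        cuModuleGetFunction(&fn, mod, name.c_str());
        cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1,
                       0 /* shared mem */, streams[i],
                       nullptr /* params */, nullptr);
    }
    cuCtxSynchronize();
    return 0;
}
```

The first pattern described above would instead call `cuModuleGetFunction` for every kernel at startup and only launch later.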

Could this be because I’m launching too many kernels at once (hundreds of them), causing some underlying CUDA mechanism to schedule them serially? Or is there another reason for this behavior?

Any insights would be greatly appreciated.


Do you observe any change in this behavior after you set the environment variable CUDA_MODULE_LOADING=EAGER?
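For reference, this opts out of lazy module loading (available since CUDA 11.7 and the default in recent CUDA 12 releases) and must be set in the environment before the application initializes CUDA:

```shell
# Force eager (load-time) module loading instead of lazy loading,
# so no kernel's first launch triggers a blocking code load.
export CUDA_MODULE_LOADING=EAGER
echo "CUDA_MODULE_LOADING=$CUDA_MODULE_LOADING"
```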

Looking at your profiler output, you seem to have over 100 streams.
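One related detail worth checking (my addition, not something established in this thread): the driver multiplexes streams onto a limited number of hardware work queues, controlled by CUDA_DEVICE_MAX_CONNECTIONS (default 8, maximum 32). With over 100 streams, many streams necessarily share a queue, which can create false dependencies between logically independent kernels:

```shell
# Raise the number of hardware work queues from the default of 8.
# More streams than queues means several streams share one queue,
# which can serialize logically independent kernels.
export CUDA_DEVICE_MAX_CONNECTIONS=32
echo "CUDA_DEVICE_MAX_CONNECTIONS=$CUDA_DEVICE_MAX_CONNECTIONS"
```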

This post may explain what you’re seeing.