I’ve encountered an issue while testing parallel kernel scheduling. When I put all kernel implementations in the same module, load the entire module once at the start of execution, and then launch all the kernels up front in sequence, kernels that should run in parallel end up executing serially.
However, if I keep all kernel implementations in the same module but launch each kernel only right before it needs to execute, everything works as expected: independent kernels run in parallel, and kernels that need to synchronize run in order.
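To make the setup concrete, here is a simplified sketch of the distinction I suspect matters (the kernel names, sizes, and launch configuration are placeholders, not my actual code): independent kernels can only overlap when they are launched into different streams, while anything issued into the same stream is serialized by CUDA's stream-ordering guarantee.

```cuda
// Sketch only: kernelA/kernelB and N are hypothetical placeholders.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            return 1;                                                     \
        }                                                                 \
    } while (0)

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void kernelB(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *a, *b;
    CHECK(cudaMalloc(&a, N * sizeof(float)));
    CHECK(cudaMalloc(&b, N * sizeof(float)));

    // Pattern that CAN overlap: independent kernels on separate
    // non-blocking streams.
    cudaStream_t s1, s2;
    CHECK(cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking));
    kernelA<<<(N + 255) / 256, 256, 0, s1>>>(a, N);
    kernelB<<<(N + 255) / 256, 256, 0, s2>>>(b, N);
    CHECK(cudaDeviceSynchronize());

    // Pattern that ALWAYS serializes: both kernels issued into the
    // same stream (here the default stream), which imposes launch
    // order as an execution dependency even for independent work.
    kernelA<<<(N + 255) / 256, 256>>>(a, N);
    kernelB<<<(N + 255) / 256, 256>>>(b, N);
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```

In my real code the launch pattern is the same in both cases; only the timing of the launches differs, which is why the serialization surprises me.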
Could this be because I’m launching too many kernels at once (hundreds of them), causing some underlying CUDA scheduling mechanism to serialize them? Or is there another explanation for this behavior?
Any insights would be greatly appreciated.