So this might be a generic CUDA question, but my understanding is that a necessary condition for concurrent kernel execution is that there are sufficient free resources on the device for both kernels at once. Does that mean the most aggressive code generation for each kernel might not be optimal, in that a kernel consuming a lot of resources (e.g., registers) could prevent concurrent kernel execution?
I am asking this because I have two back-to-back optixLaunch calls on two different streams that I was expecting to execute concurrently. From Nsight, I can see that the first launch has a grid size of [90, 1, 1] and a block size of [64, 1, 1], and the second launch has a grid size of [85, 1, 1] and a block size of [64, 1, 1]. So by all accounts they seem to be small kernels. But both kernels consume 106 registers per thread as reported in Nsight, which I suspect is what's preventing the two kernels from executing concurrently.
Is this a correct understanding?