I have the following situation: I launch several NCCL send/recv kernels (`ncclSend`/`ncclRecv`), followed by a long GEMM kernel.
It often happens that after the first communication kernel finishes, the next communication kernel is delayed until the GEMM completes, because there are not enough free SM resources. So I'm wondering: is there a way to keep an SM reserved for the NCCL kernels at all times?
I have to chunk the communication because it needs to run as a pipeline, so I can't merge it into one big communication kernel.
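
For reference, here is a simplified sketch of the launch pattern described above. It is not the actual code: the function name `launch_pipeline`, the `long_gemm_placeholder` kernel, the two-stream layout, and all parameters are hypothetical placeholders. It assumes the communicator, buffers, and streams have already been created by the caller.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical stand-in for the long-running GEMM; it is NOT a real matrix
// multiply, just a kernel that occupies the GPU on the compute stream.
__global__ void long_gemm_placeholder(float* c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = c[i] * 2.0f + 1.0f;
}

// Enqueues numChunks send/recv pairs on commStream, then one large kernel on
// computeStream. Error checking is omitted for brevity; every NCCL and CUDA
// call below returns a status that real code should check.
void launch_pipeline(ncclComm_t comm, int peer,
                     const float* sendBuf, float* recvBuf,
                     size_t chunkElems, int numChunks,
                     float* gemmOut, size_t gemmElems,
                     cudaStream_t commStream, cudaStream_t computeStream) {
    // One grouped send/recv per chunk, so each chunk becomes its own
    // communication kernel on commStream.
    for (int k = 0; k < numChunks; ++k) {
        ncclGroupStart();
        ncclSend(sendBuf + k * chunkElems, chunkElems, ncclFloat, peer, comm, commStream);
        ncclRecv(recvBuf + k * chunkElems, chunkElems, ncclFloat, peer, comm, commStream);
        ncclGroupEnd();
    }

    // The long kernel is enqueued after the communication chunks. The problem
    // described above is that chunk k+1 may not start until this kernel ends.
    int threads = 256;
    unsigned blocks = (unsigned)((gemmElems + threads - 1) / threads);
    long_gemm_placeholder<<<blocks, threads, 0, computeStream>>>(gemmOut, gemmElems);
}
```

The placeholder kernel launches enough blocks to cover `gemmElems` elements, so for a large output it can occupy every SM, which is the condition under which the later chunks stall.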