Assuming there is a job that only needs 32 threads to execute, what's the best schedule for performance?
Executing all the work in one warp, or spreading it across 4 warps with 8 active threads in each warp?
I'm not sure whether a warp scheduler is able to access all the CUDA cores within one SM.
It’s not clear what you are proposing.
If you are suggesting launching a thread block of 32 threads that is composed of 4 warps of 8 threads each, that simply cannot be done: warps are always formed from 32 consecutive threads, so a 32-thread block is exactly one warp.
If you are proposing launching a thread block of 128 threads (4 warps), where each warp only has 8 active threads, I don't know of any reason that would be faster than a single warp of 32 threads in the general case.
A warp scheduler is not able to access all the CUDA cores on a Turing SM, but I see no reason why breaking a schedulable instruction for 32 CUDA cores into 4 instructions, each of which still requires 32 CUDA cores' worth of issue resources, would provide any benefit.
Anyway, you could always benchmark it.
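A micro-benchmark along those lines could be sketched as below. This is an illustrative sketch only, not code from the thread: the kernel names, the dummy arithmetic loop, and the iteration count are all made up for the purpose of comparing the two launch shapes with CUDA events.

```cuda
#include <cstdio>

// One warp, all 32 lanes active, each doing the same dummy work.
__global__ void one_full_warp(float *out, const float *in, int iters) {
    int t = threadIdx.x;               // 0..31
    float v = in[t];
    for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
    out[t] = v;
}

// Four warps (128 threads), but only the first 8 lanes of each warp
// are active; the 32 work items are spread across the 4 warps.
__global__ void four_sparse_warps(float *out, const float *in, int iters) {
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane < 8) {
        int t = warp * 8 + lane;       // map back to the same 32 items
        float v = in[t];
        for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
        out[t] = v;
    }
}

int main() {
    float *in, *out;
    cudaMalloc(&in,  32 * sizeof(float));
    cudaMalloc(&out, 32 * sizeof(float));
    cudaMemset(in, 0, 32 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;
    const int iters = 1 << 20;

    cudaEventRecord(start);
    one_full_warp<<<1, 32>>>(out, in, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("1 warp  x 32 active threads: %f ms\n", ms);

    cudaEventRecord(start);
    four_sparse_warps<<<1, 128>>>(out, in, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("4 warps x  8 active threads: %f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

You would also want to add a warm-up launch and repeat the timing several times before drawing any conclusion, since a single tiny launch is dominated by launch overhead.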
My initial assumption was that a warp with only 8 active threads doesn't occupy 32 CUDA cores, maybe only 8 CUDA cores.
> breaking a schedulable instruction for 32 CUDA cores into 4 instructions each of which requires 32 CUDA cores
If that is the case, then splitting the work indeed provides no benefit. Thanks for your explanation!