Is a Turing SM divided into 4 partitions, each managed by its own warp scheduler?

Assuming a job needs only 32 threads to execute, what is the best schedule for performance: executing all the work in one warp, or spreading it across 4 warps with 8 active threads each?

I’m not sure whether a warp scheduler is able to access all CUDA cores within one SM.

It’s not clear what you are proposing.

If you are suggesting launching a thread block of 32 threads composed of 4 warps of 8 threads each, that simply cannot be done: a warp always spans 32 consecutive thread slots.

If you are proposing to launch a thread block of 128 threads (4 warps), where each warp only has 8 active threads, I don’t know of any reason that would be faster than a single warp of 32 threads in the general case.

A warp scheduler is not able to access all the CUDA cores in a Turing SM, but I see no reason why breaking one schedulable instruction that needs 32 CUDA cores into 4 instructions, each of which still requires 32 CUDA cores, would provide any benefit.

Anyway, you could always benchmark it.
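A micro-benchmark along those lines might look like the sketch below (the kernel names and the dummy arithmetic are my own, purely illustrative choices, not anything from this thread). It times one full warp of 32 threads against 4 warps with only 8 active threads each, doing the same total work, using CUDA events:

```cuda
// Hypothetical benchmark sketch: 1 warp x 32 active threads vs.
// 4 warps x 8 active threads, same total work. Results will vary by GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void one_warp(float *out, int iters) {
    // blockDim.x == 32: every lane of the single warp is active
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
    out[threadIdx.x] = v;
}

__global__ void four_warps_partial(float *out, int iters) {
    // blockDim.x == 128: only lanes 0..7 of each warp do work,
    // so 4 warps * 8 active lanes = 32 working threads in total
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane < 8) {
        float v = (float)(warp * 8 + lane);
        for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
        out[warp * 8 + lane] = v;
    }
}

static float time_ms(void (*launch)(float *, int), float *d_out, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

static void launch_one_warp(float *d_out, int iters)    { one_warp<<<1, 32>>>(d_out, iters); }
static void launch_four_warps(float *d_out, int iters)  { four_warps_partial<<<1, 128>>>(d_out, iters); }

int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    const int iters = 1 << 20;

    // Warm-up launches so the timed runs exclude one-time startup cost
    launch_one_warp(d_out, iters);
    launch_four_warps(d_out, iters);
    cudaDeviceSynchronize();

    printf("1 warp  x 32 active threads: %.3f ms\n", time_ms(launch_one_warp, d_out, iters));
    printf("4 warps x  8 active threads: %.3f ms\n", time_ms(launch_four_warps, d_out, iters));

    cudaFree(d_out);
    return 0;
}
```

If a partially active warp really did free up CUDA cores, the second configuration could win; if each warp instruction occupies the full 32-lane datapath regardless of predication, the two times should be roughly comparable (or the 4-warp version slightly worse due to scheduling overhead).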

My initial assumption was that a warp with 8 active threads doesn’t occupy 32 CUDA cores, but perhaps only 8.

> breaking a schedulable instruction for 32 CUDA cores into 4 instructions each of which requires 32 CUDA cores

If that is the case, then it indeed provides no benefit. Thanks for your explanation!
