Are Turing SMs divided into 4 partitions, each managed by its own warp scheduler?

Assuming there is a job that only needs 32 threads to execute, what's the best schedule for performance: executing all the work in one warp, or spreading it across 4 warps with 8 active threads each?

I'm not sure whether a warp scheduler is able to access all the CUDA cores within one SM.

It’s not clear what you are proposing.

If you are suggesting launching a thread block of 32 threads that is composed of 4 warps of 8 threads each, that simply cannot be done: warps always contain 32 threads, so a 32-thread block is exactly one warp.

If you are proposing to launch a thread block of 128 threads (4 warps), where each warp has only 8 active threads, I don't know of any reason that would be faster than a single warp of 32 threads in the general case.
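For concreteness, the two launch shapes under discussion might look like the sketch below. The kernel names and the per-thread work are illustrative placeholders, not anything from the original post:

```cuda
// Configuration A: one block of 32 threads = one warp, all lanes active.
__global__ void one_warp(float *out) {
    int i = threadIdx.x;              // 0..31
    out[i] = i * 2.0f;                // placeholder for the real work
}

// Configuration B: one block of 128 threads = 4 warps, but only the
// first 8 lanes of each warp do any work; the other 24 lanes are idle.
__global__ void four_sparse_warps(float *out) {
    int lane = threadIdx.x % 32;      // lane index within the warp
    int warp = threadIdx.x / 32;      // warp index within the block
    if (lane < 8) {
        int i = warp * 8 + lane;      // 4 warps x 8 lanes = 32 work items
        out[i] = i * 2.0f;
    }
}

// Launches:
//   one_warp<<<1, 32>>>(d_out);
//   four_sparse_warps<<<1, 128>>>(d_out);
```

Both variants compute the same 32 results; configuration B simply predicates off 24 of the 32 lanes in every warp.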

A warp scheduler is not able to access all the CUDA cores in Turing, but I see no reason why breaking a schedulable instruction for 32 CUDA cores into 4 instructions each of which requires 32 CUDA cores would provide any benefit.

Anyway, you could always benchmark it.
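A minimal way to benchmark a launch shape is with CUDA events. This is a hedged sketch: `job` is a stand-in kernel for whichever variant you want to measure, error checking is omitted, and the iteration count is arbitrary:

```cuda
#include <cstdio>

// Stand-in kernel; replace with the configuration under test.
__global__ void job(float *out) {
    out[threadIdx.x] = threadIdx.x * 2.0f;    // placeholder work
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 128 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 10000;                  // repeat to amortize overhead
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        job<<<1, 32>>>(d_out);                // e.g. the single-warp variant
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // total time in milliseconds
    printf("%.3f us per launch\n", ms / iters * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Swapping in the 4-warp variant (`<<<1, 128>>>` with lane masking) and comparing the two averages would answer the question directly on your own hardware.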

My initial assumption was that a warp with only 8 active threads doesn't occupy 32 CUDA cores, maybe only 8.

"breaking a schedulable instruction for 32 CUDA cores into 4 instructions each of which requires 32 CUDA cores"

If that is the case, it indeed doesn't provide any benefit. Thanks for your explanation!