Suppose a job needs only 32 threads to execute. What is the best schedule for performance?
Should all the work run in one warp, or should it be spread across 4 warps, each with 8 active threads?
I’m not sure whether a warp scheduler is able to access all the CUDA cores within an SM.
If you are suggesting launching a thread block of 32 threads that is composed of 4 warps of 8 threads each, that simply cannot be done: a block of 32 threads always forms exactly one warp.
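To make the warp arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It assumes the warp size is 32, as on current NVIDIA GPUs, and that threads are packed into warps in order, so only the last warp of a block can be partially filled. The names `kWarpSize` and `warps_per_block` are made up for illustration and are not part of the CUDA API.

```cpp
#include <cassert>

// Assumption: warp size of 32 threads, as on current NVIDIA GPUs.
const int kWarpSize = 32;

// Hypothetical helper: how many warps a block of `threads_per_block`
// threads is split into. Threads fill warps in order, so this is a
// ceiling division by the warp size.
inline int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

Under these assumptions, `warps_per_block(32)` is 1 and `warps_per_block(128)` is 4, so the only way to get 4 warps is to launch more than 96 threads and leave some of them idle.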
If you are proposing to launch a thread block of 128 threads (4 warps), where each warp has only 8 active threads, I don’t know of any reason that would be faster than a single warp of 32 threads in the general case.
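One rough way to compare the two layouts is to count how many lane slots do useful work. An instruction is issued for a whole warp even when some of its lanes are inactive, so 4 warps with 8 active threads each need 4 issues where a full warp needs 1. The sketch below is plain host-side C++ that only does this bookkeeping; `kLanesPerWarp` and `lane_utilization` are hypothetical names, and lane utilization is an illustrative proxy, not a measurement of actual performance.

```cpp
#include <cassert>

// Assumption: 32 lanes per warp, as on current NVIDIA GPUs.
const int kLanesPerWarp = 32;

// Hypothetical helper: fraction of lane slots that carry an active
// thread when `warps` warps each have `active_threads_per_warp`
// active threads.
inline double lane_utilization(int warps, int active_threads_per_warp) {
    assert(warps > 0);
    assert(active_threads_per_warp >= 0 &&
           active_threads_per_warp <= kLanesPerWarp);
    const int active = warps * active_threads_per_warp;  // threads doing work
    const int slots = warps * kLanesPerWarp;             // lanes occupied
    return static_cast<double>(active) / slots;
}
```

With these definitions, `lane_utilization(1, 32)` is 1.0 and `lane_utilization(4, 8)` is 0.25: the same 32 threads of work occupy four times as many lane slots in the 4-warp layout.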