I am looking for a simple example where Volta's new independent thread scheduling shows a performance advantage over the previous per-warp, lock-step execution.
Looking at a simple if/else example, just as presented here: https://devblogs.nvidia.com/inside-volta/ (see fig. 11 and fig. 12), I don’t see any speedup for the new approach.
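For concreteness, here is roughly the pattern I have in mind. This is only my own sketch of the if/else from those figures, not code taken from the blog: the names `divergent_branch` and `run_divergent_branch` are made up, and the arithmetic standing in for A, B, X, Y and Z is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel mirroring the if/else pattern from the figures.
// The arithmetic is a placeholder for the statements A, B, X, Y, Z.
__global__ void divergent_branch(int *out)
{
    int tid = threadIdx.x;
    int v = tid;
    if (tid < 4) {
        v += 10;   // A
        v *= 2;    // B
    } else {
        v += 100;  // X
        v *= 3;    // Y
    }
    out[tid] = v + 1;  // Z
}

// Hypothetical host helper: launches one warp (32 threads) and copies the
// results into h_out, which must hold 32 ints. Returns 0 on success.
int run_divergent_branch(int *h_out)
{
    const int n = 32;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    divergent_branch<<<1, n>>>(d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    int h_out[32];
    if (run_divergent_branch(h_out) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 32; ++i) {
        printf("thread %2d -> %d\n", i, h_out[i]);
    }
    return 0;
}
```

As far as I can tell, both sides of the branch are still executed one after the other for the warp on either architecture, so timing a kernel like this should not show a difference by itself.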
Moreover, the same page says:
(1) “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units”
(2) “Note that execution is still SIMT: at any given clock cycle CUDA cores execute the same instruction for all active threads in a warp just as before”
So, even though each thread has its own call stack and program counter, the scheduler does not create a heterogeneous mix of threads: threads running in parallel will always belong to the same warp and will always execute the same instruction. Apart from the changed execution order of A, B, X, and Y, what else changes? When is it faster?
Any help would be very much appreciated.