On Volta, there is no longer a single PC per warp; instead, each of the 32 threads in a warp keeps its own PC. Thus, there is no implicit assumption of lock-step execution, although I understand that a schedule optimizer will still try to group the threads of a warp that are at the same instruction so they execute together for better performance. Is my understanding correct thus far?
My question is as follows: Say warp 5 is handled on one SM by “Warp Scheduler A”.
Assume the warp has diverged, and that threads 0 through 9 are about to execute instruction “foo” while threads 10 through 31 are about to execute instruction “bar”.
Is it true that, even on Volta, “Warp Scheduler A” still cannot issue “foo” and “bar” for execution at the same time? And is this why the schedule optimizer has the job of bringing the threads of the warp back to the same instruction, given that with their own PCs the threads could otherwise go wherever they want?
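To make the scenario concrete, here is a toy host-side C++ model of the behavior I am asking about. It is only a sketch of my assumption, not a description of how the hardware actually works: the names `Group`, `Issue`, `lane_mask` and `issue_serialized` are made up for illustration, and the rule "one instruction per warp per issue slot" is exactly the point I would like confirmed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One divergent group of a warp: the threads that currently share a PC.
struct Group {
    std::string instruction;  // next instruction of the group, e.g. "foo"
    std::uint32_t mask;       // bit i set => thread i belongs to the group
};

// One issue slot of the (hypothetical) warp scheduler: a single
// instruction plus the mask of threads that execute it in that slot.
struct Issue {
    std::string instruction;
    std::uint32_t active_mask;
};

// Mask with bits first..last (inclusive) set, for 0 <= first <= last <= 31.
std::uint32_t lane_mask(int first, int last) {
    assert(0 <= first && first <= last && last <= 31);
    std::uint32_t upto_last =
        (last == 31) ? 0xFFFFFFFFu : ((1u << (last + 1)) - 1u);
    std::uint32_t below_first = (1u << first) - 1u;
    return upto_last & ~below_first;
}

// ASSUMPTION being asked about: the scheduler issues one instruction per
// warp per slot, so divergent groups are serialized into separate slots.
std::vector<Issue> issue_serialized(const std::vector<Group>& groups) {
    std::vector<Issue> slots;
    std::uint32_t seen = 0;
    for (const Group& g : groups) {
        assert((seen & g.mask) == 0);  // each thread has exactly one PC
        seen |= g.mask;
        slots.push_back(Issue{g.instruction, g.mask});
    }
    return slots;
}
```

Calling `issue_serialized` with the groups `{"foo", lane_mask(0, 9)}` and `{"bar", lane_mask(10, 31)}` gives two issue slots with active masks `0x000003FF` and `0xFFFFFC00`. If “Warp Scheduler A” could issue both instructions at the same time, the model would instead need a single slot carrying two instructions, which is the case I suspect Volta does not support.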