I'm trying to understand the GPU's architecture.
How are warps executed on a multiprocessor? A warp is 32 threads wide,
but each SM has only 8 SPs, so I do not understand how a warp can be executed
simultaneously (as long as no divergence happens).
It is quite simple: it takes 4 cycles to 'execute' a warp (4 * 8 = 32) :)
More precisely, it takes 4 cycles to issue an instruction for a warp; because of the pipeline depth, the instruction takes roughly 20-24 cycles to actually complete. That is why it is recommended to keep at least 6 warps resident on a multiprocessor, so their issues overlap and hide the pipeline latency.
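The arithmetic behind that rule of thumb can be sketched as follows (a back-of-the-envelope calculation; the 24-cycle latency figure is an assumption typical of early CUDA hardware, chosen because it matches the 6-warp recommendation):

```python
import math

# Assumed hardware parameters (classic G80-era CUDA GPU)
WARP_SIZE = 32          # threads per warp
SPS_PER_SM = 8          # scalar processors per multiprocessor
PIPELINE_LATENCY = 24   # cycles until an instruction's result is ready (assumption)

# Cycles needed to issue one instruction for a whole warp: 32 / 8 = 4
issue_cycles = WARP_SIZE // SPS_PER_SM

# Warps needed so that by the time the first warp's result is ready,
# the scheduler has had other warps to issue: 24 / 4 = 6
warps_to_hide_latency = math.ceil(PIPELINE_LATENCY / issue_cycles)

print(issue_cycles)            # 4 cycles per warp instruction
print(warps_to_hide_latency)   # 6 warps
```

With only one resident warp the SPs would sit idle for most of those latency cycles; with 6 warps the scheduler can issue a new warp instruction every 4 cycles.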
So a warp does not really run simultaneously in hardware, but from a programming point of view it does: you will not see an update written by one thread of a warp in another thread of that warp during the same instruction.
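This lock-step behaviour is what classic warp-synchronous code exploits. A minimal sketch, assuming pre-Volta hardware where a warp really does execute one instruction at a time (names are illustrative; on Volta and later, independent thread scheduling means you should use `__syncwarp()` or `__shfl_down_sync()` instead):

```cuda
// Warp-synchronous reduction over 32 floats by a single warp.
// Each line is one instruction for the whole warp: a thread reads its
// neighbour's value as written by a *previous* instruction, never by the
// current one, so no __syncthreads() is needed inside the warp.
// 'volatile' stops the compiler caching sdata in registers between steps.
__global__ void warp_reduce(const float *in, float *out)
{
    __shared__ volatile float sdata[32];
    unsigned tid = threadIdx.x;      // assumes blockDim.x == 32 (one warp)
    sdata[tid] = in[tid];

    if (tid < 16) sdata[tid] += sdata[tid + 16];
    if (tid <  8) sdata[tid] += sdata[tid +  8];
    if (tid <  4) sdata[tid] += sdata[tid +  4];
    if (tid <  2) sdata[tid] += sdata[tid +  2];
    if (tid <  1) sdata[tid] += sdata[tid +  1];

    if (tid == 0) *out = sdata[0];
}
```

Note that the `if (tid < N)` lines are also exactly where divergence would appear on paper; here it is harmless because the inactive threads simply idle for that instruction.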