I learnt that a warp is a collection of threads executing the same instruction. So basically a thread block of 512 threads is broken down into 512/32 (assuming a warp size of 32) = 16 warps.
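Just to make that concrete, here is a minimal sketch of how I picture the split (the kernel name and output array are made up, the warp/lane arithmetic is just the 512/32 division above):

__global__ void show_warp_layout(int *warp_index)   // hypothetical kernel
{
    int tid  = threadIdx.x;      // 0..511 within a 512-thread block
    int warp = tid / 32;         // which warp this thread belongs to: 0..15
    int lane = tid % 32;         // position within the warp: 0..31
    if (lane == 0)
        warp_index[warp] = warp; // one thread per warp records its warp number
}
// launched as show_warp_layout<<<1, 512>>>(d_out);  => 512/32 = 16 warps per block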
Now if I have a GPU with, say, a (made-up) SM containing 8 SPs, how are the warps allocated?
Is it 1 warp allocated to the whole SM for execution (case 1), or
1 warp per SP, making 8 warps for that SM (case 2)?
If case 1 is the right understanding, then I can execute 8 threads from the “same” warp concurrently, since 8 threads of the same warp run on the 8 SPs.
If case 2 is the right understanding, then I can execute 8 threads concurrently, each belonging to 8 “different” warps.
For both cases I assume that an SP, at its most basic level, can handle only one thread at a time, so 8 SPs handling 8 threads gives a concurrency of 8.
I don't plan to go deeper than this level of understanding, at least for now. Enlightening me on this would be a great help :)
Ouch, it is more complicated than that and a little different.
True, there are 8 SPs per SM. But because execution is interleaved 4-way, these 8 SPs (i.e. the SM) execute a full warp (32 threads) at once.
So the minimum is 32 threads per SM, each SP executing 4 threads of the same warp over 4 cycles (as long as no “long” instruction or memory access is involved).
All 8 SPs on a single SM execute the same instruction of the same warp at a time (SIMD-like), over 4 cycles, to carry out the 32 computations of a full warp.
There is NEVER any ability for an SP to execute a different warp, or a different instruction of the same warp, than the other SPs within its SM. At worst an SP will stay idle (branch divergence, or a warp with fewer than 32 threads).
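As an illustration of that last point, here is a minimal sketch (kernel name made up) where half of each warp takes one branch and the other half takes the other; the SM serializes the two paths, so the SPs mapped to the inactive half-warp simply sit idle:

__global__ void divergent(int *out)       // hypothetical kernel
{
    int lane = threadIdx.x % 32;          // lane id within the warp
    if (lane < 16)
        out[threadIdx.x] = 2 * lane;      // lanes 0..15 execute while 16..31 idle
    else
        out[threadIdx.x] = lane + 100;    // then lanes 16..31 execute while 0..15 idle
}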
Allocation is 1…24 warps per SM on usual devices (1…32 warps per SM on compute capability 1.3+ devices).
Oops, I forgot to add: all warps currently resident on an SM (if registers and shared memory allow them to run) are usually served on a round-robin basis, but this might change. So the actual behaviour when you have many warps on one SM is that they run in sequence, each one taking 4 cycles or more for its 32 concurrent threads.
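If you want to check those per-SM limits on your own card instead of hard-coding them, a small host-side sketch (assuming a runtime recent enough to expose maxThreadsPerMultiProcessor) is:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("warp size          : %d\n", prop.warpSize);
    printf("SM count           : %d\n", prop.multiProcessorCount);
    printf("max threads per SM : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("=> max warps per SM: %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}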