I learnt that a warp is a collection of threads executing the same instruction. So basically a thread block of 512 threads is broken down into 512/32 (assuming a warp size of 32) = 16 warps.
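Just to make that concrete, here is a minimal sketch of how I picture the split (the kernel name and output array are made up, the warp/lane arithmetic is just the 512/32 division above):

__global__ void show_warp_layout(int *warp_index)   // hypothetical kernel
{
    int tid  = threadIdx.x;      // 0..511 within a 512-thread block
    int warp = tid / 32;         // which warp this thread belongs to: 0..15
    int lane = tid % 32;         // position within the warp: 0..31
    if (lane == 0)
        warp_index[warp] = warp; // one thread per warp records its warp number
}
// launched as show_warp_layout<<<1, 512>>>(d_out);  => 512/32 = 16 warps per block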
Now if I have a GPU with, say, a (made-up) SM containing 8 SPs, how are the warps allocated?
Is it 1 warp allocated to the whole SM for execution (case 1), or
1 warp per SP, making 8 warps for that SM (case 2)?
If case 1 is the right understanding, then I can execute 8 threads from the “same” warp concurrently, since 8 threads of the same warp run on the 8 SPs.
If case 2 is the right understanding, then I can execute 8 threads concurrently, each belonging to 8 “different” warps.
For both cases I assume that an SP, at its most basic level, can handle only one thread at a time, so 8 SPs handling 8 threads gives a concurrency of 8.
I don't plan to go deeper than this level of understanding, at least for now. Enlightening me on this would be a great help :)
Ouch, it is more complicated than that and a little different.
True, there are 8 SPs per SM. But because execution is interleaved 4-way, these 8 SPs (i.e. the SM) execute a full warp (32 threads) at once.
So the minimum is 32 threads per SM, each SP executing 4 threads of the same warp over 4 cycles (as long as no “long” instruction or memory access is involved).
All 8 SPs on a single SM execute the same instruction of the same warp at a time (SIMD-like), over 4 cycles, to carry out the 32 computations of a full warp.
There is NEVER any ability for an SP to execute a different warp, or a different instruction of the same warp, than the other SPs within its SM. At worst an SP will stay idle (branch divergence, or a warp with fewer than 32 threads).
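As an illustration of that last point, here is a minimal sketch (kernel name made up) where half of each warp takes one branch and the other half takes the other; the SM serializes the two paths, so the SPs mapped to the inactive half-warp simply sit idle:

__global__ void divergent(int *out)       // hypothetical kernel
{
    int lane = threadIdx.x % 32;          // lane id within the warp
    if (lane < 16)
        out[threadIdx.x] = 2 * lane;      // lanes 0..15 execute while 16..31 idle
    else
        out[threadIdx.x] = lane + 100;    // then lanes 16..31 execute while 0..15 idle
}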
Allocation is 1…24 warps per SM on usual devices (1…32 warps per SM on compute capability 1.3+ devices).
Oops, I forgot to add: all warps currently resident on an SM (if registers and shared memory allow them to run) are usually served on a round-robin basis, but this might change. So the actual behaviour when you have many warps on one SM is that they run in sequence, each one taking 4 cycles or more for its 32 concurrent threads.
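If you want to check those per-SM limits on your own card instead of hard-coding them, a small host-side sketch (assuming a runtime recent enough to expose maxThreadsPerMultiProcessor) is:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("warp size          : %d\n", prop.warpSize);
    printf("SM count           : %d\n", prop.multiProcessorCount);
    printf("max threads per SM : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("=> max warps per SM: %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}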