Loops in kernels

I read that CUDA serializes the execution of divergent branches within a thread warp, and I am a little confused as to what this means.

If the threads in a warp run for loops with different trip counts, do they execute the iterations they have in common in lockstep, so that every thread takes as long as the one with the longest loop? Or does CUDA execute each thread's loop serially, one after another?

Consider the following kernel:

__global__ void branchTest()
{
    int idx = threadIdx.x;

    for (int i = 0; i < idx; ++i) {
        // do some stuff
    }
}

If there are 32 threads in a block, does this result in the processor being occupied for 32 iterations of the loop, or for (32 + 31 + 30 + … + 3 + 2 + 1) = 528 iterations?

The warp keeps looping for as long as any of its threads is still inside the loop; it does not run each thread's loop serially. Lanes whose exit condition is already satisfied are simply masked off (made inactive) until the rest finish. So in your example the warp takes just 31 trips through the loop body, the trip count of its slowest thread, assuming this warp's threadIdx.x values range from 0 to 31. (Careful: threadIdx.x goes up to the number of threads in the block, not the warp, so in a larger block the later warps would loop correspondingly longer.)
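One way to convince yourself of this is to time the loop from inside the kernel. The sketch below is my own illustration, not part of the original question: the kernel name branchTimeTest, the volatile sink variable, and the host scaffolding are all invented for the experiment. Every lane should report roughly the same elapsed cycle count, because the lanes that exit early sit masked off until the slowest lane (idx == 31) finishes its 31 iterations. On GPUs with independent thread scheduling (Volta and later) the raw numbers can vary a bit, but the warp as a whole still occupies the SM for the longest lane's iterations.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void branchTimeTest(long long *cycles)
{
    int idx = threadIdx.x;
    volatile int sink = 0;            // volatile keeps the loop from being optimized away
    long long start = clock64();      // per-SM cycle counter

    for (int i = 0; i < idx; ++i) {
        sink += i;                    // stand-in for "do some stuff"
    }

    cycles[idx] = clock64() - start;  // read once the warp reconverges
}

int main()
{
    const int N = 32;                 // exactly one warp
    long long *d_cycles, h_cycles[N];
    cudaMalloc(&d_cycles, N * sizeof(long long));

    branchTimeTest<<<1, N>>>(d_cycles);
    cudaMemcpy(h_cycles, d_cycles, N * sizeof(long long), cudaMemcpyDeviceToHost);

    // Expect roughly equal cycle counts for all 32 threads, not counts
    // proportional to each thread's own idx.
    for (int i = 0; i < N; ++i)
        printf("thread %2d: %lld cycles\n", i, h_cycles[i]);

    cudaFree(d_cycles);
    return 0;
}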

As the CUDA C Programming Guide puts it:

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
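For the plain if/else case the guide is describing, the serialization looks like this (a toy sketch; the kernel name divergeTest and its arithmetic are made up for illustration):

__global__ void divergeTest(float *out)
{
    int idx = threadIdx.x;

    if (idx % 2 == 0) {
        out[idx] = idx * 2.0f;   // path A runs first, odd lanes masked off
    } else {
        out[idx] = idx * 3.0f;   // then path B runs, even lanes masked off
    }
    // Both paths complete, and the lanes converge back to the same
    // execution path; the warp has paid roughly cost(A) + cost(B),
    // not max(cost(A), cost(B)).
}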