Loops in kernels

I read that CUDA serializes the execution of divergent branches within a thread warp, and I am a little confused as to what this means.

If the threads in a warp run for loops with different trip counts, do they execute the iterations they have in common in lockstep, so that every thread takes as long as the one with the longest loop? Or does CUDA execute each thread's loop serially, one after another?

Consider the following kernel:

__global__ void branchTest()
{
    int idx = threadIdx.x;

    for (int i = 0; i < idx; ++i) {
        // do some stuff
    }
}

If there are 32 threads in a block, does this result in the processor being occupied for 32 iterations of the loop, or for (32 + 31 + 30 + … + 3 + 2 + 1) = 528 iterations?

The warp keeps looping for as long as any of its threads is still inside the loop; it does not run each thread's loop serially. Lanes whose exit condition is already satisfied are simply masked off (made inactive) until the rest finish. So in your example the warp takes just 31 trips through the loop body, the trip count of its slowest thread, assuming this warp's threadIdx.x values range from 0 to 31. (Careful: threadIdx.x goes up to the number of threads in the block, not the warp, so in a larger block the later warps would loop correspondingly longer.)
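One way to convince yourself of this is to time the loop from inside the kernel. The sketch below is my own illustration, not part of the original question: the kernel name branchTimeTest, the volatile sink variable, and the host scaffolding are all invented for the experiment. Every lane should report roughly the same elapsed cycle count, because the lanes that exit early sit masked off until the slowest lane (idx == 31) finishes its 31 iterations. On GPUs with independent thread scheduling (Volta and later) the raw numbers can vary a bit, but the warp as a whole still occupies the SM for the longest lane's iterations.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void branchTimeTest(long long *cycles)
{
    int idx = threadIdx.x;
    volatile int sink = 0;            // volatile keeps the loop from being optimized away
    long long start = clock64();      // per-SM cycle counter

    for (int i = 0; i < idx; ++i) {
        sink += i;                    // stand-in for "do some stuff"
    }

    cycles[idx] = clock64() - start;  // read once the warp reconverges
}

int main()
{
    const int N = 32;                 // exactly one warp
    long long *d_cycles, h_cycles[N];
    cudaMalloc(&d_cycles, N * sizeof(long long));

    branchTimeTest<<<1, N>>>(d_cycles);
    cudaMemcpy(h_cycles, d_cycles, N * sizeof(long long), cudaMemcpyDeviceToHost);

    // Expect roughly equal cycle counts for all 32 threads, not counts
    // proportional to each thread's own idx.
    for (int i = 0; i < N; ++i)
        printf("thread %2d: %lld cycles\n", i, h_cycles[i]);

    cudaFree(d_cycles);
    return 0;
}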

As the CUDA C Programming Guide puts it:

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
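For the plain if/else case the guide is describing, the serialization looks like this (a toy sketch; the kernel name divergeTest and its arithmetic are made up for illustration):

__global__ void divergeTest(float *out)
{
    int idx = threadIdx.x;

    if (idx % 2 == 0) {
        out[idx] = idx * 2.0f;   // path A runs first, odd lanes masked off
    } else {
        out[idx] = idx * 3.0f;   // then path B runs, even lanes masked off
    }
    // Both paths complete, and the lanes converge back to the same
    // execution path; the warp has paid roughly cost(A) + cost(B),
    // not max(cost(A), cost(B)).
}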