I read that CUDA serializes the execution of divergent branches within a thread warp, and I am a little confused as to what this means.
If threads in a warp contain FOR loops of different lengths, do they synchronize execution of the iterations they have in common, which results in all threads taking as long as the one with the longest loop? Or does CUDA execute every FOR loop of different length in serial?
Consider the following kernel:
__global__ void branchTest() {
    int idx = threadIdx.x;
    for (int i = 0; i < idx; ++i) {
        // do some stuff
    }
}
If there are 32 threads in a block, does this result in the processor being occupied for 31 iterations of the loop (the longest one, for idx = 31), or for (0 + 1 + 2 + … + 30 + 31) = 496 iterations in serial?
The warp executes the loop in lockstep for as long as any of its threads is still iterating; threads that have already exited are simply masked off for the remaining passes, so the divergent iteration counts are not serialized. In your example, the warp is occupied for 31 iterations — the trip count of the longest loop — assuming this warp's thread indices range from 0 to 31. (Be careful: threadIdx.x ranges up to the number of threads in the block, not the warp, so in a larger block each warp's cost is set by the largest idx within that warp.)