Divergent warps

Hello,
I'm reading "Optimizing Parallel Reduction in CUDA" by Mark Harris
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
and I'm trying to understand why "highly divergent warps are very inefficient". Can you please advise?

The quote refers to the following code:

int tid = threadIdx.x;
for (int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

Thanks

It means that the operations are serialized, so the hardware cannot execute them in a SIMD fashion. With 16 FPUs operating in SIMD (executing one warp of 32 threads in 2 clock cycles), I'd guess complete serialization would give roughly 1/16 of the performance. CUDA gives the impression that each thread is completely scalar, but in fact you often want to avoid branching within a warp too much.

This is one of the downsides of SIMD, but it can often be alleviated, for example by using ternary operators. I also believe the compiler uses branch "predication" for short if/else statements, which supposedly helps.

Thanks!