I'm reading “Optimizing Parallel Reduction in CUDA” by Mark Harris
and I'm trying to understand why “highly divergent warps are very inefficient”. Can you please advise?

The quote refers to the following code:

int tid = threadIdx.x;
for (int s = 1; s < blockDim.x; s *= 2) {


It means that the operations are serialized and you can no longer execute them in SIMD fashion. When the threads of a warp take different branches, the hardware runs each path in turn while the threads that are not on that path sit idle. So with 16 FPUs operating in SIMD (executing one warp of 32 threads in 2 clock cycles), I guess that in the worst case, where all 32 threads of a warp take different paths, you would get roughly 1/32 of the performance. CUDA gives the impression that each thread is an independent scalar thread, but in practice you often want to avoid branching within a warp too much.
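To make the difference concrete, here is a minimal sketch that assumes the loop in the question continues like the first kernel in the slides (a `tid % (2 * s) == 0` test around the addition), since the body is cut off above. The names `kBlockSize`, `reduceDivergent`, `reducePacked` and `reduceOnGpu` are made up for illustration and the host helper is just one way to run it.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // hypothetical block size; must be a power of two

// Assumed form of the loop from the question (interleaved addressing).
__global__ void reduceDivergent(const int* in, int* out) {
    __shared__ int sdata[kBlockSize];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // Active threads are every (2*s)-th one, so every warp that still
        // has work mixes active and idle threads: the branch diverges.
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Same additions, but the work is assigned to the lowest-numbered threads.
__global__ void reducePacked(const int* in, int* out) {
    __shared__ int sdata[kBlockSize];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // Warps are either fully active or fully idle until fewer than
        // 32 threads have work left.
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Host helper: sums kBlockSize ones with one block; returns -1 on a CUDA error.
int reduceOnGpu(bool packed) {
    int* in = nullptr;
    int* out = nullptr;
    if (cudaMallocManaged(&in, kBlockSize * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, sizeof(int)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    for (int i = 0; i < kBlockSize; ++i) in[i] = 1;
    if (packed) {
        reducePacked<<<1, kBlockSize>>>(in, out);
    } else {
        reduceDivergent<<<1, kBlockSize>>>(in, out);
    }
    int result = (cudaDeviceSynchronize() == cudaSuccess) ? *out : -1;
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

Both kernels compute the same sum, so calling `reduceOnGpu(false)` and `reduceOnGpu(true)` from `main` should each return 256 on a machine with a CUDA device. The packed version removes most of the divergence, although the slides point out that this addressing scheme introduces shared memory bank conflicts, which later kernels in the slides address.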

This is one of the downsides of SIMD, but it can often be alleviated, for example by using ternary operators. I also believe the compiler uses branch “predication” for short if/else statements, which supposedly helps.
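As a small illustration of the ternary idea, the sketch below writes the same operation once with an if/else and once with a ternary. The function and kernel names are hypothetical, and this is only an assumption about how you might structure it: the compiler is free to generate identical code for both forms, so inspect the generated code if it matters.

```cuda
#include <cuda_runtime.h>

// Branching form: a short if/else that the compiler may turn into
// predicated instructions instead of a real branch.
__host__ __device__ inline float clampBranch(float x) {
    if (x > 0.0f) {
        return x;
    } else {
        return 0.0f;
    }
}

// Ternary form: both outcomes are cheap values, so this can typically be
// compiled to a select rather than a jump.
__host__ __device__ inline float clampSelect(float x) {
    return x > 0.0f ? x : 0.0f;
}

// Hypothetical kernel using the ternary form on each element.
__global__ void clampKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // bounds check; only the last warp can diverge here
        out[i] = clampSelect(in[i]);
    }
}
```

The two helper functions return the same value for every input, so the choice between them is about what code the compiler emits and not about the result.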