Avoiding thread divergence

Hi,

In my code, I have to compute certain things in two different ways, depending on whether the grid index ‘i’ is odd or even. So the kernel function has a structure something like this:

some_kernel (){
  i = threadindex;
  if(i%2 == 0){
     x = some formula;
  }else{
     x = some other formula;
  }
}

Clearly there is thread divergence in the above function. If I rewrite the function this way:

some_kernel (){
     i = threadindex;
     x = (1-i%2)*(some formula) + (i%2)*(some other formula);
}

Will this avoid thread divergence and improve performance? If not, is there a better alternative for avoiding thread divergence in such scenarios?

Thanks,
-Pranav

Presumably the thread divergence (i.e. the need to do something different based on index) comes about due to data organization. The typical suggestion is to re-organize the underlying data to have logical/decision breakpoints at least on warp boundaries, if not larger.

Neither the first nor the second realization you have shown is guaranteed to have divergence, because the compiler has an additional tool called predication (which it will use aggressively) to avoid divergence. Without running the actual code through the compiler and inspecting the results at the machine code (SASS) level, it’s impossible to say whether either realization would lead to truly divergent code or just predicated code (or both).
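To make the "inspect the SASS" suggestion concrete, here is a minimal sketch of the first realization as a complete program. The kernel name, the file name `parity.cu`, and the two formulas (`2.0f * in[i]` and `in[i] + 1.0f`) are placeholders I made up for illustration; they are not from the original post. Assuming the file is built with `nvcc -o parity parity.cu`, running `cuobjdump -sass parity` should dump the machine code. Predicated instructions usually show up with a guard prefix such as `@P0` or `@!P0`, while real branches appear as `BRA`. What you actually see will depend on the target architecture and compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; the two formulas are placeholders, not from the original post.
__global__ void parity_branch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x;
    if (i % 2 == 0) {
        x = 2.0f * in[i];   // stands in for "some formula"
    } else {
        x = in[i] + 1.0f;   // stands in for "some other formula"
    }
    out[i] = x;
}

// Runs the kernel on 64 elements and returns true if the
// results match a CPU reference.
bool run_parity_branch()
{
    const int n = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    parity_branch<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        float expected = (i % 2 == 0) ? 2.0f * h_in[i] : h_in[i] + 1.0f;
        if (h_out[i] != expected) return false;
    }
    return true;
}

int main()
{
    printf("%s\n", run_parity_branch() ? "results match" : "mismatch");
    return 0;
}
```

With formulas this short, the compiler has every reason to avoid a branch, so a real kernel with longer code paths may compile quite differently.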

It’s generally not difficult to benchmark CUDA codes. You could try both realizations and see which one is faster. But data reorganization to arrange breakpoints at least on warp boundaries may give some improvement.
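A benchmark along these lines could be sketched as follows, using CUDA events for timing. This is only an illustration: the kernel names, the placeholder formulas, the problem size, and the repetition count are all assumptions of mine, and a serious comparison would use the real formulas and data.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Realization 1: if/else on the parity of the index (placeholder formulas).
__global__ void with_branch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) out[i] = 2.0f * in[i];
    else            out[i] = in[i] + 1.0f;
}

// Realization 2: arithmetic blend; both formulas are evaluated by every thread.
__global__ void with_blend(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int odd = i % 2;
    out[i] = (1 - odd) * (2.0f * in[i]) + odd * (in[i] + 1.0f);
}

using KernelFn = void (*)(const float*, float*, int);

// Times `reps` launches with CUDA events; returns the average ms per launch.
float time_kernel(KernelFn k, const float* in, float* out, int n, int reps)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    k<<<blocks, threads>>>(in, out, n);          // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) k<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Times both kernels and returns true if their outputs agree element by element.
bool run_benchmark()
{
    const int n = 1 << 20, reps = 100;
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1000);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    printf("branch: %f ms per launch\n", time_kernel(with_branch, d_in, d_a, n, reps));
    printf("blend:  %f ms per launch\n", time_kernel(with_blend, d_in, d_b, n, reps));

    cudaMemcpy(h_a.data(), d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);

    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) return false;
    return true;
}

int main()
{
    printf("%s\n", run_benchmark() ? "outputs agree" : "outputs differ");
    return 0;
}
```

For kernels this simple, the run time is likely dominated by memory traffic, so the two timings may come out nearly identical. The difference should become more visible as the formulas get more expensive.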

I didn’t know about the “predication” tool that you just mentioned, and I assumed that thread divergence would occur (maybe because that is how I read it).

Thanks, I will do some simple benchmarks to check if anything interesting happens!

Thank you,
Pranav

The second one will obviously compute both formulas!
You may consider partitioning by even/odd warps.
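One way to read the even/odd warp suggestion is an index remapping in which each pair of warps covers 2*warpSize consecutive elements: the even warp takes the even indices and the odd warp takes the odd ones. The sketch below is my own interpretation, not something spelled out in this thread. It assumes a 1D block whose size is a multiple of 2*warpSize, and the formulas are again placeholders.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the parity decision is uniform within each warp.
// Assumes blockDim.x is a multiple of 2 * warpSize.
__global__ void parity_by_warp(const float* in, float* out, int n)
{
    int warp   = threadIdx.x / warpSize;  // warp index within the block
    int lane   = threadIdx.x % warpSize;  // lane index within the warp
    int parity = warp % 2;                // identical for all lanes of a warp

    // Each pair of warps covers 2 * warpSize consecutive elements.
    int base = blockIdx.x * blockDim.x + (warp / 2) * 2 * warpSize;
    int i = base + 2 * lane + parity;
    if (i >= n) return;

    if (parity == 0) out[i] = 2.0f * in[i];   // even elements (placeholder formula)
    else             out[i] = in[i] + 1.0f;   // odd elements (placeholder formula)
}

// Runs the kernel and returns true if the results match a CPU reference.
bool run_parity_by_warp()
{
    const int n = 1000, threads = 128;        // 128 is a multiple of 64
    const int blocks = (n + threads - 1) / threads;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    parity_by_warp<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        float expected = (i % 2 == 0) ? 2.0f * h_in[i] : h_in[i] + 1.0f;
        if (h_out[i] != expected) return false;
    }
    return true;
}

int main()
{
    printf("%s\n", run_parity_by_warp() ? "results match" : "mismatch");
    return 0;
}
```

The trade-off is that each warp now reads and writes with a stride of two elements, which tends to use memory bandwidth less efficiently than contiguous access. Physically reordering the data so that all even elements are stored together would avoid that, at the cost of changing the data layout.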

I agree that it’ll compute both formulas, but all threads will follow the same path (meaning the second case isn’t diverging?).

Sorry for the probably silly questions; I’m quite new to CUDA and have been using it for only about a week.

Pranavladkat,

What you ask is a very good question.
From my experience, I have achieved a small speedup in many cases by replacing an if-else statement with a math formula. However, you should be careful to avoid the modulus (%) operator, since it is very expensive.

As a simple rule, whenever I can write a math formula using just binary operators or simple arithmetic such as multiplication and addition, I go that way; otherwise I consider the if-else statement, because branch predication can provide the needed performance.
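For the odd/even case in this thread, the parity can be taken with a bitwise AND instead of %, and the result blended with a multiply and an add. The following is a small sketch of that idea with made-up placeholder formulas. For a constant divisor of 2, compilers typically reduce the modulus to cheap bit operations on their own, so the explicit `& 1` may not change the generated code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: parity via bitwise AND, then an arithmetic blend.
__global__ void parity_bitwise(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float w = static_cast<float>(i & 1);   // 0.0f for even i, 1.0f for odd i
    float a = 2.0f * in[i];                // placeholder for "some formula"
    float b = in[i] + 1.0f;                // placeholder for "some other formula"
    out[i] = (1.0f - w) * a + w * b;       // selects a or b without a branch
}

// Runs the kernel on 256 elements and returns true if the
// results match a CPU reference.
bool run_parity_bitwise()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    parity_bitwise<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        float expected = (i % 2 == 0) ? 2.0f * h_in[i] : h_in[i] + 1.0f;
        if (h_out[i] != expected) return false;
    }
    return true;
}

int main()
{
    printf("%s\n", run_parity_bitwise() ? "results match" : "mismatch");
    return 0;
}
```

One caveat with any blend of this kind: both formulas are always evaluated, so if the unused one produces Inf or NaN, multiplying it by zero yields NaN rather than hiding it.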