Avoiding thread divergence

pranavladkat · December 11, 2014, 10:50pm

Hi,

In my code, I have to compute certain quantities in two different ways, depending on whether the grid index ‘i’ is odd or even. The kernel has a structure something like this:

some_kernel (){
  i = threadindex;
  if(i%2 == 0){
     x = some formula;
  }else{
     x = some other formula;
  }
}

Clearly there is thread divergence in the above function. If I rewrite it this way:

some_kernel (){
     i = threadindex;
     x = (1-i%2)*(some formula) + (i%2)*(some other formula);
}

Will this avoid thread divergence and improve performance? If not, is there a better alternative for avoiding thread divergence in such scenarios?

Thanks,
-Pranav

Robert_Crovella · December 12, 2014, 12:03am

Presumably the thread divergence (i.e. the need to do something different based on index) comes about due to data organization. The typical suggestion is to re-organize the underlying data to have logical/decision breakpoints at least on warp boundaries, if not larger.

Neither the first nor the second realization you have shown is guaranteed to have divergence, because the compiler has an additional tool called predication (which it will use aggressively) to avoid divergence. Without running the actual code through the compiler and inspecting the results at the machine code (SASS) level, it’s impossible to say whether either realization would lead to truly divergent code or just predicated code (or both).

It’s generally not difficult to benchmark CUDA codes. You could try both realizations and see which one is faster. But data reorganization to arrange breakpoints at least on warp boundaries may give some improvement.
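As a rough illustration of the "try both and see" advice, here is a minimal benchmarking sketch using CUDA events. The two formulas are hypothetical stand-ins for "some formula" and "some other formula" from the original question, and the names (formula_even, formula_odd, report_timings) are made up for this example. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholders for the two formulas in the question.
__host__ __device__ inline float formula_even(int i) { return 2.0f * i + 1.0f; }
__host__ __device__ inline float formula_odd(int i)  { return 0.5f * i - 3.0f; }

// Realization 1: if/else on the parity of the index.
__host__ __device__ inline float branch_value(int i) {
    if (i % 2 == 0) return formula_even(i);
    return formula_odd(i);
}

// Realization 2: arithmetic blend; both formulas are always evaluated.
__host__ __device__ inline float blend_value(int i) {
    const int odd = i % 2;  // i is non-negative here, so odd is 0 or 1
    return (1 - odd) * formula_even(i) + odd * formula_odd(i);
}

__global__ void branch_kernel(float *x, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = branch_value(i);
}

__global__ void blend_kernel(float *x, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = blend_value(i);
}

// Returns the average time in milliseconds of one launch of kernel k.
template <typename Kernel>
float time_kernel(Kernel k, float *d_x, int n, int reps) {
    const int block = 256;
    const int grid = (n + block - 1) / block;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<grid, block>>>(d_x, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) k<<<grid, block>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Allocates a scratch buffer and prints the timing of both realizations.
void report_timings(int n, int reps) {
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    printf("branch: %.4f ms\n", time_kernel(branch_kernel, d_x, n, reps));
    printf("blend:  %.4f ms\n", time_kernel(blend_kernel, d_x, n, reps));
    cudaFree(d_x);
}
```

Calling report_timings(1 << 24, 100) from main would print one timing per realization. For formulas this cheap the kernel is likely limited by memory bandwidth, so the two numbers may come out nearly identical; the comparison is more telling with the real formulas substituted in. Note also that the blend is only equivalent to the branch when both formulas produce finite values, since 0 * inf is NaN.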

pranavladkat · December 12, 2014, 12:13am

I didn’t know about the “predication” tool that you just mentioned, and I assumed that thread divergence would occur (maybe because that is how I read it).

Thanks, I will do some simple benchmarks to check if anything interesting happens!

Thank you,
Pranav

Vectorizer · December 12, 2014, 12:24am

The second one will obviously compute both formulas!
You may consider partitioning by even/odd warps.
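One possible reading of the even/odd warp suggestion is sketched below: remap the indices so that even-numbered warps handle only even elements and odd-numbered warps handle only odd elements. The parity test is then uniform within each warp. This assumes a warp size of 32, a 1D launch, and a block size that is a multiple of 64; element_for_thread and the two formulas are hypothetical names introduced for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholders for the two formulas in the question.
__host__ __device__ inline float formula_even(int i) { return 2.0f * i + 1.0f; }
__host__ __device__ inline float formula_odd(int i)  { return 0.5f * i - 3.0f; }

// Maps a global thread id g to an element index so that each pair of
// warps (2p, 2p+1) covers elements [64p, 64p+63]: the even warp takes
// the even elements and the odd warp takes the odd ones.
// Assumes a warp size of 32.
__host__ __device__ inline int element_for_thread(int g) {
    const int warp = g / 32;
    const int lane = g % 32;
    return (warp / 2) * 64 + 2 * lane + (warp & 1);
}

// Assumes blockDim.x is a multiple of 64 and the grid covers at least n threads.
__global__ void warp_partitioned_kernel(float *x, int n) {
    const int g = blockIdx.x * blockDim.x + threadIdx.x;
    const int i = element_for_thread(g);
    if (i >= n) return;
    // The parity of i equals (warp & 1), so every thread in a warp
    // takes the same side of this branch.
    if ((i & 1) == 0) {
        x[i] = formula_even(i);
    } else {
        x[i] = formula_odd(i);
    }
}
```

The trade-off is that each warp now accesses memory with a stride of two elements, so loads and stores are less well coalesced than in the original layout. Whether the remapping is a net win depends on how expensive the two formulas are relative to the memory traffic, which is again something to benchmark.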

pranavladkat · December 12, 2014, 1:41am

I agree that it’ll compute both formulas, but all threads will follow the same path (meaning the second case isn’t diverging?).

Sorry for the probably silly questions; I’m quite new to CUDA and have been using it for only ~1 week.

neoideo · December 12, 2014, 12:13pm

Pranavladkat,

What you ask is a very good question.
In my experience, I have in many cases achieved a small speedup by replacing an if-else statement with a math formula. However, you should be careful to avoid the modulus (%) operator, since it is very expensive.

As a simple rule, whenever I can write a math formula using just binary operators or simple arithmetic such as multiplication and addition, I go that way; otherwise I consider the if-else statement, because branch predication can provide the needed performance.
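For the odd/even case in this thread, the parity test can be written with a bitwise AND instead of the modulus operator. A minimal sketch follows, with hypothetical formulas and names (select_by_parity, parity_kernel) introduced only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholders for the two formulas in the question.
__host__ __device__ inline float formula_even(unsigned int i) { return 2.0f * i + 1.0f; }
__host__ __device__ inline float formula_odd(unsigned int i)  { return 0.5f * i - 3.0f; }

// Evaluates both formulas, then picks one based on the lowest bit of i.
// A ternary on already-computed values is a natural candidate for a
// select or predicated instruction, though that is up to the compiler.
__host__ __device__ inline float select_by_parity(unsigned int i) {
    const float even_val = formula_even(i);
    const float odd_val  = formula_odd(i);
    return (i & 1u) ? odd_val : even_val;
}

__global__ void parity_kernel(float *x, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = select_by_parity(i);
}
```

For a constant power-of-two divisor such as 2, an optimizing compiler will typically reduce the modulus to the same bitwise operation on its own, especially for unsigned operands, so the cost concern applies mainly to divisors that are not known at compile time or are not powers of two. Unlike the multiply-and-add blend, the select form also stays correct if the unused formula happens to produce inf or NaN.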
