Question regarding conditionals in kernels

Hey,

I’ve searched around the forums for anything regarding this subject, but nothing turned up. Apologies if it has been posted before!

I have been writing code for physical simulations for a few years, and recently we decided to port our code over to CUDA. The situation is basically that we have a 2D sheet made up of multiple separate cells with different properties, each of which needs a slightly different equation applied to the same variables. In simple pseudocode, something like:

for (x = 0; x < 1000; x++) {
    for (y = 0; y < 1000; y++) {

        if (a[x][y] == 1) b[x][y] = equation1;
        if (a[x][y] == 2) b[x][y] = equation2;

    }
}

Now, as I understand it, this can be easily implemented in CUDA by writing:

__global__ void calc(double *a, double *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (a[idx] == 1) b[idx] = equation1;
    if (a[idx] == 2) b[idx] = equation2;
}
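For reference, I'm planning to launch it over the flattened 1000×1000 grid with something like this (the block size here is just a guess on my part, and d_a/d_b are the device copies of a and b):

int n = 1000 * 1000;                               // flatten the 2D sheet to 1D
int threadsPerBlock = 256;                         // assumed block size
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
calc<<<blocks, threadsPerBlock>>>(d_a, d_b);

Since n won't generally be a multiple of the block size, I'd also guard the kernel body with an if (idx < n) check.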

However, I have heard that using conditionals within a kernel adversely affects performance. In the full code, we have around 10-11 different states rather than the two shown here. My question is: to what degree is performance affected by a statement like this? If conditionals are only going to reduce my speed by a few percent, then I believe it will still be faster than shuffling arrays around in the main code before sending them off to the GPU. However, if this will slow my kernel down considerably…

Thanks, and once again, apologies if this is a silly/newbie question.

Jon

From my limited experience (I’ve spent the last month programming a CUDA FDTD simulation), having 10 conditionals in a kernel is punishable by death. Try using the same equation for every thread and see what happens. This won’t produce physically meaningful results, but it will perform the same number of calculations without thread divergence.
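If you do test it, CUDA events are a simple way to time the two versions against each other; a rough sketch (the grid/block names are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
calc<<<blocks, threadsPerBlock>>>(d_a, d_b);   // branchy version, then the uniform one
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // kernel time in milliseconds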

If all threads in a warp use the same equation, this will be fine; that is the only case where it will be.

But assuming any thread in a warp can go any of 10 ways, what will happen is that the threads which use equation 1 execute first. While this is happening, all other threads in the warp sit completely idle. The warp then moves on to conditional 2, and the threads which use equation 2 update while the rest remain idle, and so on through all 10 branches. Basically, you are going to increase the time to run the kernel by up to a factor of 10.

Somebody feel free to correct me if I’m wrong; this is very important to me. This is the exact issue which is killing my FDTD algorithm right now. Currently I am just forgoing the if statement and updating all threads in the exact same way. This greatly limits shared memory use for me.

Ouch, I was worried that might be the case. I’ll try to test it out over the weekend and see what happens, but if this does happen it likely won’t be viable. Thing is, even if I separate the different types within the main code and then offload each to the GPU in turn, I’ll be calling a kernel 10 times, resulting in a similar situation?
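To be concrete about what I mean by offloading each type in turn, I was picturing something like this on the host side (all the names here are hypothetical, just to sketch the idea):

// Cells pre-sorted by type on the CPU: d_idx holds cell indices,
// grouped so cells of type t occupy [offset[t], offset[t] + count[t])
for (int t = 0; t < 10; t++) {
    int n = count[t];
    int blocks = (n + 255) / 256;
    calcType<<<blocks, 256>>>(d_b, d_idx + offset[t], n, t);   // one equation per launch
}

Each launch would then be branch-free inside the kernel, but it’s 10 launches instead of one, which is what worries me.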

How about a situation like

__global__ void calc(double *b, double *c, double *d)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    b[idx] = c[idx]*equation1 + d[idx]*equation2;
}

Where c, d, etc. are 1 or 0 depending on the type. Do you think the massive volume of extra memory would be worth it to remove the conditionals?
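To put rough numbers on that extra memory (assuming the 1000×1000 grid from my first post and 10 types): ten mask arrays of 10^6 doubles each come to 10 × 8 MB = 80 MB, and every cell update reads 10 mask values, i.e. 80 extra bytes per cell, whereas a single int type array would be 4 MB and one 4-byte read per cell.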

Again, I’ll probably test as much of this as I can this weekend, but it’s always good to throw some ideas out there in case someone can give me a flat-out “no, don’t be stupid”.

Also, cheers for the fast reply!

Jon

Well, it doesn’t look like that kernel does the same thing as the first one you posted, but I’ll ignore that.

What I’ve found is that eliminating if statements from kernels will, in most cases, increase your overall performance, no matter what price this comes at.

In my FDTD code, I can make massive use of shared memory if I put two or three if statements in my kernel, which allow me to load most of my input values from shared memory while a few threads still load from global memory, because those threads have to. Even that significant reduction in global memory use does not come close to offsetting the increase in run time caused by the introduction of the if statements.

The GPU is a vector architecture: if one thread in a warp performs an instruction, then all of them do. Threads that are turned off by a conditional just dump their results and skip the actual loads/stores to shared/global/local memory.
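For a simple two-way case you can sometimes sidestep the branch entirely with a select, which the compiler may turn into predicated instructions (my own sketch; whether it actually predicates is up to the compiler):

int t = (int)a[idx];
double r1 = equation1;                // every thread computes both results
double r2 = equation2;
b[idx] = (t == 1) ? r1 : r2;          // select instead of a divergent branch

With 10 equations this means every thread does 10x the arithmetic, so it only pays off when the equations are cheap.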

If each warp uses just one of the if statements then you get optimal performance (well, you waste a bit of time actually evaluating the if, but you need to do that anyway, and it’s negligible assuming the equations do enough work). If warps don’t diverge too much then you may still gain enough (it depends on whether your kernel is bandwidth-limited or computation-limited). The more you diverge, the more you pay (and the more computationally bound you are, the more you will feel it; i.e. for a completely compute-bound kernel and a fully 8-way divergent warp you will get an eighth of the performance).