Question regarding conditionals in kernels

Hey,

I’ve searched around the forums for anything regarding this subject, but nothing turned up. Apologies if it has been posted before!

I have been writing code for physical simulations for a few years, and recently we decided to port our code over to CUDA. The situation is basically that we have a 2D sheet made up of multiple separate cells with different properties, each of which needs a slightly different equation applied to the same variables. In simple pseudocode, something like:

for (x = 0; x < 1000; x++) {
    for (y = 0; y < 1000; y++) {

        if (a[x][y] == 1) b[x][y] = equation1;
        if (a[x][y] == 2) b[x][y] = equation2;

    }
}

Now, as I understand it, this can be easily implemented in CUDA by writing:

__global__ void calc(double *a, double *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (a[idx] == 1) b[idx] = equation1;
    if (a[idx] == 2) b[idx] = equation2;
}
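For reference, I'm planning to launch it over the flattened 1000×1000 grid with something like this (the block size here is just a guess on my part, and d_a/d_b are the device copies of a and b):

int n = 1000 * 1000;                               // flatten the 2D sheet to 1D
int threadsPerBlock = 256;                         // assumed block size
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
calc<<<blocks, threadsPerBlock>>>(d_a, d_b);

Since n won't generally be a multiple of the block size, I'd also guard the kernel body with an if (idx < n) check.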

However, I have heard that using conditionals within a kernel adversely affects performance. In the full code, we have around 10-11 different states rather than the two shown here. My question is: to what degree is performance affected by a statement like this? If conditionals are only going to reduce my speed by a few percent, then I believe it will still be faster than shuffling arrays around in the main code before sending them off to the GPU. However, if this will slow my kernel down considerably…

Thanks, and once again, apologies if this is a silly/newbie question.

Jon

From my limited experience (I’ve spent the last month programming a CUDA FDTD simulation), having 10 conditionals in a kernel is punishable by death. Try using the same equation for every thread and see what happens. This won’t produce physically meaningful results, but it will perform the same number of calculations without thread divergence.
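If you do test it, CUDA events are a simple way to time the two versions against each other; a rough sketch (the grid/block names are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
calc<<<blocks, threadsPerBlock>>>(d_a, d_b);   // branchy version, then the uniform one
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // kernel time in milliseconds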

If all threads in a warp use the same equation, this will be fine; that is the only case where it will be.

But assuming any thread in a warp can go any of 10 ways, what will happen is that the threads which use equation 1 execute first. While this is happening, all other threads in the warp sit completely idle. The warp then moves on to conditional 2, and the threads which use equation 2 update while the rest remain idle, and so on through all 10 branches. Basically, you are going to increase the time to run the kernel by up to a factor of 10.

Somebody feel free to correct me if I’m wrong; this is very important to me. This is the exact issue which is killing my FDTD algorithm right now. Currently I am just forgoing the if statement and updating all threads in the exact same way. This greatly limits shared memory use for me.

Ouch, I was worried that might be the case. I’ll try to test it out over the weekend and see what happens, but if this does happen it likely won’t be viable. Thing is, even if I separate the different types within the main code and then offload each to the GPU in turn, I’ll be calling a kernel 10 times, resulting in a similar situation?
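To be concrete about what I mean by offloading each type in turn, I was picturing something like this on the host side (all the names here are hypothetical, just to sketch the idea):

// Cells pre-sorted by type on the CPU: d_idx holds cell indices,
// grouped so cells of type t occupy [offset[t], offset[t] + count[t])
for (int t = 0; t < 10; t++) {
    int n = count[t];
    int blocks = (n + 255) / 256;
    calcType<<<blocks, 256>>>(d_b, d_idx + offset[t], n, t);   // one equation per launch
}

Each launch would then be branch-free inside the kernel, but it’s 10 launches instead of one, which is what worries me.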

How about a situation like

__global__ void calc(double *b, double *c, double *d)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    b[idx] = c[idx]*equation1 + d[idx]*equation2;
}

Where c, d, etc. are 1 or 0 depending on the type. Do you think the massive volume of extra memory would be worth it to remove the conditionals?
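To put rough numbers on that extra memory (assuming the 1000×1000 grid from my first post and 10 types): ten mask arrays of 10^6 doubles each come to 10 × 8 MB = 80 MB, and every cell update reads 10 mask values, i.e. 80 extra bytes per cell, whereas a single int type array would be 4 MB and one 4-byte read per cell.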

Again, I’ll probably test as much of this as I can this weekend, but it’s always good to throw some ideas out there in case someone can give me a flat-out “no, don’t be stupid”.

Also, cheers for the fast reply!

Jon

Well, it doesn’t look like that kernel does the same thing as the first one you posted, but I’ll ignore that.

What I’ve found is that eliminating if statements from kernels will, in most cases, increase your overall performance, no matter what price this comes at.

In my FDTD code, I can make massive use of shared memory if I put two or three if statements in my kernel, which allow me to load most of my input values from shared memory while a few threads still load from global memory, because those threads have to. Even that significant reduction in global memory use does not come close to offsetting the increase in run time caused by the introduction of the if statements.

The GPU is a vector architecture: if one thread in a warp performs an instruction, then all of them do. Threads that are turned off by a conditional just dump their results and skip the actual loads/stores to shared/global/local memory.
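For a simple two-way case you can sometimes sidestep the branch entirely with a select, which the compiler may turn into predicated instructions (my own sketch; whether it actually predicates is up to the compiler):

int t = (int)a[idx];
double r1 = equation1;                // every thread computes both results
double r2 = equation2;
b[idx] = (t == 1) ? r1 : r2;          // select instead of a divergent branch

With 10 equations this means every thread does 10x the arithmetic, so it only pays off when the equations are cheap.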

If each warp uses just one of the if statements then you get optimal performance (well, you waste a bit of time actually evaluating the if, but you need to do that anyway, and it’s negligible assuming the equations do enough work). If warps don’t diverge too much then you may still gain enough (it depends on whether your kernel is bandwidth-limited or computation-limited). The more you diverge, the more you pay (and the more computationally bound you are, the more you will feel it; i.e. for a completely compute-bound kernel and a fully 8-way divergent warp you will get an eighth of the performance).