Optimization to Reduce Bank Conflicts Decreases Performance

In my application, I load a value from a large shared memory array into a register.

for (j = 0; j < 10; j++) {
   i = (equation to compute the array index I need);
   x = shared_memory[i];
   // use x to do stuff
}

Since the access pattern on this array is not sequential in thread ID, I believe it is causing bank conflicts. Also, because about half the time the previous iteration of the loop accessed the same array index, roughly half of the shared memory accesses are unnecessary: the correct value is already loaded in x.

It seems like the following code, which only accesses shared memory when the index changes, should be faster:

for (j = 0; j < 10; j++) {
   i = (equation to compute the array index I need);
   if (i != old_i) x = shared_memory[i];
   old_i = i;
   // use x to do stuff
}

However, this does not improve performance. In fact, it makes my application run much slower. Any ideas why?

Potential warp divergence, but you need a profiler to see that.

Also, are you sure there are conflicts in the original? Profile it as well to see how many divergent warps there are.

Smem is fast; this kind of stuff doesn't make much sense unless there are a lot of conflicts.

I can understand how this optimization could provide little or no benefit. However, based on my understanding of divergence, I can’t rationalize how this could increase execution time.

From the CUDA Programming Guide:

If I am interpreting this correctly, then adding a conditional should not increase runtime (except for the marginal amount of time it takes to test the condition). Some threads in the warp enter the block while others do nothing at all. When the block is completed, since there is no else statement, the idle threads start up again.

Can you help me rationalize the increase in runtime? I’m writing a paper on the application of the GPU to a certain class of problems and it would be helpful if I could include an explanation of the optimizations used to achieve my performance claims.

Well, first, that would actually depend on how you define "much".

One reason could be that the second version uses more registers and crosses an occupancy barrier.
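The register-pressure theory is easy to check: the compiler reports per-kernel resource usage when asked. A hedged example, assuming the nvcc toolchain; kernel.cu is a placeholder filename:

```shell
# Ask ptxas to report resource usage for each compiled kernel.
nvcc -Xptxas -v kernel.cu -o app
# ptxas then prints lines along the lines of:
#   ptxas info : Used 18 registers, 2048 bytes smem
```

Comparing the register count of the two versions (and plugging both into the occupancy calculator) would confirm or rule out an occupancy drop.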

But that's all my crystal ball can tell. I can't say much without looking closely at your real code; even the divergence is not certain, since the branch could compile down to a single conditional-move instruction (though how it actually compiles in your case, I don't know).