How to reduce branch divergence?

Hello.
I have C code that I'm parallelizing with OpenACC (target: NVIDIA GPU).
The code contains many conditional instructions (i.e. if-else).
These create a lot of branches, and they cause an 'inactive' CUDA thread status for over 80% of the total running time.
I've also heard that NVIDIA GPUs have no branch prediction or speculative execution like CPUs do.
So I'm looking for optimization methods to reduce branch divergence and improve code locality.

I think there are two ways.

  1. Use CUDA grid, block, and thread indices in my code to group related conditions together.
    Can I determine the block and thread index at the source-code level with PGI OpenACC?
    If I try this method, I need to be able to choose the block and thread index based on the condition, e.g.:
       if (a == 0) { /* run on blockIdx.x == 10, threadIdx.x == 10 */ }
       else        { /* run on blockIdx.x == 20, threadIdx.x == 20 */ }
  2. Optimize the code at compile time.
    Are there compiler switches that optimize conditional code by grouping it together?
    (e.g. GCC's -freorder-blocks-and-partition switch)
    I'm compiling with -O3 now, but profiling with nvvp shows a lot of conditional execution, which leads to the 'inactive' thread status.



    Thanks as always for your help. :)

Can I determine the block and thread index at the source-code level with PGI OpenACC?

No, and even if we did, I'd highly recommend not using them. The main benefit of OpenACC is performance portability across multiple accelerators; putting in non-portable, target-specific API calls would defeat that benefit.

If this is something you really need to do, I'd suggest writing this particular kernel in CUDA C. OpenACC is interoperable with CUDA C, so the two work well together.
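For reference, a minimal sketch of what that interoperability can look like; launch_branchy_kernel is a hypothetical wrapper around a CUDA C kernel, compiled separately with nvcc and linked in:

    /* main.c -- compiled with the PGI compiler, e.g. pgcc -acc */
    extern void launch_branchy_kernel(float *d_x, int n);  /* hypothetical CUDA wrapper */

    void compute(float *x, int n)
    {
        #pragma acc data copy(x[0:n])
        {
            /* ... other OpenACC kernels operating on x ... */

            /* Pass the device copy of x to the CUDA C wrapper for the
               divergence-heavy part of the computation. */
            #pragma acc host_data use_device(x)
            {
                launch_branchy_kernel(x, n);
            }
        }
    }

The CUDA side receives a plain device pointer, so that kernel is free to use blockIdx/threadIdx directly.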

Are there compiler switches that optimize conditional code by grouping it together?

Branch optimizations are enabled by default, but we don't provide compiler flags to enable or disable them individually.


So what can you do here? Branching can be a major bottleneck on a GPU due to branch divergence. Threads in a warp execute in SIMT (single instruction, multiple threads) fashion, so every thread in a warp must follow the same instruction stream. If one or more threads in a warp take a different branch, the warp executes every taken branch and each thread simply ignores the instructions on the paths it didn't take. If all threads take the same branch, there's no penalty. If they all take different branches, the code can be as much as 32x slower.
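To make the penalty concrete, here is a small sketch (the function and array names are made up): in the first loop the condition alternates from one thread to the next, so every warp executes both branches; in the second the condition changes only once over the whole index range, so nearly every warp takes a single path.

    /* Worst case: even and odd iterations sit in the same warp,
       so each warp has to walk through both branches. */
    void divergent(float *restrict x, const float *restrict y, int n)
    {
        #pragma acc parallel loop copy(x[0:n]) copyin(y[0:n])
        for (int i = 0; i < n; ++i) {
            if (i % 2 == 0) x[i] = y[i] * 2.0f;
            else            x[i] = y[i] * 0.5f;
        }
    }

    /* Same two branches, but the condition is uniform within almost
       every warp, so divergence is negligible. */
    void uniform(float *restrict x, const float *restrict y, int n)
    {
        #pragma acc parallel loop copy(x[0:n]) copyin(y[0:n])
        for (int i = 0; i < n; ++i) {
            if (i < n / 2) x[i] = y[i] * 2.0f;
            else           x[i] = y[i] * 0.5f;
        }
    }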

From the compiler's perspective, there's not much that can be done about branch divergence. The compiler has no way of knowing at compile time which thread will take which branch (and it can change depending on the data set). Optimizations such as branch prediction can help order the branches so that the ones taken most often are checked first, but this doesn't help much with heavily divergent threads (and to be accurate it really requires profile-guided feedback, which isn't available on the GPU).

Really, this is an algorithmic issue. Can you reorganize your code to either reduce the number of branches, or ensure that consecutive loop iterations take the same branch?
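One way to do that, sketched below with made-up names: when the branch depends on the data rather than on the loop index, partition the indices by case beforehand (on the host or in a separate pass) and run one branch-free loop per case.

    /* Original form: neighbouring threads may take different paths,
       because the branch depends on flag[i]. */
    void mixed(float *restrict x, const int *restrict flag, int n)
    {
        #pragma acc parallel loop copy(x[0:n]) copyin(flag[0:n])
        for (int i = 0; i < n; ++i) {
            if (flag[i]) x[i] = x[i] * x[i];
            else         x[i] = 0.0f;
        }
    }

    /* Restructured form: idx_a holds the indices with the flag set,
       idx_b the rest; each loop now executes a single path. */
    void split(float *restrict x, int n,
               const int *restrict idx_a, int na,
               const int *restrict idx_b, int nb)
    {
        #pragma acc parallel loop copy(x[0:n]) copyin(idx_a[0:na])
        for (int i = 0; i < na; ++i)
            x[idx_a[i]] = x[idx_a[i]] * x[idx_a[i]];

        #pragma acc parallel loop copy(x[0:n]) copyin(idx_b[0:nb])
        for (int i = 0; i < nb; ++i)
            x[idx_b[i]] = 0.0f;
    }

Whether the extra pass that builds the index lists pays off depends on how expensive the branch bodies are, so it's worth profiling both versions.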

  • Mat

It's highly likely that you can recover your block and thread indices from your loop index variables.

Write an outer loop and force it to be mapped to gangs, explicitly specifying the gang size. Inside the gang loop, write another loop and force it to be mapped to the vector level (the CUDA thread level), so that you have a direct correlation between OpenACC loop levels and the CUDA grid. The index of the gang loop is then the block index, and the index of the vector loop is the thread index. This is equivalent to a 1D CUDA grid arrangement.
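A minimal sketch of that mapping; the gang count and vector length below are arbitrary assumptions, and x is assumed to hold NUM_GANGS * VEC_LEN elements:

    #define NUM_GANGS 1024   /* plays the role of gridDim.x  */
    #define VEC_LEN    128   /* plays the role of blockDim.x */

    void gang_vector_mapping(float *restrict x)
    {
        #pragma acc parallel loop gang num_gangs(NUM_GANGS) vector_length(VEC_LEN) \
                copy(x[0:NUM_GANGS*VEC_LEN])
        for (int b = 0; b < NUM_GANGS; ++b) {        /* b ~ blockIdx.x  */
            #pragma acc loop vector
            for (int t = 0; t < VEC_LEN; ++t) {      /* t ~ threadIdx.x */
                int i = b * VEC_LEN + t;             /* global index    */
                /* Branch on the gang index, so every thread in a gang
                   (and therefore in a warp) takes the same path. */
                if (b % 2 == 0) x[i] = x[i] * 2.0f;
                else            x[i] = x[i] * 0.5f;
            }
        }
    }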

Index recovery is perfectly legal in OpenACC and is actually very useful for reducing warp divergence. However, it does make your code less portable, as it may not run very efficiently on other architectures with different grid arrangements.

To make this scheme slightly more portable, I prefer to collapse all loops into a single one and do the index recovery inside it, instead of writing two loops and explicitly mapping them to the gang and vector levels.
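A sketch of the collapsed variant (sizes and names are illustrative, and it assumes the compiler assigns consecutive iterations of the vector loop to consecutive threads, which the PGI compiler normally does):

    #define VEC_LEN 128   /* assumed vector (thread-block) width */

    void collapsed_index_recovery(float *restrict x, int n)
    {
        /* n is assumed to be a multiple of VEC_LEN for simplicity. */
        #pragma acc parallel loop gang vector vector_length(VEC_LEN) copy(x[0:n])
        for (int i = 0; i < n; ++i) {
            int b = i / VEC_LEN;   /* recovered "block" index  */
            int t = i % VEC_LEN;   /* recovered "thread" index */
            /* Branch on the recovered block index so that iterations
               mapped to the same block take the same path. */
            if (b % 2 == 0) x[i] = x[i] + (float)t;
            else            x[i] = x[i] - (float)t;
        }
    }

On other targets the same single loop still runs correctly; only the assumption about which iterations share a warp changes.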