Handling of Divergent Control Flow


I’m trying to better understand what strategies GPUs use to handle divergent control flow. Specifically, I have an application where the kernel has a large number of potential branches that can be taken, and I have no expectation that adjacent elements in the input vector will take the same or similar branches. By my understanding, this should be a worst-case scenario for warp divergence. Yet, if I intentionally organize the input vector such that it’s sorted by what branch it will ultimately take, I get only the smallest of speed improvements (and, in both cases, it’s still quite fast). The calculations at the end of each branch are relatively simple, but I would still expect it to run ~32x slower in the randomized-input case.

So, how is it running so quickly? Given how many possible branches there are, I would think predication would be abysmally slow. Is there some sort of pre-checking of the conditional statements to better organize warps? Any insights would be greatly appreciated!

Have you tried using the CUDA profiler to see where the actual bottlenecks are in your code? It would probably also be a good idea to familiarize yourself with the roofline model of performance. Divergent control flow may not be a major detractor from speed-of-light performance, or there may be less actual divergence occurring than you are anticipating.

I hadn’t done that; thanks for the suggestion! After playing around with Nsight Compute for a bit and comparing times and cycle counts for different cases, it looks like there was quite a bit less warp divergence than the worst-case scenario I was picturing. So, as you said—I’ll mark your reply as the solution. Glad to see that having lots of branches can still be fast if the divergence is low.

In general CUDA programmers should not worry about divergence unless the profiler tells them to. Write code in a natural fashion. The compiler and the hardware will take care of many potential divergence issues. Once the profiler complains about divergence sapping performance it is time to re-think one’s algorithmic approach and/or think about HLL source code modifications.