Are there any branch non-divergence hints for the compiler?


I have some code where I carefully ensure that loops are multiples of the thread count, but the compiler doesn’t know that and it forces divergence/convergence logic to generated where none is needed. Is there any kind of hint for the compiler to avoid this? Something like:

if (__builtin_nondivergent_warp(condition)) ...


if (__builtin_nondivergent_block(condition)) ...


I don’t know of any guarantees provided by the CUDA programming model that indicate that a warp that is not forced to diverge will remain converged.

Thanks for the quick reply! For whatever it may be worth, I can avoid the implicit convergence logic in the SASS output if I manually unroll the loops in question. How worried should I be about instruction fetch performance if the unrolled loop is free of branches?

You might wish to file a bug with your observation then. I would need an example to better understand the situation, and even then I’m not sure I would have any insight.

The GPU should be able to do a good job of fetching straight-line code. Sure, the i$ cache might not be useful for a particular warp (the “leading” warp, I guess) in that scenario, but the GPU is after all a streaming processor. Latency hiding is the name of the game. A programmer that provides the GPU with enough opportunity to hide latency need not be worried about this, I don’t think. And your first order of business is to provide good opportunities to the GPU for latency hiding, not address i$ cache bubbles.

Sure, here is an example with two for loops that should in theory never diverge (as a block or as a warp). The first for loop adds the first half of the inputs and the second for loop multiplies the second half of the inputs. If you look at the SASS output, one can observe two things:

  1. A BSYNC between the two for loops. Why? Everything should implicitly be in sync right? How could anything diverge in this example?
  2. The compiler fails to constant propagate the thread count into an immediate operand to the IADD3 and instead uses the constant cache.
#include <stdint.h>

namespace {

constexpr uint32_t threads = 128;
constexpr uint32_t dataSize = 10 * threads;

testKernel(uint32_t *out, const uint32_t *x, const uint32_t *y) {
    uint32_t i = threadIdx.x;
    for (; i < (dataSize / 2); i += blockDim.x) {
        out[i] = x[i] + y[i];
    for (; i < dataSize; i += blockDim.x) {
        out[i] = x[i] * y[i];


main() {
    uint32_t x[dataSize] = {};
    uint32_t y[dataSize] = {};
    uint32_t result[dataSize];
    testKernel<<<1,threads>>>(result, x, y);
    return 0;

Without knowing what blockDim.x is, it certainly seems to me like you could have warp divergence.

To be clear, when I said file a bug, I meant file a bug.