How much performance loss does if (idx < n) { ... } incur?

Hey guys,

I’ve been reading more about branch divergence in GPU computing, and one thing I have to ask: is it performant to write code like this?

__global__
void kernel(float *a, float *b, float *c, int n) {

     int i = blockIdx.x * blockDim.x + threadIdx.x;

     if (i >= n)          // guard: the last block may extend past the end of the arrays
          return;

     c[i] = a[i] * b[i];
}

And if I do incur a performance loss, how do I remedy it?

I saw one random thing on the internet that said I should instead use this:

int i = blockIdx.x * blockDim.x + threadIdx.x;

i = max(i, 0);
i = min(i, n - 1);   // as quoted it was min(i, n), which still indexes out of bounds

But I’m not sure about that, because it seems like all the out-of-range threads would be clamped to the same index, so I’d be doing the same calculation on certain elements several times at once?

Your top code snippet is pretty common – at least in my kernels. It’s performant.

The use case is when you have a grid made up of small blocks (2 or 4 warps) with the “last” block only partially covering the input array (‘n’). If your kernels are thread-oriented, just test and exit. If they’re warp-oriented, just exit any warps that have nothing to do (see the sketch below).
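One possible shape for that warp-oriented exit, inside a kernel like the one above (a sketch, my code, not the poster’s): a warp whose lowest index is already past n has nothing to do and can leave uniformly, with no divergence; only the one partially covered warp still needs its per-thread test.

     int warpBase = blockIdx.x * blockDim.x + (threadIdx.x & ~31);  // first index owned by this warp

     if (warpBase >= n)   // whole warp out of range: all 32 lanes exit together
          return;

     int i = warpBase + (threadIdx.x & 31);

     if (i < n)           // only the boundary warp ever diverges here
          c[i] = a[i] * b[i];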

I don’t like that second snippet at all for the reasons you note.

The top code should be fine.

There are two possible sources of performance loss in the top code.

  1. The check itself isn’t free, and it is executed by every thread. Sometimes people work around this by hoisting it out of the kernel: remove the check from the kernel, then launch two kernels, the first covering only the fully populated blocks and the second covering the remaining elements (see the sketch after this list).

  2. Threads that exit early potentially sit idle until the other threads in the same warp finish. There isn’t much you can do about this, and it isn’t any worse than the bottom code you have.
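A minimal sketch of that split launch, assuming an element-wise multiply like the original kernel (the function names are mine, not from the thread): the first kernel covers only fully populated blocks and carries no check; a second, single-block launch mops up the tail.

__global__
void multiplyFull(const float *a, const float *b, float *c) {

     int i = blockIdx.x * blockDim.x + threadIdx.x;   // always < n by construction

     c[i] = a[i] * b[i];
}

__global__
void multiplyTail(const float *a, const float *b, float *c, int offset, int n) {

     int i = offset + blockIdx.x * blockDim.x + threadIdx.x;

     if (i < n)                                       // the check survives only here
          c[i] = a[i] * b[i];
}

void launch(const float *a, const float *b, float *c, int n) {

     const int block = 256;
     int fullBlocks  = n / block;                     // blocks with no partial coverage
     int tail        = n - fullBlocks * block;

     if (fullBlocks > 0)
          multiplyFull<<<fullBlocks, block>>>(a, b, c);

     if (tail > 0)                                    // one extra block for the remainder
          multiplyTail<<<1, block>>>(a, b, c, fullBlocks * block, n);
}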

Ooh, I really like your first suggestion. I think I’ll do that so there’s zero branching. Yay for asynchronous kernel launches!

Okay, I have a question about branching and warps.

I’m going to assume that it is possible to use if-statements effectively in CUDA provided that certain considerations are, well, taken into consideration.

Let’s say we have an if-else statement so obviously, we have two paths.

Now, the instant a thread hits this, it has to make a choice. What I’m assuming happens is that two separate warp instances are created (though not necessarily executed): warp_1 and warp_2, where warp_1 is the successful “if” and warp_2 is the “else”. For maximum efficiency, it’s ideal that we fill up both of these warps, right? Like, one thread hits this statement and is transferred to warp_1, another hits it and is transferred to warp_2, and then, once all the threads are done, the warps are executed serially.

My biggest question is, I’ve written code like this :

#include <thrust/device_vector.h>

__global__
void kernel1(int *x) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i % 2 == 0) {

        x[i] *= 2;      // even-indexed threads take this path
    } else {

        x[i] += 2;      // odd-indexed threads take this one
    }
}

int main(void) {

    thrust::device_vector<int> x(128, 1);

    kernel1<<<2, 64>>>(thrust::raw_pointer_cast(x.data()));
    cudaDeviceSynchronize();    // wait for the kernel before x is destroyed

    return 0;
}

and both the branch and warp efficiencies come out as 100%, which is why I think the way I do. Correct me if I’m wrong.
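For context, efficiency numbers like these are typically read from nvprof metrics, with something like the following (./app standing in for the actual binary):

    nvprof --metrics branch_efficiency,warp_execution_efficiency ./app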

I’ve been brushing up and adding to my knowledge using various books, and right now I happen to be covering the number-one problem of warp divergence. I’ve chosen this thread to continue the discussion and put people’s questions to rest. So let’s build it up.

John, if you are still wondering/reading: the reason you get this perfect result is that the branch has been predicated by the compiler; I think if you look at the PTX you will see the predicate. If the boolean condition is attached to a predicate p, then both the p and !p paths are scheduled for execution, and only the lanes whose predicate is true write their result. That is why you get 100% efficiency from this optimization, while the true efficiency would otherwise have been 50%. In the next post we will define the efficiency and see what we should expect as the theoretical efficiency in two key examples. I hope some expert will assist here as well.
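To make that concrete, here is roughly what predication amounts to, written back at the CUDA level (a sketch of the effect, not literal compiler output): both sides are issued for the whole warp and the predicate selects what is kept, so no lane ever takes a branch.

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    bool p  = (tid % 2 == 0);       // the predicate the compiler extracts

    float ia = p ? 100.0f : 0.0f;   // "then" side, kept only where p is true
    float ib = p ? 0.0f : 200.0f;   // "else" side, kept only where p is false

    c[tid] = ia + ib;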

What I do not like in books is when they do not write the formula out in numbers: they just define it in words, print what the tool reported on their system, and then explain that result. They are very good books, but at some points they get terrible.

So let’s define Branch Efficiency = (#Branches - #Divergent Branches) / #Branches, a number in [0, 1].

I want someone who fully understands the formula to write it out in numbers for these two examples (I know the first but am not certain about the second):

__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if (tid % 2 == 0)
    {
        ia = 100.0f;
    }
    else
    {
        ib = 200.0f;
    }

    c[tid] = ia + ib;
}

And the second code is:

__global__ void mathKernel3(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    bool ipred = (tid % 2 == 0);

    if (ipred)
    {
        ia = 100.0f;
    }

    if (!ipred)
    {
        ib = 200.0f;
    }

    c[tid] = ia + ib;
}

Let’s take the book’s setup as well and launch 1 block with 64 threads, so there is something for one SM to do. Please, someone who is 100% certain, write the formula out in numbers; that is something the book has not done, and an explanation would benefit others as well.

If you understand these two examples you understand everything.
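For what it’s worth, here is one mechanical reading of the formula, under two assumptions that should be stated loudly: no predication, and each if contributing exactly one conditional branch per warp, counted as divergent whenever both paths have active lanes. With 1 block of 64 threads there are 2 warps, and the tid % 2 pattern splits every warp, so:

    mathKernel1: #Branches = 2 warps x 1 if  = 2, #Divergent = 2, efficiency = (2 - 2)/2 = 0%
    mathKernel3: #Branches = 2 warps x 2 ifs = 4, #Divergent = 4, efficiency = (4 - 4)/4 = 0%

Real profilers report far higher numbers precisely because those assumptions fail: the compiler predicates these tiny bodies away, and the hardware counters do not count branches in exactly this manner.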

Of any two consecutive threads, one takes the first path and the other takes the second.
The modulo operation is not necessary; just check the LSB in such cases (tid & 1 instead of tid % 2).
You seem to be launching 2 warps per block. I would read the x values into shared memory and have one of the warps work on x[k] for k even and the other warp for k odd (see the sketch below). The paper on parallel prefix sum shows a way to avoid bank conflicts; it was based on 16 banks (old), and newer machines have 32 banks.
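A hypothetical sketch of that remap (my code, not the poster’s), keeping the 64-thread blocks from above: stage x in shared memory, then let warp 0 handle the even elements and warp 1 the odd ones, so each warp takes a single path.

__global__
void kernel2(int *x) {

    __shared__ int s[64];

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    s[tid] = x[base + tid];
    __syncthreads();

    int warp = tid >> 5;          // 0 or 1: two warps per 64-thread block
    int lane = tid & 31;          // the LSB-style masking mentioned above
    int k    = 2 * lane + warp;   // warp 0 -> even k, warp 1 -> odd k

    if (warp == 0)                // uniform within each warp: no divergence
        s[k] *= 2;
    else
        s[k] += 2;
    __syncthreads();

    x[base + tid] = s[tid];
}

Note that s[2 * lane + warp] is a stride-2 shared-memory access, exactly the pattern that causes two-way bank conflicts on 32-bank hardware; the padding trick from the prefix-sum paper applies here.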