How much performance loss does if (idx < n) { ... } incur?

Hey guys,

I’ve been reading more about branch divergence in GPU computing, and one thing I have to ask: is it performant to write code like this?

__global__
void kernel(float *a, float *b, float *c, int n) {

     int i = blockIdx.x * blockDim.x + threadIdx.x;

     if (i >= n)          // guard: the last block may extend past the end of the arrays
          return;

     c[i] = a[i] * b[i];
}

And if I do incur a performance loss, how do I remedy it?

I saw one random thing on the internet that said I should instead use this:

int i = blockIdx.x * blockDim.x + threadIdx.x;

i = max(i, 0);
i = min(i, n - 1);   // as quoted it was min(i, n), which still indexes out of bounds

But I’m not sure about that, because it seems like all the out-of-range threads would be clamped to the same index, so I’d be doing the same calculation on certain elements several times at once?

Your top code snippet is pretty common – at least in my kernels. It’s performant.

The use case is when you have a grid made up of small blocks (2 or 4 warps) with the “last” block only partially covering the input array (‘n’). If your kernels are thread-oriented, just test and exit. If they’re warp-oriented, just exit any warps that have nothing to do (see the sketch below).
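One possible shape for that warp-oriented exit, inside a kernel like the one above (a sketch, my code, not the poster’s): a warp whose lowest index is already past n has nothing to do and can leave uniformly, with no divergence; only the one partially covered warp still needs its per-thread test.

     int warpBase = blockIdx.x * blockDim.x + (threadIdx.x & ~31);  // first index owned by this warp

     if (warpBase >= n)   // whole warp out of range: all 32 lanes exit together
          return;

     int i = warpBase + (threadIdx.x & 31);

     if (i < n)           // only the boundary warp ever diverges here
          c[i] = a[i] * b[i];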

I don’t like that second snippet at all for the reasons you note.

The top code should be fine.

There are two possible sources of performance loss in the top code.

  1. The check itself isn’t free, and it is executed by every thread. Sometimes people work around this by hoisting it out of the kernel: remove the check from the kernel, then launch two kernels, the first covering only the fully populated blocks and the second covering the remaining elements (see the sketch after this list).

  2. Threads that exit early potentially sit idle until the other threads in the same warp finish. There isn’t much you can do about this, and it isn’t any worse than the bottom code you have.
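A minimal sketch of that split launch, assuming an element-wise multiply like the original kernel (the function names are mine, not from the thread): the first kernel covers only fully populated blocks and carries no check; a second, single-block launch mops up the tail.

__global__
void multiplyFull(const float *a, const float *b, float *c) {

     int i = blockIdx.x * blockDim.x + threadIdx.x;   // always < n by construction

     c[i] = a[i] * b[i];
}

__global__
void multiplyTail(const float *a, const float *b, float *c, int offset, int n) {

     int i = offset + blockIdx.x * blockDim.x + threadIdx.x;

     if (i < n)                                       // the check survives only here
          c[i] = a[i] * b[i];
}

void launch(const float *a, const float *b, float *c, int n) {

     const int block = 256;
     int fullBlocks  = n / block;                     // blocks with no partial coverage
     int tail        = n - fullBlocks * block;

     if (fullBlocks > 0)
          multiplyFull<<<fullBlocks, block>>>(a, b, c);

     if (tail > 0)                                    // one extra block for the remainder
          multiplyTail<<<1, block>>>(a, b, c, fullBlocks * block, n);
}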

Ooh, I really like your first suggestion. I think I’ll do that so there’s zero branching. Yay for asynchronous kernel launches!

Okay, I have a question about branching and warps.

I’m going to assume that it is possible to use if-statements effectively in CUDA provided that certain considerations are, well, taken into consideration.

Let’s say we have an if-else statement so obviously, we have two paths.

Now, the instant a thread hits this, it has to make a choice. What I’m assuming happens is that two separate warp instances are created (though not necessarily executed): warp_1 and warp_2, where warp_1 is the successful “if” and warp_2 is the “else”. For maximum efficiency, it’s ideal that we fill up both of these warps, right? Like, one thread hits this statement and is transferred to warp_1, another hits it and is transferred to warp_2, and then, once all the threads are done, the warps are executed serially.

My biggest question is, I’ve written code like this :

#include <thrust/device_vector.h>

__global__
void kernel1(int *x) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i % 2 == 0) {

        x[i] *= 2;      // even-indexed threads take this path
    } else {

        x[i] += 2;      // odd-indexed threads take this one
    }
}

int main(void) {

    thrust::device_vector<int> x(128, 1);

    kernel1<<<2, 64>>>(thrust::raw_pointer_cast(x.data()));
    cudaDeviceSynchronize();    // wait for the kernel before x is destroyed

    return 0;
}

and both the branch and warp efficiencies come out as 100%, which is why I think the way I do. Correct me if I’m wrong.
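For context, efficiency numbers like these are typically read from nvprof metrics, with something like the following (./app standing in for the actual binary):

    nvprof --metrics branch_efficiency,warp_execution_efficiency ./app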

I’ve been brushing up and adding to my knowledge using various books, and right now I happen to be covering the number-one problem of warp divergence. I’ve chosen this thread to continue the discussion and put people’s questions to rest. So let’s build it up.

John, if you are still wondering/reading: the reason you get this perfect result is that the branch has been predicated by the compiler; I think if you look at the PTX you will see the predicate. If the boolean condition is attached to a predicate p, then both the p and !p paths are scheduled for execution, and only the lanes whose predicate is true write their result. That is why you get 100% efficiency from this optimization, while the true efficiency would otherwise have been 50%. In the next post we will define the efficiency and see what we should expect as the theoretical efficiency in two key examples. I hope some expert will assist here as well.
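To make that concrete, here is roughly what predication amounts to, written back at the CUDA level (a sketch of the effect, not literal compiler output): both sides are issued for the whole warp and the predicate selects what is kept, so no lane ever takes a branch.

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    bool p  = (tid % 2 == 0);       // the predicate the compiler extracts

    float ia = p ? 100.0f : 0.0f;   // "then" side, kept only where p is true
    float ib = p ? 0.0f : 200.0f;   // "else" side, kept only where p is false

    c[tid] = ia + ib;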

What I do not like in books is when they do not write the formula out in numbers: they just define it in words, print what the tool reported on their system, and then explain that result. They are very good books, but at some points they get terrible.

So let’s define Branch Efficiency = (#Branches - #Divergent Branches) / #Branches, a number in [0, 1].

I want someone who fully understands the formula to write it out in numbers for these two examples (I know the first but am not certain about the second):

__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if (tid % 2 == 0)
    {
        ia = 100.0f;
    }
    else
    {
        ib = 200.0f;
    }

    c[tid] = ia + ib;
}

And the second code is:

__global__ void mathKernel3(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    bool ipred = (tid % 2 == 0);

    if (ipred)
    {
        ia = 100.0f;
    }

    if (!ipred)
    {
        ib = 200.0f;
    }

    c[tid] = ia + ib;
}

Let’s take the book’s setup as well and launch 1 block with 64 threads, so there is something for one SM to do. Please, someone who is 100% certain, write the formula out in numbers; that is something the book has not done, and an explanation would benefit others as well.

If you understand these two examples you understand everything.
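For what it’s worth, here is one mechanical reading of the formula, under two assumptions that should be stated loudly: no predication, and each if contributing exactly one conditional branch per warp, counted as divergent whenever both paths have active lanes. With 1 block of 64 threads there are 2 warps, and the tid % 2 pattern splits every warp, so:

    mathKernel1: #Branches = 2 warps x 1 if  = 2, #Divergent = 2, efficiency = (2 - 2)/2 = 0%
    mathKernel3: #Branches = 2 warps x 2 ifs = 4, #Divergent = 4, efficiency = (4 - 4)/4 = 0%

Real profilers report far higher numbers precisely because those assumptions fail: the compiler predicates these tiny bodies away, and the hardware counters do not count branches in exactly this manner.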

Of any two consecutive threads, one takes the first path and the other takes the second.
The modulo operation is not necessary; just check the LSB in such cases (tid & 1 instead of tid % 2).
You seem to be launching 2 warps per block. I would read the x values into shared memory and have one of the warps work on x[k] for k even and the other warp for k odd (see the sketch below). The paper on parallel prefix sum shows a way to avoid bank conflicts; it was based on 16 banks (old), and newer machines have 32 banks.
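A hypothetical sketch of that remap (my code, not the poster’s), keeping the 64-thread blocks from above: stage x in shared memory, then let warp 0 handle the even elements and warp 1 the odd ones, so each warp takes a single path.

__global__
void kernel2(int *x) {

    __shared__ int s[64];

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    s[tid] = x[base + tid];
    __syncthreads();

    int warp = tid >> 5;          // 0 or 1: two warps per 64-thread block
    int lane = tid & 31;          // the LSB-style masking mentioned above
    int k    = 2 * lane + warp;   // warp 0 -> even k, warp 1 -> odd k

    if (warp == 0)                // uniform within each warp: no divergence
        s[k] *= 2;
    else
        s[k] += 2;
    __syncthreads();

    x[base + tid] = s[tid];
}

Note that s[2 * lane + warp] is a stride-2 shared-memory access, exactly the pattern that causes two-way bank conflicts on 32-bank hardware; the padding trick from the prefix-sum paper applies here.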