Thread divergence and broadcast problem

Hello experts,

I have a problem: how do I remove thread divergence and decrease the running time of my program?

Here’s a simple kernel that has to set all values of an array to 999, except a[s], which must be 0.

(Single block, 1D threads)

__global__ void iss(int *a, int s)
{
    int u = threadIdx.x;
    a[u] = 999;
    if (u == s)
        a[u] = 0;
}

Here thread number s diverges from the others.

One possible solution, I guess, is

__global__ void iss(int *a, int s)
{
    int u = threadIdx.x;
    a[u] = 999;
    a[s] = 0;
}

But I think a broadcast occurs here, which would again serialize the process.

Can anyone suggest efficient code for this?

Or, if that is not possible, which of the two versions above would be better?

(If you are familiar with graphs, this is actually the INITIALIZE-SINGLE-SOURCE function of Dijkstra’s algorithm.)

Thanks in advance!

You definitely want the first variant, as the second has loads of unnecessary memory transactions (every thread also writes a[s]). And since the kernel is memory bandwidth bound, this is the only thing to optimize for.

The compiler will hopefully optimize your first version to generate the same code as

__global__ void iss(int *a, int s)
{
    int u = threadIdx.x;
    a[u] = (u == s) ? 0 : 999;
}

but you might want to help the optimizer by making that transformation yourself.
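For anyone who wants to try the transformed kernel, here is a minimal host-side sketch. It assumes a single block of n threads, as in the question, so n must not exceed the per-block thread limit. The helper name run_iss is made up for illustration, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Ternary form of the kernel: every thread does exactly one store.
__global__ void iss(int *a, int s)
{
    int u = threadIdx.x;
    a[u] = (u == s) ? 0 : 999;
}

// Hypothetical helper (not from the original post): launches one block of
// n threads and copies the initialized array back to the host.
std::vector<int> run_iss(int n, int s)
{
    int *d_a = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));

    iss<<<1, n>>>(d_a, s);

    std::vector<int> h_a(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_a.data(), d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return h_a;
}
```

Calling run_iss(8, 3) from main should give back an array holding 999 everywhere except index 3, which holds 0.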

Thanks for your reply. I got it.

But can you suggest any formula (replacing the “if”) that does this without any thread divergence?

Any hint is highly appreciated.

Hint: Look at the code you just cited…

BTW I just tested it and the compiler is not able to optimize your first kernel. Not that it makes any difference, as the kernel is memory bandwidth bound anyway.
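Besides the conditional expression, the same store can also be written as plain arithmetic. This is only a sketch, assuming the single-block launch from the question. The names iss_arith and run_iss_arith are made up here, and nothing in this thread measures whether it runs any faster than the ternary form. Since the kernel is memory bandwidth bound, it probably does not.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Arithmetic form: (u != s) is 1 for every thread except thread s, where it
// is 0, so the product is 999 or 0 without any if statement.
__global__ void iss_arith(int *a, int s)
{
    int u = threadIdx.x;
    a[u] = 999 * (u != s);
}

// Hypothetical helper: launches one block of n threads (n must fit in a
// single block) and returns the result. Error checking omitted for brevity.
std::vector<int> run_iss_arith(int n, int s)
{
    int *d_a = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));

    iss_arith<<<1, n>>>(d_a, s);

    std::vector<int> h_a(n);
    cudaMemcpy(h_a.data(), d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return h_a;
}
```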