# 2D arrays & unique indexing in order to avoid thread conflicts

Hi everyone!

I’ve been dealing with 2D arrays (array[W][H]) in CUDA for some time, and since the beginning I have calculated my index like this:

```
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int idy = threadIdx.y + blockIdx.y * blockDim.y;

int index = idx + W * idy;
```

The problem with this method is that different threads can end up with the same index. For example, in my case:

```
W = 256;
H = 24;
```

So for example, index = 256 is produced both by thread (15,0) of block (10,0) and by thread (0,1) of block (0,1).

Until now this has not been a problem, but now I need a unique index per element!

I also tried computing the index like this:

```
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int idy = threadIdx.y + blockIdx.y * blockDim.y;

int index = idy * (gridDim.x * blockDim.x) + idx;
```

And that seems to work, but only when gridDim.x * blockDim.x is exactly equal to W, and that is not my case because 256/24 does not divide evenly.

Are there other ways to compute unique indexes?

One way would be to do it exactly how you are doing it, but round up when defining the grid.

ie:

```
dim3 dimGrid((W + threadPerBlock.x - 1) / threadPerBlock.x,
             (H + threadPerBlock.y - 1) / threadPerBlock.y);
```

and then check inside the kernel whether the index is valid.

You’re probably right, but unfortunately it does not solve my problem!

So I will explain it a little more:

I have two arrays: A and B. I want to add B to every row of A and put the result into a new array, Result.

In other words, if I wanted to do it without parallel programming, I would have done:

```
for (int i = 0; i < 256; i++)
{
    for (int j = 0; j < 24; j++)
    {
        Result[i][j] = A[i][j] + B[j];
    }
}
```

At first I was doing it this way:

```
// Kernel that executes on the CUDA device
__global__ void compute(float *result, int H, int W, float *A, float *B)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;

    int index = idy * (gridDim.x * blockDim.x) + idx;
    int indexB = index % 24;

    if (idx < W && idy < H) result[index] = A[index] + B[indexB];
}
```

with

It worked well, but the % operation is expensive! So I decided to change threadPerBlock to dim3 threadPerBlock(24, 8) in order to use threadIdx.x as the index into the B array!

And then, whether I keep my program like that or change it according to your suggestion, it does not compute the right sum: it does not add A[i][j] to B[j] but to B[j-1], B[j-8], or any other possibility!

The way I’m doing it is probably not “clean” but I don’t see other ways of doing it…

Don’t use gridDim.x * blockDim.x to index matrix elements; you should use the dimensions of the matrix, H or W.

```
// Kernel that executes on the CUDA device
__global__ void compute(float *result, int H, int W, float *A, float *B)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;

    if (idx < W && idy < H) {
        // valid index (idx, idy), and
        // its row-major map is (idy * W + idx)
        int index = idy * W + idx;
        result[index] = A[index] + B[idx];  // result[idy][idx] = A[idy][idx] + B[idx]
    }
}
```