2D array & unique indexing to avoid thread conflicts

Hi everyone!

I’ve been dealing with 2D arrays (array[W][H]) in CUDA for some time, and since the beginning I’ve been computing my index like this:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
int idy = threadIdx.y + blockIdx.y * blockDim.y;
int index = idx + W * idy;

The problem with this method is that different threads can end up with the same index. For example, in my case:

W = 256;
H = 24;
dim3 threadPerBlock(24, 8);
dim3 dimGrid((W / threadPerBlock.x) + 1, (H / threadPerBlock.y) + 1); // which means dim3 dimGrid(11, 4)

So, for example, index = 256 is produced both by thread (16,0) of block (10,0) and by thread (0,1) of block (0,0).
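Written out (with idx = threadIdx.x + blockIdx.x * blockDim.x, and W = 256):

  thread (16,0) of block (10,0): idx = 16 + 10*24 = 256, idy = 0, so index = 256 + 256*0 = 256
  thread (0,1) of block (0,0):   idx = 0 + 0*24 = 0,     idy = 1, so index = 0 + 256*1 = 256

Note that the first of these threads only exists because the grid was rounded up: its idx = 256 already falls outside the logical width W = 256.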

Until now it has not been a problem, but now I need unique indexing!

I also tried calculating the index like this:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
int idy = threadIdx.y + blockIdx.y * blockDim.y;
int index = idy * (gridDim.x * blockDim.x) + idx;

And that seems to work, but only when W is an exact multiple of blockDim.x, so that gridDim.x * blockDim.x == W. That is not my case, because 256/24 does not divide evenly: the rounded-up grid gives gridDim.x * blockDim.x = 11 * 24 = 264 instead of 256.

Are there other ways to calculate unique indexes?

One way would be to do it exactly how you are doing it, but round up when defining the grid, i.e.:

dim3 dimGrid((W + threadPerBlock.x - 1) / threadPerBlock.x, (H + threadPerBlock.y - 1) / threadPerBlock.y);

and then check inside the kernel whether the index is valid.
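For example (a minimal sketch, not your actual kernel; myKernel and its parameters are just placeholders):

__global__ void myKernel(float *data, int W, int H)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;

    // Threads created by the rounded-up grid that fall outside the array do nothing.
    if (idx >= W || idy >= H) return;

    int index = idy * W + idx;  // unique: uses the real width W as the row pitch
    data[index] = 0.0f;         // placeholder for the real work
}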

Thanks for your answer!

You’re probably right, but unfortunately it does not solve my problem!

So I will explain it a bit more:

I have two arrays: A[256][24] and B[24]. I want to add B to every row of A and put the result into a new array, Result[256][24].

In other words, if I wanted to do it without parallel programming, I would have written:

for (int i = 0; i < 256; i++)
{
    for (int j = 0; j < 24; j++)
    {
        Result[i][j] = A[i][j] + B[j];
    }
}

At first I was doing it this way:

// Kernel that executes on the CUDA device
__global__ void compute(float *result, int H, int W, float *A, float *B)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;

    int index = idy * (gridDim.x * blockDim.x) + idx;  // note: uses the padded grid width, not W, as the pitch
    int indexB = index % 24;

    if (idx < W && idy < H)
        result[index] = A[index] + B[indexB];
}

It worked well, but the % operation is time-expensive! So I decided to change threadPerBlock to dim3 threadPerBlock(24,8) in order to use threadIdx.x as the index into the B array.

And then, whether I keep my program like that or change it according to your suggestion, the program does not compute the right sum: it does not add A[i][j] to B[j] but to B[j-1], B[j-8], or some other element!

The way I’m doing it is probably not “clean”, but I don’t see another way of doing it…

Don’t use gridDim.x*blockDim.x to index matrix elements; use the dimensions of the matrix, W and H. With the rounded-up grid, gridDim.x*blockDim.x is the padded width (264 here, not 256), so your flattened index no longer matches the actual memory layout of A, and index % 24 no longer recovers the right element of B.

// Kernel that executes on the CUDA device
__global__ void compute(float *result, int H, int W, float *A, float *B)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;

    if (idx < W && idy < H) {
        // (idx, idy) is a valid element; its row-major flat index is idy * W + idx
        int index = idy * W + idx;
        result[index] = A[index] + B[idx];  // result[idy][idx] = A[idy][idx] + B[idx]
    }
}
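For this to line up with B[idx], W has to be the row length (24 in your case, the size of B) and H the number of rows (256). A host-side launch might then look like this (a sketch; d_A, d_B and d_result are assumed to be device pointers you have already allocated and copied):

const int W = 24;   // row length = size of B
const int H = 256;  // number of rows

dim3 threadPerBlock(24, 8);
dim3 dimGrid((W + threadPerBlock.x - 1) / threadPerBlock.x,   // = 1
             (H + threadPerBlock.y - 1) / threadPerBlock.y);  // = 32

compute<<<dimGrid, threadPerBlock>>>(d_result, H, W, d_A, d_B);

And since blockDim.x equals the row length here, idx is just threadIdx.x, so B gets indexed without any % operation, which is what you wanted.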