Array index in the device program: improving the array index calculation in the device program

Hello everyone!

To simplify, I have two 2D arrays of 256*24 and one 1D array of 24 elements. I want to add the two 2D arrays cell by cell and multiply each sum by the corresponding element of the 1D array.

Here is my code:

// Kernel that executes on the CUDA device
__global__ void compute(float *result, int H, int W, float* x, float* a,float* b)
{
  // 2D thread coordinates
  int idx=threadIdx.x + blockIdx.x* blockDim.x;
  int idy=threadIdx.y + blockIdx.y* blockDim.y;

  // Flattened (linear) element index
  int index =idx + idy*W;

  // Modulo H : idxxx=index % H
  int idxx=index/H;
  int idxxx=index - H*idxx;

  if ( idx < W && idy < H ) result[index]= (a[index]+b[index])*x[idxxx];
}


  int i=0,j=0;
  int H=24;
  int W=256;
  float h_Result[W][H];

  // variables which will be used by the device
  float *d_ParamA;
  float *d_ParamB;
  float *d_ParamX;
  float *d_Result;

  // Matrix allocation
  float ParamA[W][H];
  float ParamB[W][H];
  float ParamX[H];

  size_t pitch_r,pitch_a,pitch_b;

  // Matrix filling
  for (i=0;i<W;i++)

    } // for j
  } // for i

  for(j=0;j<H;j++) ParamX[j]=j;

  // Memory size definition
  dim3 threadPerBlock(32,8);
  dim3 dimGrid(W/threadPerBlock.x , H/threadPerBlock.y);

  // Memory allocation
  cudaMallocPitch((void**) &d_Result, &pitch_r, W*sizeof(float), H);
  cudaMallocPitch((void**) &d_ParamA, &pitch_a, W*sizeof(float), H);
  cudaMallocPitch((void**) &d_ParamB, &pitch_b, W*sizeof(float), H);

  // Memory copying

  // Compute

  // Memory copying

  // Print the result
  printf("%5d ", (int)h_Result[i][j]);


This code works perfectly, but I spent the whole afternoon drawing diagrams to work out which index I had to use for the X array (idxxx)!

In the end I took the index modulo 24 (when the index is 32, the index into the X array is 8).

But it does not look like a good way to compute the index, and I still think there must be a much simpler way to do it (in terms of computation time on the GPU).

Does anybody have an idea about this?

I don’t really see what problem you have with the ‘%’ operator, or why you need your own implementation of it.
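To make that point concrete, here is a minimal host-side sketch in plain C (not device code) comparing the hand-written remainder from the kernel, the index minus the divisor times the integer quotient, with the built-in operator. The function names manual_mod and builtin_mod are hypothetical and only serve the illustration.

```c
#include <assert.h>

/* Hand-written remainder: index - h * (index / h).
   manual_mod is a hypothetical name used only for this sketch. */
int manual_mod(int index, int h)
{
    assert(h > 0);
    int q = index / h;        /* integer division truncates toward zero */
    return index - h * q;
}

/* The same value obtained with the built-in remainder operator. */
int builtin_mod(int index, int h)
{
    assert(h > 0);
    return index % h;
}
```

For non-negative indices the two functions return the same value, so the hand-written version saves nothing over writing index % H directly.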

Besides that, I’d replace ‘*’ with __umul24 (which is faster on current-generation hardware) and would think about padding your ‘x’ array to a power-of-2 length, so that you could replace ‘% H’ with ‘& (H-1)’. You could also do several iterations of your algorithm per thread to amortize the index calculation cost (as many samples in the SDK do).
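The power-of-2 suggestion can be sketched on the host in plain C as follows. This is only an illustration under assumptions: next_pow2 and mod_pow2 are hypothetical helper names, and the mask only identifies the right element of ‘x’ if the rows of the 2D arrays are padded to the same length (for example 32 instead of 24).

```c
#include <assert.h>

/* Smallest power of two >= n. Hypothetical helper for picking the
   padded row length (24 would be padded to 32). */
unsigned next_pow2(unsigned n)
{
    assert(n >= 1 && n <= 0x80000000u);
    unsigned p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Equivalent to index % padded_h, but valid only when padded_h is a
   power of two: the mask keeps the low-order bits of index. */
unsigned mod_pow2(unsigned index, unsigned padded_h)
{
    assert(padded_h != 0 && (padded_h & (padded_h - 1)) == 0);
    return index & (padded_h - 1);
}
```

With a padded length of 32, mod_pow2(index, 32) selects the same element as index % 32 using a single bitwise AND. The trade-off is the extra memory for the padding cells in every row.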