Improvement of the array index calculation in the device program

Hello everyone!

To simplify: I have two 2D arrays of 256×24 and one 1D array of 24 elements. I want to add the two 2D arrays cell by cell and multiply each result by the corresponding element of the 1D array.

Here is my code:

#include <stdio.h>

// Kernel that executes on the CUDA device
__global__ void compute(float *result, int H, int W, float *x, float *a, float *b)
{
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  int idy = threadIdx.y + blockIdx.y * blockDim.y;
  int index = idx + idy * W;

  // Modulo H by hand: idxxx = index % H
  int idxx = index / H;
  int idxxx = index - H * idxx;

  // NB: a, b and result are indexed with row stride W, which assumes the
  // pitch returned by cudaMallocPitch equals W*sizeof(float)
  if (idx < W && idy < H) result[index] = (a[index] + b[index]) * x[idxxx];
}

int main(void)
{
  int i = 0, j = 0;
  const int H = 24;
  const int W = 256;

  float h_Result[W][H];

  // Variables which will be used by the device
  float *d_ParamA;
  float *d_ParamB;
  float *d_ParamX;
  float *d_Result;

  // Matrix allocation
  float ParamA[W][H];
  float ParamB[W][H];
  float ParamX[H];

  size_t pitch_r, pitch_a, pitch_b;

  // Matrix filling
  for (i = 0; i < W; i++)
  {
    for (j = 0; j < H; j++)
    {
      ParamA[i][j] = i + 1;
      ParamB[i][j] = 2 * (i + 1);
    } // for j
  } // for i
  for (j = 0; j < H; j++) ParamX[j] = j;

  // Block and grid dimensions
  dim3 threadPerBlock(32, 8);
  dim3 dimGrid(W / threadPerBlock.x, H / threadPerBlock.y);

  // Memory allocation
  // 2D
  cudaMallocPitch((void**)&d_Result, &pitch_r, W * sizeof(float), H);
  cudaMallocPitch((void**)&d_ParamA, &pitch_a, W * sizeof(float), H);
  cudaMallocPitch((void**)&d_ParamB, &pitch_b, W * sizeof(float), H);
  // 1D
  cudaMalloc((void**)&d_ParamX, H * sizeof(float));

  // Memory copying
  // 2D
  cudaMemcpy2D(d_ParamA, pitch_a, ParamA, W * sizeof(float), W * sizeof(float), H, cudaMemcpyHostToDevice);
  cudaMemcpy2D(d_ParamB, pitch_b, ParamB, W * sizeof(float), W * sizeof(float), H, cudaMemcpyHostToDevice);
  // 1D
  cudaMemcpy(d_ParamX, ParamX, H * sizeof(float), cudaMemcpyHostToDevice);

  // Compute
  compute<<<dimGrid, threadPerBlock>>>(d_Result, H, W, d_ParamX, d_ParamA, d_ParamB);

  // Copy the result back to the host
  cudaMemcpy2D(h_Result, W * sizeof(float), d_Result, W * sizeof(float), W * sizeof(float), H, cudaMemcpyDeviceToHost);

  // Print the result
  for (i = 0; i < W; i++)
  {
    for (j = 0; j < H; j++)
    {
      printf("%5d ", (int)h_Result[i][j]);
    }
    printf("\n");
  }

  cudaFree(d_ParamX);
  cudaFree(d_Result);
  cudaFree(d_ParamA);
  cudaFree(d_ParamB);
  return 0;
}

This code works perfectly, but I spent the whole afternoon drawing lots of schemes to find which index I had to choose for the X array (idxxx)!

Finally I did a modulo of the index by 24 (e.g. when the index is 32, the index into the X array is 32 % 24 = 8).

But it does not look like a good way to compute the index, and I still think there is a much easier way to do it (in terms of computation time for the GPU).

Does anybody have an idea on this?

I didn’t really get what problem you have with the ‘%’ operator and why you need to do your own implementation of it.

Besides that, I’d replace ‘*’ with __umul24 (which is faster on current-generation hardware) and would think about making your ‘x’ array a power of 2 in size, so that you could replace ‘% H’ with ‘& (H-1)’. You could also do several iterations of your algorithm per thread to amortize the index calculation cost (as many of the samples in the SDK do).