# Array index in the device program: improving the array index calculation

Hello everyone!

To simplify: I have two 2D arrays of 256*24 and one 1D array of 24. I want to add the corresponding cells of the two 2D arrays and multiply each result by the corresponding element of the 1D array.

Here is my code:

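``````
#include <stdio.h>

// Kernel that executes on the CUDA device
// result[index] = (a[index] + b[index]) * x[index % H]
// All buffers are addressed as flat arrays of W*H floats.
__global__ void compute(float *result, int H, int W, float *x, float *a, float *b)
{
    // Global 2D thread coordinates (not shown in the posted code;
    // this is the usual definition)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int index = idx + idy * W;

    // Modulo H: idxxx = index % H
    int idxx = index / H;
    int idxxx = index - H * idxx;

    if (idx < W && idy < H) result[index] = (a[index] + b[index]) * x[idxxx];
}

int main()
{
    int i = 0, j = 0;
    const int H = 24;
    const int W = 256;
    float h_Result[W][H];

    // Variables which will be used by the device
    float *d_ParamA;
    float *d_ParamB;
    float *d_ParamX;
    float *d_Result;

    // Matrix allocation
    float ParamA[W][H];
    float ParamB[W][H];
    float ParamX[H];
    size_t pitch_r, pitch_a, pitch_b;

    // Matrix filling
    for (i = 0; i < W; i++)
    {
        for (j = 0; j < H; j++)
        {
            ParamA[i][j] = i + 1;
            ParamB[i][j] = 2 * (i + 1);
        } // for j
    } // for i
    for (j = 0; j < H; j++) ParamX[j] = j;

    // Memory allocation
    // 2D (the kernel indexes these buffers linearly, so this is only
    // correct when the returned pitch equals W*sizeof(float))
    cudaMallocPitch((void**)&d_Result, &pitch_r, W * sizeof(float), H);
    cudaMallocPitch((void**)&d_ParamA, &pitch_a, W * sizeof(float), H);
    cudaMallocPitch((void**)&d_ParamB, &pitch_b, W * sizeof(float), H);
    // 1D
    cudaMalloc((void**)&d_ParamX, H * sizeof(float));

    // Memory copying
    // 2D
    cudaMemcpy2D(d_ParamA, pitch_a, ParamA, W * sizeof(float), W * sizeof(float), H, cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_ParamB, pitch_b, ParamB, W * sizeof(float), W * sizeof(float), H, cudaMemcpyHostToDevice);
    // 1D
    cudaMemcpy(d_ParamX, ParamX, H * sizeof(float), cudaMemcpyHostToDevice);

    // Compute
    // The launch is not shown in the posted code; this configuration is an assumption
    dim3 block(16, 16);
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    compute<<<grid, block>>>(d_Result, H, W, d_ParamX, d_ParamA, d_ParamB);

    // Memory copying (the source pitch is the one returned by cudaMallocPitch)
    cudaMemcpy2D(h_Result, W * sizeof(float), d_Result, pitch_r, W * sizeof(float), H, cudaMemcpyDeviceToHost);

    // Print the result
    for (i = 0; i < W; i++)
    {
        for (j = 0; j < H; j++)
        {
            printf("%5d ", (int)h_Result[i][j]);
        }
        printf("\n");
    }

    cudaFree(d_ParamX);
    cudaFree(d_Result);
    cudaFree(d_ParamA);
    cudaFree(d_ParamB);

    return 0;
}
``````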
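
For checking the kernel output, a host-side reference can help. The following is only a sketch in plain C, assuming the same flat row-major layout as `ParamA[W][H]` (W rows of H floats); the function name `compute_reference` is made up for this illustration and is not part of the posted code.

```c
/* Host-side reference for the computation described above:
   result[i*H + j] = (a[i*H + j] + b[i*H + j]) * x[j].
   All arrays are flat and row-major, W rows of H floats.
   compute_reference is a hypothetical helper name. */
void compute_reference(float *result, int H, int W,
                       const float *x, const float *a, const float *b)
{
    for (int i = 0; i < W; i++) {
        for (int j = 0; j < H; j++) {
            int index = i * H + j;   /* flat index of element (i, j) */
            result[index] = (a[index] + b[index]) * x[j];
        }
    }
}
```

Comparing its output element by element with the buffer copied back from the device is one way to confirm that the x index chosen in the kernel is the intended one.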

This code works perfectly, but I spent the whole afternoon drawing lots of diagrams to find which index I had to use for the X array (idxxx)!

In the end I take the index modulo 24 (when the index is 32, the index into the X array is 8).
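
To make the modulo step concrete, here is a small sketch in plain C showing that the hand-written form (divide, multiply back, subtract) gives the same value as the `%` operator for non-negative indices. The function names `x_index_mod` and `x_index_manual` are invented for this illustration.

```c
/* Position within a row (the x index) of a flat row-major index
   when each row holds H elements. Hypothetical helper name. */
int x_index_mod(int index, int H)
{
    return index % H;
}

/* The same value computed by hand: index - H * (index / H).
   Integer division truncates, so row is the row number. */
int x_index_manual(int index, int H)
{
    int row = index / H;
    return index - H * row;
}
```

With H = 24, for example, both functions map index 50 to 2, since 50 = 2*24 + 2.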

But it does not look like a good way to compute the index, and I still think there is a much easier way to do it (in terms of computation time on the GPU).

Does anybody have an idea about this?

I didn’t really get what problem you have with the ‘%’ operator and why you need your own implementation of it.

Besides that, I’d replace ‘*’ with __umul24 (which is faster on current-generation hardware) and would think about making your ‘x’ array a power of 2 in size, so that you could replace ‘% H’ with ‘& (H-1)’. You could also do several iterations of your algorithm per thread to amortize the index calculation cost (as many samples in the SDK do).
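
The power-of-2 suggestion can be sketched in plain C as follows. This is only an illustration with made-up helper names (`is_power_of_two`, `mod_pow2`, `next_pow2`); it assumes non-negative indices stored as unsigned integers.

```c
/* Returns 1 if n is a positive power of two, 0 otherwise. */
int is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* index % n computed with a bit mask; only valid when n is a power of two. */
unsigned int mod_pow2(unsigned int index, unsigned int n)
{
    return index & (n - 1u);
}

/* Smallest power of two that is >= n (for n >= 1),
   e.g. the row length to pad to. */
unsigned int next_pow2(unsigned int n)
{
    unsigned int p = 1u;
    while (p < n) p <<= 1;
    return p;
}
```

Note that 24 is not a power of two, so the mask only matches the modulo if the row length is padded to 32 everywhere: padding only the ‘x’ array while the 2D arrays keep rows of 24 would give different indices than ‘% 24’.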