Performance issue

Below are the CPU code and the CUDA code for my application. The CPU code takes 60 seconds to run, whereas my CUDA code takes 28 seconds. My requirement is to complete the run in 7 seconds, so as to achieve a speedup of at least 5x.

My CPU code

void compute()
{
    for (int i = 0; i < NfactorsPerCol; i++, currentRowInpPtr += NfactorsPerRow)
    {
        for (int j = 0; j < NfactorsPerRow; j++)
        {
            ....
            ....
            for (; k < MaxOffset; k++, j++)
            {
                acc += linear_light_array[k] * brightness[j];
            }
        }
    }
}

My CUDA code piece:

__global__ void Compute(int NfactorsPerRow,
                        int NfactorsPerCol, ..... )
{
    int currPos = (blockIdx.y * blockDim.y * NfactorsPerRow)
                + (threadIdx.z * NfactorsPerRow * NfactorsPerCol)
                + (threadIdx.y * NfactorsPerRow)
                + (blockIdx.x * blockDim.x) + threadIdx.x;

    int currRowPos = (blockIdx.y * blockDim.y * NfactorsPerRow)
                   + (threadIdx.z * NfactorsPerRow * NfactorsPerRow)
                   + (threadIdx.y * NfactorsPerRow);
    ...
    for (; index <= MaxOffset; index++, j++)
    {
        acc += tex1Dfetch(tex1, currRowPos + k) * tex1Dfetch(tex2, j);
    }
}

Please suggest further optimizations. I need to reach the target run time.

You could use the fast 24-bit integer multiplication (__mul24) for the index arithmetic. There’s an example in the convolutionSeparable SDK sample. I don’t know if it will matter.
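To make the 24-bit multiply suggestion concrete, here is a minimal sketch. The kernel fillIndex and the helper checkFillIndex are hypothetical names made up for illustration; only __mul24 itself is a real CUDA intrinsic. Both operands must fit in 24 bits for the result to be correct.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own flattened 2D index,
// computing every product with the 24-bit multiply intrinsic.
__global__ void fillIndex(int *out, int width, int height)
{
    int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
    if (x < width && y < height) {
        int idx = __mul24(y, width) + x;   // valid while y and width fit in 24 bits
        out[idx] = idx;
    }
}

// Hypothetical host helper: runs the kernel and checks that out[i] == i.
bool checkFillIndex(int width, int height)
{
    int n = width * height;
    int *h = (int *)malloc(n * sizeof(int));
    int *d = NULL;
    cudaMalloc(&d, n * sizeof(int));

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    fillIndex<<<grid, block>>>(d, width, height);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; i++)
        ok = ok && (h[i] == i);

    cudaFree(d);
    free(h);
    return ok;
}
```

Whether this helps depends on the hardware: on compute capability 1.x the 24-bit multiply is faster than the 32-bit one, but on 2.x and later the 32-bit multiply is native and __mul24 is emulated with several instructions, so there it is better avoided.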

Then, it looks like you could do global memory reads, possibly staged into shared memory, instead of reading from texture. I didn’t look closely at your memory access pattern, but it should work.

Edit: As a general hint, try to map the problem to use one thread per memory access. More threads are almost always better, especially when you have memory reads. Is your task multiplying each pixel in a big 2D array by each pixel in a smaller array, or something like that? Maybe you can put the smaller array into shared or constant memory. The convolution examples look like good examples for you. Sorry in advance if I’m misleading you – I haven’t looked that closely at your code.
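A rough sketch of that idea, assuming the task is a weighted sum over a small window along each row. This is not the poster's actual kernel: RADIUS, TILE, d_filter, rowConvolve and rowConvolveHost are all hypothetical names, and out-of-range pixels are simply treated as zero. The small array sits in constant memory, each block stages its slice of the row (plus a halo) in shared memory, and each thread produces one output.

```cuda
#include <cuda_runtime.h>

#define RADIUS 4      // hypothetical filter radius (filter length 2*RADIUS+1)
#define TILE   128    // threads per block, one output per thread

__constant__ float d_filter[2 * RADIUS + 1];   // the small array

// Launch with blockDim.x == TILE and one grid row (blockIdx.y) per image row.
__global__ void rowConvolve(const float *in, float *out, int width)
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int row = blockIdx.y;
    int x   = blockIdx.x * TILE + threadIdx.x;
    const float *rowIn = in + (size_t)row * width;

    // Cooperative load of the tile plus halo; outside the row counts as zero.
    for (int s = threadIdx.x; s < TILE + 2 * RADIUS; s += TILE) {
        int gx = blockIdx.x * TILE + s - RADIUS;
        tile[s] = (gx >= 0 && gx < width) ? rowIn[gx] : 0.0f;
    }
    __syncthreads();   // all loads must finish before any thread reads the tile

    if (x < width) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)
            acc += tile[threadIdx.x + RADIUS + k] * d_filter[k + RADIUS];
        out[(size_t)row * width + x] = acc;
    }
}

// Hypothetical host wrapper: copies data over, runs the kernel, copies back.
void rowConvolveHost(const float *h_in, float *h_out, const float *h_filter,
                     int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_filter, h_filter, (2 * RADIUS + 1) * sizeof(float));

    dim3 grid((width + TILE - 1) / TILE, height);
    rowConvolve<<<grid, TILE>>>(d_in, d_out, width);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Two things tend to go wrong with this pattern: forgetting the __syncthreads() between the load and the accumulation, and not loading the halo, so threads near the tile edge read shared memory that was never written. Either one gives results that look random.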

I tried implementing it using shared memory, but I am getting weird results.

Can I do this kind of convolution using the built-in CUFFT library? Is there a suitable API I can use for convolution?

Thanks in advance.

Have you looked at the example projects, convolutionSeparable and convolutionTexture? They’re a bit complicated, but the speed is very good, and they are very educational once you understand them.
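On the CUFFT question: as far as I know, CUFFT has no single convolution call. The usual route is forward transforms of both signals, a pointwise complex multiply, and an inverse transform. Below is a minimal 1D sketch under that assumption. mulAndScale and fftConvolve are hypothetical names; the cufftPlan1d, cufftExecR2C and cufftExecC2R calls are the real library API. Link with -lcufft.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Pointwise complex product a[i] *= b[i], scaled by 1/n because
// CUFFT transforms are unnormalized.
__global__ void mulAndScale(cufftComplex *a, const cufftComplex *b,
                            int count, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        cufftComplex x = a[i], y = b[i];
        a[i].x = (x.x * y.x - x.y * y.y) * scale;
        a[i].y = (x.x * y.y + x.y * y.x) * scale;
    }
}

// Circular convolution of two real signals of length n (host pointers).
void fftConvolve(const float *h_a, const float *h_b, float *h_out, int n)
{
    int nc = n / 2 + 1;   // number of complex outputs of a real-to-complex FFT
    float *d_a = NULL, *d_b = NULL;
    cufftComplex *d_A = NULL, *d_B = NULL;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_A, nc * sizeof(cufftComplex));
    cudaMalloc(&d_B, nc * sizeof(cufftComplex));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle fwd, inv;
    cufftPlan1d(&fwd, n, CUFFT_R2C, 1);
    cufftPlan1d(&inv, n, CUFFT_C2R, 1);

    cufftExecR2C(fwd, d_a, d_A);
    cufftExecR2C(fwd, d_b, d_B);
    mulAndScale<<<(nc + 255) / 256, 256>>>(d_A, d_B, nc, 1.0f / n);
    cufftExecC2R(inv, d_A, d_a);          // result overwrites d_a

    cudaMemcpy(h_out, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    cufftDestroy(fwd);
    cufftDestroy(inv);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_A); cudaFree(d_B);
}
```

The result is a circular convolution, so to get an ordinary (linear) one, zero-pad both inputs to at least the sum of their lengths minus one. The FFT route usually only pays off when the smaller array is fairly large; for a short filter, a direct kernel like the SDK convolution samples is likely to be faster.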