Comparison of a CUDA kernel's performance on different GPUs/toolkits/drivers

Hello,

I’m testing the same CUDA program on two different Linux platforms:
First platform:

  • Intel Core i7 (2nd gen)
  • NVIDIA Quadro 1000M (compute capability 2.1 - 96 CUDA cores - Fermi architecture)
  • CUDA toolkit 5.5
  • NVIDIA driver 331.67
    Kernel execution time: around 10 seconds.

Second platform:

  • Intel Core i5 (1st gen)
  • NVIDIA GeForce GTX 650 (compute capability 3.0 - 384 CUDA cores - Kepler architecture)
  • CUDA toolkit 5.0
  • NVIDIA driver 304.108
    Kernel execution time: around 100 seconds (10 times slower!)

Can anyone explain the reason for this performance hit?

Thank you

My kernel code:

// SIZE is a compile-time constant (the row length in elements); it is defined elsewhere in my code.
__global__ void EuclideanDistances( float *D_img, float *D_fond , float *distances , int *indices, int R_img , int R_fond )
{
    __shared__ float accumResult[SIZE];
    __shared__ float sA[SIZE];
    __shared__ float sB[SIZE];

    int bx = blockIdx.x;  // row of D_img  (0 .. R_img  - 1)
    int by = blockIdx.y;  // row of D_fond (0 .. R_fond - 1)
    int ty = threadIdx.y; // element within the row (0 .. SIZE - 1, SIZE = 128 here)

    // Stage one row of each matrix in shared memory.
    sA[ty] = D_img [bx * SIZE + ty];
    sB[ty] = D_fond[by * SIZE + ty];
    __syncthreads();

    // Per-element squared difference.
    accumResult[ty] = (sA[ty] - sB[ty]) * (sA[ty] - sB[ty]);
    __syncthreads();

    // Parallel tree reduction over the SIZE partial results.
    for (int pas = SIZE / 2 ; pas > 0 ; pas = pas / 2)
    {
        if (ty < pas)
            accumResult[ty] += accumResult[pas + ty];
        __syncthreads(); // must be reached by all threads of the block, not only ty < pas
    }

    if (threadIdx.y == 0) // only one thread writes the result
        distances[bx * R_fond + by] = accumResult[0];
}
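
For reference, here is a minimal host-side launch sketch consistent with the indexing above (illustrative only; it assumes one 1 x SIZE block per (bx, by) pair and that the device pointers are already allocated and filled):

void launchEuclideanDistances(float *D_img, float *D_fond, float *distances,
                              int *indices, int R_img, int R_fond)
{
    dim3 grid(R_img, R_fond);   // one block per (row of D_img, row of D_fond) pair
    dim3 block(1, SIZE);        // ty covers the SIZE elements of one row
    EuclideanDistances<<<grid, block>>>(D_img, D_fond, distances, indices,
                                        R_img, R_fond);
    cudaDeviceSynchronize();    // so the measured kernel time is complete
}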

I think you should use the CUDA Occupancy Calculator.

On different architectures you have to adjust the threads per block to obtain the best performance.
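
As an illustration of that kind of tuning, a sketch like the one below queries an occupancy-friendly block size at runtime. Note the cudaOccupancyMaxPotentialBlockSize call only exists in CUDA 6.5 and later, so on the 5.0/5.5 toolkits in this thread you would get the same information offline from the Occupancy Calculator spreadsheet:

#include <cstdio>
#include <cuda_runtime.h>

// Kernel as posted above.
__global__ void EuclideanDistances(float*, float*, float*, int*, int, int);

void reportOccupancy()
{
    int minGridSize = 0;  // smallest grid that can keep the device fully occupied
    int blockSize   = 0;  // suggested threads per block for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       EuclideanDistances, 0, 0);
    printf("Suggested block size: %d threads (minimum grid: %d blocks)\n",
           blockSize, minGridSize);
}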

Try to reduce shared memory use so you can get higher occupancy. This is critical on Kepler: shared memory bandwidth has not increased, but the number of CUDA cores per multiprocessor has increased greatly, so the effective shared memory bandwidth available to each CUDA core is lower.

So the conclusion is: try reducing your dependency on shared memory. For example, consider using a reduction based on warp shuffle instead of shared memory.
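
Roughly, a warp-shuffle reduction looks like this (a sketch only, not tested against your code; it uses the __shfl_down intrinsic, which requires compute capability 3.0, so it would run on the Kepler GTX 650 but not the Fermi Quadro, and it assumes a 1 x 128 block as in your kernel):

// Sum across one warp without touching shared memory.
__device__ float warpReduceSum(float val)
{
    // Each step folds the upper half of the warp onto the lower half;
    // after log2(32) = 5 steps, lane 0 holds the sum of all 32 lanes.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}

// Block-wide sum for a 1 x 128 block (assumes blockDim.x == 1, so the
// linear thread index equals threadIdx.y, and blockDim.y is a multiple
// of 32). Only a tiny per-warp array is needed instead of a full
// SIZE-element shared buffer.
__device__ float blockReduceSum(float val)
{
    __shared__ float warpSums[32];        // enough for up to 1024 threads
    int lane = threadIdx.y % warpSize;
    int wid  = threadIdx.y / warpSize;

    val = warpReduceSum(val);             // reduce within each warp
    if (lane == 0) warpSums[wid] = val;   // one partial sum per warp
    __syncthreads();

    // The first warp adds up the per-warp partial sums.
    val = (threadIdx.y < blockDim.y / warpSize) ? warpSums[lane] : 0.0f;
    if (wid == 0) val = warpReduceSum(val);
    return val;                           // final sum is valid in thread (0, 0)
}

With this, the SIZE-element accumResult buffer disappears: each thread computes the squared difference for its element, feeds it into blockReduceSum, and only thread 0 writes the result. You could even drop sA and sB entirely, since each element is read exactly once.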

You may also want to review the Kepler tuning guide document.

What happens if you declare your shared arrays with SIZE+1 elements? Will it reduce shared memory bank conflicts? What is your current SIZE setting?

Christian