Reduced performance on faster GPU

Hi, I am fairly new to OpenCL and have been implementing a DSP algorithm to compare its performance on different GPUs against the standard CPU implementation. Although I have achieved a massive performance gain over the CPU, what I find strange is that I get the same overall gain on a GT240 as on the much faster GTX 480. My program executes two kernels, and while one of them speeds up on the GTX 480, the other slows down:

GT240: Kernel 1: 226 us, Kernel 2: 103 us
GTX 480: Kernel 1: 35 us, Kernel 2: 293 us
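For reference, a minimal sketch of how per-kernel GPU times like these can be collected with OpenCL event profiling (simplified, error checking omitted; the queue and kernel handles are placeholders):

#include <CL/cl.h>

/* Times one kernel launch with event profiling. The queue must have
   been created with CL_QUEUE_PROFILING_ENABLE. */
static double time_kernel_us(cl_command_queue queue, cl_kernel kernel,
                             size_t global_size)
{
    cl_event evt;
    cl_ulong t_start, t_end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    /* Event timestamps are in nanoseconds. */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    clReleaseEvent(evt);

    return (t_end - t_start) / 1000.0;
}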

Below is the code for Kernel 2, which is almost 3 times slower on the bigger card.


__kernel void max_curve_fit_gpu(__global float* fCorrelationResult,
                                const int iNumAngles,
                                const int iTotalBins,
                                __global float* fDirection_rad,
                                const int iBatchIndex)
{
    const int iBinNum = get_global_id(0);
    const int iCorrBatchOffset = iBatchIndex*(iNumAngles*iTotalBins) + iBinNum*iNumAngles;
    const int iResultBatchOffset = iBatchIndex*iTotalBins;

    // Find the max for this bin
    float fMax = 0;
    int iMaxIndex = 0;
    for (int iAngle = 0; iAngle < iNumAngles; iAngle++)
    {
        if (fMax < fCorrelationResult[iCorrBatchOffset + iAngle])
        {
            fMax = fCorrelationResult[iCorrBatchOffset + iAngle];
            iMaxIndex = iAngle;
        }
    }

    // Do the curve fit: three-point parabolic interpolation around the peak.
    // Note: the "+ iNumAngles" keeps the wrapped index non-negative, since
    // (iMaxIndex - 1) % iNumAngles is negative in C when iMaxIndex == 0.
    float fPrev, fNext, fA, fB, fAxis;
    fPrev = fCorrelationResult[iCorrBatchOffset + (iMaxIndex - 1 + iNumAngles) % iNumAngles];
    fNext = fCorrelationResult[iCorrBatchOffset + (iMaxIndex + 1) % iNumAngles];
    fB = (fPrev - fNext)*0.5f;
    fA = (fNext + fPrev) - fMax*2.0f;
    fAxis = fB / fA;
    fDirection_rad[iResultBatchOffset + iBinNum] = iMaxIndex + fAxis;
}
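(For reference, the curve fit is the standard three-point parabolic peak interpolation: fitting a parabola through the samples on either side of the maximum, the fractional peak offset works out to fAxis = (fPrev - fNext) / (2*(fPrev + fNext - 2*fMax)), which is what the last few lines compute.)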


Can somebody please point out what could be causing this?

Do you have the same configuration on both machines? As in, do you have the same driver and the same toolkit?
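One quick way to compare is to query each machine programmatically; a minimal C sketch (illustrative, error checking omitted):

#include <stdio.h>
#include <CL/cl.h>

/* Prints the device's OpenCL version and the driver version,
   to confirm both machines run the same stack. */
static void print_device_versions(cl_device_id dev)
{
    char device_version[256];
    char driver_version[256];

    clGetDeviceInfo(dev, CL_DEVICE_VERSION,
                    sizeof(device_version), device_version, NULL);
    clGetDeviceInfo(dev, CL_DRIVER_VERSION,
                    sizeof(driver_version), driver_version, NULL);

    printf("Device OpenCL version: %s\n", device_version);
    printf("Driver version:        %s\n", driver_version);
}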