cuFFT performance question

Hi,

I have the following code, which takes ~95% of the overall CPU+GPU run time. Is there anything here that is obviously not optimized? The speedup of one C1060 over one CPU thread is about 15x. Is that reasonable, or should cuFFT give a higher speedup?

The 2D FFT is run 9 * 544 times. Is that a limiting factor, i.e. the fact that the GPU does not see the work as one big chunk? I saw that batched FFTs are not yet supported…

m_nxfft == 1680
m_nyfft == 84

extern "C" int RunSecondPhaseFFT( int iDeviceId, GGPUTimeMigrationParams &sGPUParams )
{
	cufftHandle plan;

	// One 1D complex-to-complex plan of length m_nxfft (batch count 1), reused for every call.
	GCUFFT_SAFE_CALL( iDeviceId, cufftPlan1d( &plan, sGPUParams.m_nxfft, CUFFT_C2C, 1 ) );

	for ( int k = 10 - 1; k >= 0; k-- )
	{
		for ( int iCurrentFreq = 1; iCurrentFreq < 544; iCurrentFreq++ )
		{
			// Element offset of slab (iCurrentFreq, k) inside the device work array.
			unsigned long lFFTStartPosition = iCurrentFreq * sGPUParams.m_iVelCount;
			lFFTStartPosition *= sGPUParams.m_nxfft * sGPUParams.m_nyfft;
			lFFTStartPosition += k * sGPUParams.m_nxfft * sGPUParams.m_nyfft;

			// In-place inverse transform starting at that offset.
			GCUFFT_SAFE_CALL( iDeviceId, cufftExecC2C( plan, &( sGPUParams.m_DEVICE_pWorkArray[ lFFTStartPosition ] ),
						&( sGPUParams.m_DEVICE_pWorkArray[ lFFTStartPosition ] ), CUFFT_INVERSE ) );
		}
	}

	GCUFFT_SAFE_CALL( iDeviceId, cufftDestroy( plan ) );

	return 1;
}
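
To make the indexing and the number of cuFFT calls explicit, here is a minimal host-only sketch of the offset arithmetic used in the loops above. It does not call cuFFT at all. FftLayout, slabOffset and countLaunches are hypothetical names that are not part of the code above, and any value passed for m_iVelCount is an assumption for illustration, since it is not given in this post.

```cpp
#include <cstddef>

// Hypothetical stand-in for the relevant fields of GGPUTimeMigrationParams.
struct FftLayout {
    int nxfft;     // m_nxfft in the post (1680)
    int nyfft;     // m_nyfft in the post (84)
    int velCount;  // m_iVelCount; not stated in the post, so any value is an assumption
};

// Element offset of slab (freq, k), i.e. (freq * velCount + k) * nxfft * nyfft.
// The products are evaluated in std::size_t rather than int.
std::size_t slabOffset(const FftLayout &p, int freq, int k) {
    const std::size_t slab = static_cast<std::size_t>(p.nxfft) * p.nyfft;
    return (static_cast<std::size_t>(freq) * p.velCount + k) * slab;
}

// Counts the cufftExecC2C calls issued by the two nested loops,
// using the same loop bounds as the original code.
int countLaunches(int kCount, int freqCount) {
    int launches = 0;
    for (int k = kCount - 1; k >= 0; k--) {
        for (int freq = 1; freq < freqCount; freq++) {
            launches++;
        }
    }
    return launches;
}
```

With the loop bounds as written, countLaunches(10, 544) gives 5430 calls (10 values of k times 543 frequencies), and one slab holds 1680 * 84 = 141120 complex elements.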

I’m using CUDA 3.0 on Windows.

Thanks a lot,

Eyal
