Hi,
I have the following code, which accounts for ~95% of the overall CPU+GPU run time. Any idea whether it can be optimized further? One C1060 gives
a 15x speedup over one CPU thread. Is this speedup reasonable, or should cuFFT give a higher one?
The 2D FFT is run 9 * 544 times. Is that a limiting factor, i.e. the fact that the GPU does not see the work as one big chunk? I saw that batched FFTs are
not yet supported…
m_nxfft == 1680
m_nyfft == 84
extern "C" int RunSecondPhaseFFT( int iDeviceId, GGPUTimeMigrationParams &sGPUParams )
{
    cufftHandle plan;
    // 1D complex-to-complex plan of length m_nxfft, batch count 1.
    GCUFFT_SAFE_CALL( iDeviceId, cufftPlan1d( &plan, sGPUParams.m_nxfft, CUFFT_C2C, 1 ) );
    for ( int k = 10 - 1; k >= 0; k-- )
    {
        for ( int iCurrentFreq = 1; iCurrentFreq < 544; iCurrentFreq++ )
        {
            // Element offset of the (iCurrentFreq, k) slab of m_nxfft * m_nyfft values.
            unsigned long lFFTStartPosition = iCurrentFreq * sGPUParams.m_iVelCount;
            lFFTStartPosition *= sGPUParams.m_nxfft * sGPUParams.m_nyfft;
            lFFTStartPosition += k * sGPUParams.m_nxfft * sGPUParams.m_nyfft;
            // In-place inverse transform starting at that offset.
            GCUFFT_SAFE_CALL( iDeviceId, cufftExecC2C( plan, &( sGPUParams.m_DEVICE_pWorkArray[ lFFTStartPosition ] ),
                &( sGPUParams.m_DEVICE_pWorkArray[ lFFTStartPosition ] ), CUFFT_INVERSE ) );
        }
    }
    GCUFFT_SAFE_CALL( iDeviceId, cufftDestroy( plan ) );
    return 1;
}
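To make the "one big chunk" question concrete, here is a minimal host-side sketch of the index arithmetic used in the loop above. It is not part of the original code: the helper names (SlabOffset, CountExecCalls, SlabsContiguousInK) are hypothetical, and the value passed for velCount stands in for m_iVelCount, which is not given here. It counts how many cufftExecC2C calls the loops issue as written (10 values of k times 543 frequencies) and checks whether the slabs for consecutive k values sit back to back in memory, which is what a batched transform would rely on.

```cpp
#include <cstddef>

// Sizes quoted in the post.
const std::size_t kNxFft = 1680;
const std::size_t kNyFft = 84;

// Element offset of the (freq, k) slab, same arithmetic as the loop body in
// RunSecondPhaseFFT, but done in size_t because unsigned long is only
// 32 bits wide on Windows.
// velCount is a stand-in for m_iVelCount (value not given in the post).
std::size_t SlabOffset(std::size_t freq, std::size_t k, std::size_t velCount)
{
    return (freq * velCount + k) * kNxFft * kNyFft;
}

// Number of cufftExecC2C calls issued by the loops exactly as written:
// k = 9..0 and iCurrentFreq = 1..543.
std::size_t CountExecCalls()
{
    std::size_t calls = 0;
    for (int k = 10 - 1; k >= 0; k--)
        for (int freq = 1; freq < 544; freq++)
            ++calls;
    return calls;
}

// True if, for every frequency, the slabs for k and k + 1 are exactly one
// slab (kNxFft * kNyFft elements) apart, i.e. contiguous in memory.
bool SlabsContiguousInK(std::size_t velCount)
{
    const std::size_t slab = kNxFft * kNyFft;
    for (std::size_t freq = 1; freq < 544; ++freq)
        for (std::size_t k = 0; k + 1 < 10; ++k)
            if (SlabOffset(freq, k + 1, velCount) - SlabOffset(freq, k, velCount) != slab)
                return false;
    return true;
}
```

Under the assumption velCount == 10, CountExecCalls() gives 5430 separate launches and SlabsContiguousInK(10) holds, so the data for one frequency is a single contiguous run of 10 slabs. Whether that run can be handed to cuFFT in one call depends on the batching support of the installed cuFFT version.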
I’m using CUDA 3.0 on Windows.
Thanks a lot,
Eyal