Performance of CuFFT 3.1 library

Hi

I am using CUFFT library version 3.1 and comparing 1D CUFFT running on an NVIDIA GTX 260 (216 cores) with the MATLAB FFT running on a CPU. I know the CPU is faster for small FFT sizes (<1024), but with batched FFTs I expected CUFFT to be faster at any FFT size.

I use power-of-two sizes and the GPUmat wrapper for the CUFFT API. I always find that the FFT on the CPU is much faster than CUFFT on the GPU for FFT sizes below 2048, even with batched FFTs (the total number of points is fixed at 8 million; see the attached figures). Can anyone explain this or suggest something I could try?
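To make the setup concrete, here is a minimal sketch of how the batch count follows from the FFT size when the total number of points is fixed. It assumes that "8 million" means 2^23 = 8388608 points, so that every power-of-two size divides it evenly; the names totalPoints, sizes and batches are illustrative only.

```matlab
% Assumption: "8 million points" is taken to be 2^23 = 8388608.
totalPoints = 2^23;

% Power-of-two FFT sizes 2, 4, ..., 16384 (the sizes in the results below).
sizes = 2.^(1:14);

% Number of transforms per batch so that sizes .* batches == totalPoints.
batches = totalPoints ./ sizes;

for k = 1:numel(sizes)
    fprintf('N = %5d  Batch = %7d\n', sizes(k), batches(k));
end
```

Under this assumption the smallest size (N = 2) uses 4194304 transforms per batch and the largest (N = 16384) uses 512.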

Here is the MATLAB code:

function d_A = GPUfft(d_A, d_B, N, Batch)
% Forward FFT from d_A into d_B, then inverse FFT from d_B back into d_A.
fftType = cufftType;
fftDir  = cufftTransformDirections;

% FFT plan: Batch complex-to-complex transforms of length N
plan = 0;
[status, plan] = cufftPlan1d(plan, N, fftType.CUFFT_C2C, Batch);
cufftCheckStatus(status, 'Error in cufftPlan1d');

% Run GPU FFT
[status] = cufftExecC2C(plan, getPtr(d_A), getPtr(d_B), fftDir.CUFFT_FORWARD);
cufftCheckStatus(status, 'Error in cufftExecC2C');

% Run GPU IFFT
[status] = cufftExecC2C(plan, getPtr(d_B), getPtr(d_A), fftDir.CUFFT_INVERSE);
cufftCheckStatus(status, 'Error in cufftExecC2C');

% Results should be scaled by 1/N if compared to the CPU
% h_B = 1/N.*single(d_A);

% Release the plan
[status] = cufftDestroy(plan);
cufftCheckStatus(status, 'Error in cufftDestroy');
end
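For comparison, the CPU side of such a test can be sketched in plain MATLAB. This is only an assumption about how a batched CPU measurement could be written, not the code behind the results in this post: it stores the batch as the columns of an N-by-Batch matrix, since fft and ifft operate along columns, and it uses a smaller total (2^20 points) so that it runs quickly. The variable names are illustrative.

```matlab
% Assumption: one batch = columns of an N-by-Batch single-precision matrix.
N     = 1024;
Batch = 2^20 / N;   % smaller total than in the post, to keep the sketch fast

A = complex(rand(N, Batch, 'single'), rand(N, Batch, 'single'));

tic;
B = fft(A);    % Batch forward transforms of length N, one per column
C = ifft(B);   % inverse transforms; ifft already applies the 1/N scaling
tCPU = toc;

% Round-trip error: C should match A up to single-precision rounding.
err = max(abs(C(:) - A(:)));
fprintf('CPU time for %d = %f (round-trip error %g)\n', N, tCPU, err);
```

Unlike the CUFFT inverse transform, MATLAB's ifft is already normalized, so no extra 1/N factor is needed on the CPU side.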

Results:

=============================
FFT size    GPU time    CPU time
       2    0.204395    0.000090
       4    0.014395    0.000029
       8    0.014310    0.000027
      16    0.013884    0.000021
      32    0.014274    0.000031
      64    0.014726    0.000069
     128    0.014784    0.000181
     256    0.015566    0.000527
     512    0.014721    0.001977
    1024    0.017689    0.007305
    2048    0.020455    0.025084
    4096    0.021909    0.103657
    8192    0.026931    0.465617
   16384    0.032093    2.494288
=============================
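As a quick check on the numbers above, this sketch recomputes the CPU-to-GPU time ratio from the posted timings and finds the smallest FFT size at which the GPU time drops below the CPU time. The values are copied from the results; the variable names are illustrative.

```matlab
% FFT sizes and timings copied from the results above.
sizes = 2.^(1:14);
gpu = [0.204395 0.014395 0.014310 0.013884 0.014274 0.014726 0.014784 ...
       0.015566 0.014721 0.017689 0.020455 0.021909 0.026931 0.032093];
cpu = [0.000090 0.000029 0.000027 0.000021 0.000031 0.000069 0.000181 ...
       0.000527 0.001977 0.007305 0.025084 0.103657 0.465617 2.494288];

% Ratio > 1 means the GPU was faster at that size.
ratio = cpu ./ gpu;

% Smallest size at which the GPU time is below the CPU time.
firstWin = sizes(find(ratio > 1, 1));

for k = 1:numel(sizes)
    fprintf('N = %5d  CPU/GPU = %10.4f\n', sizes(k), ratio(k));
end
fprintf('GPU first faster at N = %d\n', firstWin);
```

With these timings the crossover is at N = 2048. Below that size the GPU time stays close to 0.014 to 0.018 regardless of N (apart from the first run at N = 2), while the CPU time keeps shrinking.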
CUFFT_plot2.jpg
CUFFT_plot1.jpg
