@happyjack272: Thanks for the suggestions.
I have some results for 2 scenarios, using the following setup:
CPU = AMD Phenom™ 9950 Quad-Core Processor clocked at 2.6 GHz
GPU = Fermi C2050
CUDA version 3.2
Also compiled FFTW3 for CentOS with -sse2
I turned off CPU frequency scaling (using sudo service cpuspeed stop) and timed the FFT execution on the CPU.
2D FFT of 20x20 elements: The CPU execution time was around 9 usecs (measured with the gettimeofday function).
For the same FFT, the GPU execution time was 21.024 usecs (CUDA 3.2 on the Fermi C2050, taken from the profile information).
2D FFT of 32x32 elements: The CPU execution time was around 19 usecs and the GPU execution time was 23.968 usecs.
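For anyone who wants to repeat the CPU measurement: gettimeofday only has microsecond granularity, so a single call that takes around 9 usecs is close to what it can resolve. Below is a minimal sketch of one way to time it, averaging over many calls. The names now_usec, mean_call_usec and dummy_work are hypothetical helpers made up for this example, and dummy_work is only a placeholder for the actual FFT call; this is not the code used to produce the numbers above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in microseconds, read with gettimeofday(). */
double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1e6 + (double)tv.tv_usec;
}

/*
 * Hypothetical helper: call fn(arg) `iterations` times and return the
 * mean wall-clock time per call in microseconds. One untimed warm-up
 * call is made first so one-time setup cost is not counted.
 */
double mean_call_usec(void (*fn)(void *), void *arg, int iterations)
{
    assert(fn != NULL && iterations > 0);

    fn(arg); /* warm-up, not timed */

    double start = now_usec();
    for (int i = 0; i < iterations; ++i)
        fn(arg);
    double stop = now_usec();

    return (stop - start) / iterations;
}

/* Placeholder workload standing in for the FFT execution (assumption). */
void dummy_work(void *arg)
{
    volatile double *acc = arg;
    for (int i = 0; i < 1000; ++i)
        *acc += i * 0.5;
}
```

To use it, declare a double acc = 0.0 and call mean_call_usec(dummy_work, &acc, 1000), replacing dummy_work with a wrapper around the FFT execution. Averaging over many iterations keeps the 1 usec granularity of gettimeofday from dominating the result.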
These numbers suggest that the 2D FFT of 32x32 elements takes more time than the 20x20 one, but that is not true of the complete program. I also timed the complete program on the GPU using the GPU clock, and the time taken by the kernel to execute using the CPU clock.
The results are as follows:
For case 1 (20x20): FFTW execution time on GPU = 580.000000 usec (timed using CPU clock)
Elapsed time on GPU = 0.89680 ms (timed using GPU clock, i.e. using cudaEventRecord)
For case 2 (32x32): FFTW execution time on GPU = 363.000000 usec (timed using CPU clock)
Elapsed time on GPU = 0.69152 ms (timed using GPU clock, i.e. using cudaEventRecord)