CUFFT question: comparing R2C and C2C 2D FFTs

The sizes don’t have to be powers of two to get reasonable performance (I was oversimplifying).

So are you saying that 20x20 with batch=50 performs better for you with CUFFT 3.1 than it does with CUFFT 3.2 RC1? The CUFFT developers are telling me that that ought not be the case…

Thanks,

Cliff

When doing a 20x20 2D FFT with CUFFT, I see SP_c2c_mradix_sp_kernel called twice along with a memcpyHtoD. Why is there a memcpyHtoD when I am using zero-copy? The profiler reports 20.288 us (GPU time) for SP_c2c_mradix_sp_kernel plus 4.48 us (GPU time) for the memcpyHtoD.
I also measured the time using cudaEventRecord and got 0.782624 ms.

These values are slower than the execution time with the FFTW library built with -sse2 optimization on a 2.9 GHz Nehalem CPU. Any suggestions for improvement?
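For reference, here is a minimal sketch of how I would set up a zero-copy (mapped host memory) 20x20 C2C transform; variable names are mine and error checking is omitted, and it assumes a device that supports mapped host memory:

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

int main(void) {
    const int NX = 20, NY = 20;
    cufftComplex *h_data, *d_data;

    // Must be set before any CUDA allocation for mapped (zero-copy) memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host buffer, mapped into the device address space.
    cudaHostAlloc((void**)&h_data, sizeof(cufftComplex) * NX * NY,
                  cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // ... fill h_data with the input signal ...

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);  // plan creation is expensive: do it once
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place, zero-copy
    cudaThreadSynchronize();  // CUDA 3.x-era sync; cudaDeviceSynchronize() in 4.0+

    cufftDestroy(plan);
    cudaFreeHost(h_data);
    return 0;
}
```

Even with a setup like this, the small HtoD copy the profiler shows may be CUFFT staging its own internal data (e.g. plan or twiddle tables) rather than your input buffer, but that is speculation on my part.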

@Cliff: I have posted regarding CUFFT + CUDA 3.2 execution on the C1060 and C2050 in the forums: http://forums.nvidia.com/index.php?showtopic=184059
Any suggestions?

Host-to-device and device-to-host memory transfers can be big performance eaters. Try doing a few FFTs on the device with the same data vs. lots of them, compare the timings, and fit a linear relationship:

time = a + b*(number of FFTs)

Then, roughly speaking, a is your d-to-h / h-to-d memory transfer time and b is the actual GPU compute time per FFT.

Taking the memory transfer cost into account, it may sometimes be more economical to do the compute on the CPU than on the GPU, especially with smaller data sets (which 20x20 certainly is).

But since the FFT is an O(N log N) algorithm, i.e. worse than linear, your performance gains on the GPU over the CPU, even including memory transfer, will get better and better with larger data sets.

@happyjack272: Thanks for the suggestions.

I have some results for 2 scenarios:
CPU = AMD Phenom™ 9950 Quad-Core Processor clocked at 2.6 GHz
GPU = Fermi C2050
CUDA version 3.2
Also compiled FFTW3 for CentOS with -sse2

I turned off CPU frequency scaling (using sudo service cpuspeed stop) and timed the FFT execution on the CPU.

Case 1:
2D FFT of 20x20 elements: the CPU execution time was around 9 usec (measured with gettimeofday).
The GPU execution time was 21.024 usec (CUDA 3.2 on the Fermi C2050, from the profiler).

Case 2:
2D FFT of 32x32 elements: the CPU execution time was around 19 usec, and the GPU execution time was 23.968 usec.

Although this per-kernel analysis makes it look like the 32x32 2D FFT takes more time, that is not the whole story. I also timed the complete program on the GPU (using the GPU clock) and the kernel execution using the CPU clock.

The results are as follows:
For case 1: CUFFT execution time on the GPU = 580 usec (timed with the CPU clock)
Elapsed time on the GPU = 0.89680 ms (timed with the GPU clock)

For case 2: CUFFT on the device = 363 usec (timed with the CPU clock)
Elapsed time on the GPU = 0.69152 ms (timed with the GPU clock, i.e. using cudaEventRecord)
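One thing that could explain the gap between the profiler's ~20 us kernel time and the ~0.7-0.9 ms from cudaEventRecord: if the events bracket plan creation or the very first launch, they include one-time setup cost. A sketch of how I would isolate just the transform (variable names are mine; d_data is assumed to be a ready device buffer):

```cuda
cufftHandle plan;
cufftPlan2d(&plan, 20, 20, CUFFT_C2C);  // one-time cost: keep outside the timed region

cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // warm-up launch
cudaThreadSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // the launch we actually time
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // event timer resolution is ~0.5 us
```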
