cuFFT question: comparing R2C and C2C 2D FFTs

I did a 400-point 2D FFT on my input data using two methods:

  1. a C2C forward transform of size nx*ny, and

  2. an R2C transform of the same input, whose complex output has size nx*(ny/2 + 1)

Observations when profiling the code:

Method 1 calls SP_c2c_mradix_sp_kernel twice, for a total of 24 usec.

Method 2 calls SP_c2c_mradix_sp_kernel (12.32 usec) and SP_r2c_mradix_sp_kernel (12.32 usec).

So in the end there is no improvement from using the real-to-complex transform over the complex-to-complex transform. Theoretically there should be one, since Method 2 only computes half of the second dimension of the output. Am I missing something? This expected saving is also mentioned on page 21 of the CUFFT Library 3.1 manual.
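
For reference, here is a rough sketch of the two plans being compared (a minimal sketch, assuming in_c/out_c and in_r/out_r are device buffers of the appropriate types and sizes; error checking omitted):

// Method 1: C2C forward plan; output is nx*ny cufftComplex values
cufftHandle plan_c2c;
cufftPlan2d(&plan_c2c, nx, ny, CUFFT_C2C);
cufftExecC2C(plan_c2c, in_c, out_c, CUFFT_FORWARD);  // in_c, out_c: cufftComplex*

// Method 2: R2C plan; output is only nx*(ny/2 + 1) cufftComplex values,
// because the redundant half of the spectrum of a real input is dropped
cufftHandle plan_r2c;
cufftPlan2d(&plan_r2c, nx, ny, CUFFT_R2C);
cufftExecR2C(plan_r2c, in_r, out_r);                 // in_r: cufftReal*, out_r: cufftComplex*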

Secondly, my R2C results do not match between CUFFT and FFTW. I don't know what the issue is; both versions are below.

// CUFFT version (single-precision R2C)
double* ffcorr1;
cufftComplex *f1_d;
cudaMalloc((void**) &ffcorr1, sizeof(double) * pix3);
cudaMalloc((void**) &f1_d, sizeof(cufftComplex) * pix1 * (pix2/2 + 1));

// create plan for the CUDA FFT
cufftHandle plan_forward1;
CUFFT_SAFE_CALL(cufftPlan2d(&plan_forward1, pix1, pix2, CUFFT_R2C));
CUFFT_SAFE_CALL(cufftExecR2C(plan_forward1, (cufftReal*) ffcorr1, f1_d)); // cast double* ffcorr1 as cufftReal*

// destroy CUFFT context
CUFFT_SAFE_CALL(cufftDestroy(plan_forward1));
// FFTW version (double-precision R2C)
double* ffcorr1;
fftw_complex *f1;
fftw_plan plan_forward1; // plan declaration was missing above
ffcorr1 = (double*) malloc(sizeof(double) * pix3);
f1 = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * pix1 * (pix2/2 + 1) * n);
plan_forward1 = fftw_plan_dft_r2c_2d(pix1, pix2, ffcorr1, f1, FFTW_ESTIMATE);
fftw_execute(plan_forward1);

Are they at all similar? I noticed you are using single precision for CUDA and double precision for FFTW. Try the double-precision cuFFT API if your card supports it.
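
For reference, a minimal sketch of the double-precision path (assuming the same pix1/pix2/pix3 as above; cuFFT's double-precision real-to-complex transform type is CUFFT_D2Z, executed with cufftExecD2Z):

double *ffcorr1;
cufftDoubleComplex *f1_d;
cudaMalloc((void**) &ffcorr1, sizeof(double) * pix3);
cudaMalloc((void**) &f1_d, sizeof(cufftDoubleComplex) * pix1 * (pix2/2 + 1));

cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan2d(&plan, pix1, pix2, CUFFT_D2Z));            // D2Z = double real-to-complex
CUFFT_SAFE_CALL(cufftExecD2Z(plan, (cufftDoubleReal*) ffcorr1, f1_d)); // no narrowing cast this time
CUFFT_SAFE_CALL(cufftDestroy(plan));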

Can someone explain in more detail why a C2C kernel is still called when I use the R2C transform?

@mailmerge I will use the double-precision API described in the CUFFT manual. I hope it is not much more time-intensive.

The FFT execution using CUFFT takes about 24 usec. Is that the fastest achievable? (I'm reading up on Vasily's paper/tech report.)

Just a query: is it wise to compare the CUFFT implementation on a GPU against a LUT-based FFT on an ASIC? I am trying to do this because (1) the GPU can do a batch FFT on the data and (2) it retains precision. FFT gurus: any feedback?

I used single precision and the values now match between CUDA and C. However, is there any benefit to using the R2C transform over the C2C transform in cuFFT?

In CUDA 3.2, the FFT libraries behave strangely. When I use only one FFT call, the kernels take 12.864 + 6.592 usec in total. However, when I use two FFT calls, each call's kernels take 10.24 + 6.336 usec, i.e. (10.24 + 6.336) * 2 in total.

However, the total execution time of my code went up from 0.79 ms with CUDA 3.1 to 1.274 ms with CUDA 3.2 (measured with cudaEventRecord).

Has anyone seen this? I am probably going to downgrade to CUDA 3.1.
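
For what it's worth, the timing here is done with CUDA events, roughly like this (a minimal sketch; plan, in_r, and out_c are assumed to be set up already):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftExecR2C(plan, in_r, out_c);        // the call being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // wait for the FFT to finish

cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);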

Yes, I have seen that cuFFT 3.2 runs significantly slower on my GT240 for vector sizes below 16K.
For example, for a 1K complex-to-complex FFT, I get 22 microseconds with 3.1 and 58 microseconds with 3.2.

I hope NVIDIA addresses this before the formal release.

@Tetters: did you try the Nukada FFT library? Let me know if there is any improvement.

Another observation: a 20x20 2D FFT takes 18.4 us, but when using the batched 2D FFT (cufftPlanMany) with a batch size of 50, the profiler shows an execution time of 664.96 us. That is an almost linear increase in execution time, i.e. batching buys almost nothing here. The setup is sketched below.
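
A minimal sketch of the batched plan, assuming d_in and d_out hold 50 tightly packed 20x20 cufftComplex signals (passing NULL for inembed/onembed tells cuFFT the data are contiguous):

int n[2] = {20, 20};
int batch = 50;
cufftHandle plan;
cufftPlanMany(&plan, 2, n,
              NULL, 1, 20*20,   // input layout: one signal every 20*20 elements
              NULL, 1, 20*20,   // output layout: same
              CUFFT_C2C, batch);
cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
cufftDestroy(plan);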

In 3.2 RC1, we implemented a new transform algorithm for many non-power-of-two sizes that improved the accuracy by quite a bit. However, the version of that implementation that shipped with 3.2 RC1 was missing a few performance optimizations that we have since added in order to match 3.1's perf as closely as possible while still getting the improved accuracy.

If you can tell us some of the specific sizes you’re interested in, we can check them to be sure this is fixed now.

Thanks,

Cliff

@Cliff: I am looking at a 2D FFT of size 20x20. Since 20 is not a power of two, performance is not great with the CUFFT library. I am also timing the FFT kernel for a batch size of 50.

Do let us know if there are improvements.

The sizes don’t have to be powers of two to get reasonable performance (I was oversimplifying).

So are you saying that 20x20 with batch=50 performs better for you with CUFFT 3.1 than it does with CUFFT 3.2 RC1? The CUFFT developers are telling me that that ought not be the case…

Thanks,

Cliff