cuFFT question: comparing R2C and C2C 2D FFTs

I did a 400-point 2D FFT on my input data using two methods:

  1. a C2C forward transform of size nx*ny, and

  2. an R2C transform of the same input, whose complex output has size nx*(ny/2 + 1)

Observations when profiling the code:

Method 1 calls SP_c2c_mradix_sp_kernel twice, for a total of 24 usec.

Method 2 calls SP_c2c_mradix_sp_kernel (12.32 usec) and SP_r2c_mradix_sp_kernel (12.32 usec).

So in the end there is no improvement from using the real-to-complex transform over the complex-to-complex transform. Theoretically there should be one, since Method 2 only computes half of the second dimension of the output. Am I missing something? This expected saving is also mentioned on page 21 of the CUFFT Library 3.1 manual.
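
For reference, here is a rough sketch of the two plans being compared (a minimal sketch, assuming in_c/out_c and in_r/out_r are device buffers of the appropriate types and sizes; error checking omitted):

// Method 1: C2C forward plan; output is nx*ny cufftComplex values
cufftHandle plan_c2c;
cufftPlan2d(&plan_c2c, nx, ny, CUFFT_C2C);
cufftExecC2C(plan_c2c, in_c, out_c, CUFFT_FORWARD);  // in_c, out_c: cufftComplex*

// Method 2: R2C plan; output is only nx*(ny/2 + 1) cufftComplex values,
// because the redundant half of the spectrum of a real input is dropped
cufftHandle plan_r2c;
cufftPlan2d(&plan_r2c, nx, ny, CUFFT_R2C);
cufftExecR2C(plan_r2c, in_r, out_r);                 // in_r: cufftReal*, out_r: cufftComplex*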

Secondly, my R2C results do not match between CUFFT and FFTW. I don't know what the issue is; both versions are below.

// CUFFT version (single-precision R2C)
double* ffcorr1;
cufftComplex *f1_d;
cudaMalloc((void**) &ffcorr1, sizeof(double) * pix3);
cudaMalloc((void**) &f1_d, sizeof(cufftComplex) * pix1 * (pix2/2 + 1));

// create plan for the CUDA FFT
cufftHandle plan_forward1;
CUFFT_SAFE_CALL(cufftPlan2d(&plan_forward1, pix1, pix2, CUFFT_R2C));
CUFFT_SAFE_CALL(cufftExecR2C(plan_forward1, (cufftReal*) ffcorr1, f1_d)); // cast double* ffcorr1 as cufftReal*

// destroy CUFFT context
CUFFT_SAFE_CALL(cufftDestroy(plan_forward1));
// FFTW version (double-precision R2C)
double* ffcorr1;
fftw_complex *f1;
fftw_plan plan_forward1; // plan declaration was missing above
ffcorr1 = (double*) malloc(sizeof(double) * pix3);
f1 = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * pix1 * (pix2/2 + 1) * n);
plan_forward1 = fftw_plan_dft_r2c_2d(pix1, pix2, ffcorr1, f1, FFTW_ESTIMATE);
fftw_execute(plan_forward1);

Are they at all similar? I noticed you are using single precision for CUDA and double precision for FFTW. Try the double-precision cuFFT API if your card supports it.
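
For reference, a minimal sketch of the double-precision path (assuming the same pix1/pix2/pix3 as above; cuFFT's double-precision real-to-complex transform type is CUFFT_D2Z, executed with cufftExecD2Z):

double *ffcorr1;
cufftDoubleComplex *f1_d;
cudaMalloc((void**) &ffcorr1, sizeof(double) * pix3);
cudaMalloc((void**) &f1_d, sizeof(cufftDoubleComplex) * pix1 * (pix2/2 + 1));

cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan2d(&plan, pix1, pix2, CUFFT_D2Z));            // D2Z = double real-to-complex
CUFFT_SAFE_CALL(cufftExecD2Z(plan, (cufftDoubleReal*) ffcorr1, f1_d)); // no narrowing cast this time
CUFFT_SAFE_CALL(cufftDestroy(plan));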

Can someone explain in more detail why a C2C kernel is still called when I use the R2C transform?

@mailmerge I will use the double-precision API described in the CUFFT manual. I hope it is not much more time-intensive.

The FFT execution using CUFFT takes about 24 usec. Is that the fastest achievable? (I'm reading up on Vasily's paper/tech report.)

Just a query: is it wise to compare the CUFFT implementation on a GPU against a LUT-based FFT on an ASIC? I am trying to do this because (1) the GPU can do a batch FFT on the data and (2) it retains precision. FFT gurus: any feedback?

I used single precision and the values now match between CUDA and C. However, is there any benefit to using the R2C transform over the C2C transform in cuFFT?

In CUDA 3.2, the FFT libraries behave strangely. When I use only one FFT call, the kernels take 12.864 + 6.592 usec in total. However, when I use two FFT calls, each call's kernels take 10.24 + 6.336 usec, i.e. (10.24 + 6.336) * 2 in total.

However, the total execution time of my code went up from 0.79 ms with CUDA 3.1 to 1.274 ms with CUDA 3.2 (measured with cudaEventRecord).

Has anyone seen this? I am probably going to downgrade to CUDA 3.1.
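
For what it's worth, the timing here is done with CUDA events, roughly like this (a minimal sketch; plan, in_r, and out_c are assumed to be set up already):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftExecR2C(plan, in_r, out_c);        // the call being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // wait for the FFT to finish

cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);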

Yes, I have seen that cuFFT 3.2 runs significantly slower on my GT240 for vector sizes below 16K.
For example, for a 1K complex-to-complex FFT, I get 22 microseconds with 3.1 and 58 microseconds with 3.2.

I hope NVIDIA addresses this before the formal release.

@Tetters: did you try the Nukada FFT library? Let me know if there is any improvement.

Another observation: a 20x20 2D FFT takes 18.4 us, but when using the batched 2D FFT (cufftPlanMany) with a batch size of 50, the profiler shows an execution time of 664.96 us. That is an almost linear increase in execution time, i.e. batching buys almost nothing here. The setup is sketched below.
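
A minimal sketch of the batched plan, assuming d_in and d_out hold 50 tightly packed 20x20 cufftComplex signals (passing NULL for inembed/onembed tells cuFFT the data are contiguous):

int n[2] = {20, 20};
int batch = 50;
cufftHandle plan;
cufftPlanMany(&plan, 2, n,
              NULL, 1, 20*20,   // input layout: one signal every 20*20 elements
              NULL, 1, 20*20,   // output layout: same
              CUFFT_C2C, batch);
cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
cufftDestroy(plan);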

In 3.2 RC1, we implemented a new transform algorithm for many non-power-of-two sizes that improved the accuracy by quite a bit. However, the version of that implementation that shipped with 3.2 RC1 was missing a few performance optimizations that we have since added in order to match 3.1's perf as closely as possible while still getting the improved accuracy.

If you can tell us some of the specific sizes you’re interested in, we can check them to be sure this is fixed now.

Thanks,

Cliff

@Cliff: I am looking at a 2D FFT of size 20x20. Since 20 is not a power of two, performance is not great with the CUFFT library. I am also timing the FFT kernel for a batch size of 50.

Do let us know if there are improvements.

The sizes don’t have to be powers of two to get reasonable performance (I was oversimplifying).

So are you saying that 20x20 with batch=50 performs better for you with CUFFT 3.1 than it does with CUFFT 3.2 RC1? The CUFFT developers are telling me that that ought not be the case…

Thanks,

Cliff