I’m trying to use cuFFT in a scientific library I work on, and I’m not sure what kind of performance gain I should expect. Specifically, I’ve seen claims about the speed of 3D transforms that are vastly different from what I’m seeing, and there are other reasons to believe I may be doing something wrong in my code.
Brief summary: the app is a large set of Python modules, with the computationally intensive parts written in C++ and exposed as Python functions via Boost.Python. We rely heavily on 3D FFTs, always in double precision, using a C++ port of the fftpack library. Although we have code to link against FFTW, we do not actually use it in production because of licensing issues. However, since cuFFT uses an FFTW-like API, it was relatively easy for me to modify our FFTW interface to use cuFFT instead, without any additional conversion of our native data structures. Both the CPU and GPU transforms are done in-place. The results from cuFFT and fftpack are close enough to identical in most cases; the exception so far is a 512x512x512 real-to-complex transform, for reasons I haven’t determined yet but which may have something to do with FFTW compatibility mode.
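For reference, the cuFFT path boils down to something like the sketch below. This is simplified, illustrative code rather than our actual interface; the function and buffer names are mine, and it assumes the host buffer is already padded for the in-place real-to-complex layout.

```cpp
// Simplified sketch of the in-place double-precision 3D R2C path.
// Assumes host_data was allocated with nx*ny*2*(nz/2+1) doubles, i.e.
// the last dimension is padded to match FFTW's in-place R2C layout.
#include <cufft.h>
#include <cuda_runtime.h>

void transform_r2c_inplace(double *host_data, int nx, int ny, int nz)
{
    size_t bytes = (size_t)nx * ny * (nz / 2 + 1) * sizeof(cufftDoubleComplex);

    void *dev_data;
    cudaMalloc(&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL);

    // In-place transform: input and output alias the same device buffer.
    cufftExecD2Z(plan, (cufftDoubleReal *)dev_data,
                 (cufftDoubleComplex *)dev_data);
    cudaThreadSynchronize();

    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(dev_data);
}
```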
The dimensions of our transforms depend on the specific input data and are not easily refactored; a typical real-world example was 100x100x72. However, I’ve been testing with both power-of-two sizes (e.g. 128x128x128) and irregular sizes. In both cases, the speedup of cuFFT over single-threaded fftpack is only about 3-4x on an 8-core Intel 5530 system (Red Hat 4) with a single Tesla C2050 card. (The memory transfer overhead is approximately 10% of the overall runtime.) For a relatively large transform, on the order of 500x500x500, the runtime for the transform alone is approximately 4.5 seconds. (After searching the forum, I added cudaThreadSynchronize() after cufftExec* before taking the stop time, and I’m timing the memory transfers separately, so I’m fairly confident I’m doing this part right.) What is more perplexing is that the speedup does not appear to depend on the transform dimensions: it is very consistent across everything I’ve tried, except for some smaller power-of-two sizes where cuFFT is actually slower.
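Concretely, the timing pattern looks roughly like this (again a sketch, reusing the plan and device buffer from the code above; in the real harness the host-device copies are timed separately):

```cpp
// Wall-clock timing of the transform alone, excluding transfers.
#include <sys/time.h>
#include <cufft.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

double time_one_transform(cufftHandle plan, void *dev_data)
{
    double t0 = wall_seconds();
    cufftExecD2Z(plan, (cufftDoubleReal *)dev_data,
                 (cufftDoubleComplex *)dev_data);
    // cufftExec* returns before the GPU finishes; synchronize so the
    // wall-clock measurement covers the whole transform.
    cudaThreadSynchronize();
    return wall_seconds() - t0;
}
```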
Other details:
- I’m compiling and linking with gcc/g++ 4.1 right now. However, I also tried compiling my C++ sources using nvcc, with no change in outcome.
- Running “ldd” on my library definitely shows it linking to libcufft.so, not the emulation library.
- I ran into the initialization overhead almost immediately, so all of my testing is done by running each transform once, then timing 4-8 successive transforms.
- I’m creating a new plan each time I do a transform. Since the dimensions do not usually change between transforms, I could refactor my code to cache the plans and re-use them (see the sketch after this list), but since my timing indicates that plan creation has negligible overhead, I don’t think this will make a difference.
- I am calling cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL) on each plan. (This didn’t appear to make any difference in my tests, but it seemed safest.)
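For what it’s worth, the cached-plan variant would look something like the following. This is hypothetical code I haven’t actually deployed; it keys plans on the transform dimensions and creates each plan only once:

```cpp
// Hypothetical plan cache keyed on transform dimensions. Each unique
// (nx, ny, nz) gets one plan, created on first use and reused after.
#include <map>
#include <cufft.h>

struct Dims {
    int nx, ny, nz;
    bool operator<(const Dims &o) const {
        if (nx != o.nx) return nx < o.nx;
        if (ny != o.ny) return ny < o.ny;
        return nz < o.nz;
    }
};

static std::map<Dims, cufftHandle> plan_cache;

cufftHandle get_plan(int nx, int ny, int nz)
{
    Dims d = { nx, ny, nz };
    std::map<Dims, cufftHandle>::iterator it = plan_cache.find(d);
    if (it != plan_cache.end())
        return it->second;

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL);
    plan_cache[d] = plan;
    return plan;
}
```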
Is there something else I need to do to get the full performance gain from the Fermi architecture? Or are these results consistent with what others have seen? Any advice would be appreciated. I’ve included simplified sketches of the relevant code above; the real thing is not much more complicated, and as I said, the results appear to be numerically correct in most cases. Thanks.