I’m trying to use cuFFT in a scientific library I work on, and I’m not sure what kind of performance gain I should expect. Specifically, I’ve seen claims about the speed of 3D transforms that are vastly different from what I’m seeing, and there are other reasons to believe I may be doing something wrong in my code.
Brief summary: the app is a large set of Python modules, with the computationally intensive parts written in C++ and exposed as Python functions via Boost.Python. We rely heavily on 3D FFTs, always in double precision, using a C++ port of the fftpack library. Although we have code to link against FFTW, we do not actually use it in production because of licensing issues. However, since cuFFT uses an FFTW-like API, it was relatively easy for me to modify our FFTW interface to use cuFFT instead, without any additional conversion of our native data structures. Both the CPU and GPU transforms are done in-place. The results from cuFFT and fftpack are close enough to identical in most cases; the exception so far is a 512x512x512 real-to-complex transform, for reasons I haven’t determined yet but which may have something to do with FFTW compatibility mode.
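For reference, the cuFFT path boils down to something like the sketch below. This is simplified, illustrative code rather than our actual interface; the function and buffer names are mine, and it assumes the host buffer is already padded for the in-place real-to-complex layout.

```cpp
// Simplified sketch of the in-place double-precision 3D R2C path.
// Assumes host_data was allocated with nx*ny*2*(nz/2+1) doubles, i.e.
// the last dimension is padded to match FFTW's in-place R2C layout.
#include <cufft.h>
#include <cuda_runtime.h>

void transform_r2c_inplace(double *host_data, int nx, int ny, int nz)
{
    size_t bytes = (size_t)nx * ny * (nz / 2 + 1) * sizeof(cufftDoubleComplex);

    void *dev_data;
    cudaMalloc(&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL);

    // In-place transform: input and output alias the same device buffer.
    cufftExecD2Z(plan, (cufftDoubleReal *)dev_data,
                 (cufftDoubleComplex *)dev_data);
    cudaThreadSynchronize();

    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(dev_data);
}
```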
The dimensions of our transforms depend on the specific input data and are not easily refactored; a typical real-world example was 100x100x72. However, I’ve been testing with both power-of-two sizes (e.g. 128x128x128) and irregular sizes. In both cases, the speedup of cuFFT over single-threaded fftpack is only about 3-4x on an 8-core Intel 5530 system (Red Hat 4) with a single Tesla C2050 card. (The memory transfer overhead is approximately 10% of the overall runtime.) For a relatively large transform, on the order of 500x500x500, the runtime for the transform alone is approximately 4.5 seconds. (After searching the forum, I added cudaThreadSynchronize() after cufftExec* before taking the stop time, and I’m timing the memory transfers separately, so I’m fairly confident I’m doing this part right.) What is more perplexing is that the speedup does not appear to depend on the transform dimensions: it is very consistent across everything I’ve tried, except for some smaller power-of-two sizes where cuFFT is actually slower.
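Concretely, the timing pattern looks roughly like this (again a sketch, reusing the plan and device buffer from the code above; in the real harness the host-device copies are timed separately):

```cpp
// Wall-clock timing of the transform alone, excluding transfers.
#include <sys/time.h>
#include <cufft.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

double time_one_transform(cufftHandle plan, void *dev_data)
{
    double t0 = wall_seconds();
    cufftExecD2Z(plan, (cufftDoubleReal *)dev_data,
                 (cufftDoubleComplex *)dev_data);
    // cufftExec* returns before the GPU finishes; synchronize so the
    // wall-clock measurement covers the whole transform.
    cudaThreadSynchronize();
    return wall_seconds() - t0;
}
```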
Other details:
- I’m compiling and linking with gcc/g++ 4.1 right now. However, I also tried compiling my C++ sources using nvcc, with no change in outcome.
- Running “ldd” on my library definitely shows it linking to libcufft.so, not the emulation library.
- I ran into the initialization overhead almost immediately, so all of my testing is done by running each transform once, then timing 4-8 successive transforms.
- I’m creating a new plan each time I do a transform. Since the dimensions do not usually change between transforms, I could refactor my code to cache the plans and re-use them (see the sketch after this list), but since my timing indicates that plan creation has negligible overhead, I don’t think this will make a difference.
- I am calling cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL) on each plan. (This didn’t appear to make any difference in my tests, but it seemed safest.)
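For what it’s worth, the cached-plan variant would look something like the following. This is hypothetical code I haven’t actually deployed; it keys plans on the transform dimensions and creates each plan only once:

```cpp
// Hypothetical plan cache keyed on transform dimensions. Each unique
// (nx, ny, nz) gets one plan, created on first use and reused after.
#include <map>
#include <cufft.h>

struct Dims {
    int nx, ny, nz;
    bool operator<(const Dims &o) const {
        if (nx != o.nx) return nx < o.nx;
        if (ny != o.ny) return ny < o.ny;
        return nz < o.nz;
    }
};

static std::map<Dims, cufftHandle> plan_cache;

cufftHandle get_plan(int nx, int ny, int nz)
{
    Dims d = { nx, ny, nz };
    std::map<Dims, cufftHandle>::iterator it = plan_cache.find(d);
    if (it != plan_cache.end())
        return it->second;

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL);
    plan_cache[d] = plan;
    return plan;
}
```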
Is there something else I need to do to get the full performance gain from the Fermi architecture? Or are these results consistent with what others have seen? Any advice would be appreciated. I’ve included simplified sketches of the relevant code above; the real thing is not much more complicated, and as I said, the results appear to be numerically correct in most cases. Thanks.