Concurrent cuFFT Calls

I need to do many small 3D FFTs, where many means on the order of hundreds and small means about 20x20x20. I would expect each of these cuFFT calls to be too small to fully occupy the device, so using streams to enable concurrent execution on Fermi cards should greatly improve performance. All of the required features seem to be available, which is good.

But… when I actually code it up, I don’t see any performance improvement. I wrote a very simple test code to strip out any other factors, and I still can’t get the performance that concurrent execution of cuFFT calls should provide. Specifically, when I increase the workload and the number of streams together, the runtime increases in proportion to the workload, when I would expect it to remain approximately the same (plus some overhead).

I’ve attached the test code I’m using for reference. The usage is: “./cuffttest SIZE NUM REDUND NUM_STREAM”, where SIZE is the size of the 3D transform (e.g. 32), NUM is the number of transforms that are batched together with cufftPlanMany, REDUND is the number of times the forward and inverse transforms are cycled, and NUM_STREAM is the number of streams used. Each stream gets the same workload: NUM transforms of size SIZExSIZExSIZE, repeated REDUND times. Thus doubling the number of streams also doubles the workload. If the streams were executing concurrently, one would expect the runtime to be independent of the value of NUM_STREAM.

./cuffttest 32 10 2000 1
5.1 s
./cuffttest 32 10 2000 2
5.9 s
./cuffttest 32 10 2000 4
7.7 s
./cuffttest 32 10 2000 8
11.2 s
./cuffttest 32 10 2000 16
18.4 s
./cuffttest 32 10 2000 32

This shows that, apart from a constant overhead (~4 s), the runtime increases linearly with the amount of work, even though the number of streams increases along with it.

Any help would be greatly appreciated…

Hi Max,

Have you found any explanation or workaround for this since you posted the question? I have an application where I need to run a few 2D correlations in parallel, and I was wondering how much performance benefit I can expect if I implement the correlation using FFTs.


To the best of my knowledge, this is a real issue with the cuFFT library and still hasn’t been fixed. Fortunately, the cuFFT library has become much faster all-around since my original post, so this has been less of a performance issue for my application.

For problems this small, cuFFT will be efficient only if you can run them together as a batch.