I need to perform many small 3D FFTs, where "many" means on the order of hundreds and "small" means on the order of 20x20x20. I would expect each of these cuFFT calls to be too small to fully occupy the device, so using streams to enable concurrent kernel execution on Fermi cards should greatly improve performance. All of the necessary features appear to be available, which is encouraging.
But when I actually code it up, I see no performance improvement. I wrote a very simple test code to strip out any other factors, and I still can't find the speedup that concurrent execution of cuFFT calls should provide. Specifically, when I simultaneously increase the workload and the number of streams, the runtime increases in proportion to the workload, whereas I would expect it to remain approximately constant (plus some overhead).
I've attached the test code I'm using for reference. The usage is: "./cuffttest SIZE NUM REDUND NUM_STREAM", where SIZE is the edge length of the 3D transform (e.g. 32), NUM is the number of transforms batched together with cufftPlanMany, REDUND is the number of times the forward and inverse transforms are cycled, and NUM_STREAM is the number of streams used. Each stream gets the same workload: NUM transforms of size SIZExSIZExSIZE, repeated REDUND times. Thus by doubling the number of streams, one is also doubling the total workload. If the streams were executing concurrently, one would expect the runtime to be independent of the value of NUM_STREAM.
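For reference, the core of the setup looks roughly like the following. This is a minimal sketch of the approach, not the attached code itself; error checking and data allocation are omitted, and the function name is illustrative. Each stream gets its own batched plan, bound to that stream with cufftSetStream, and all work is issued asynchronously before a single synchronize.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// One batched plan per stream: NUM transforms of SIZE x SIZE x SIZE,
// bound to its stream via cufftSetStream. d_data holds one device
// buffer per stream (in-place C2C transforms).
void run_streams(int size, int num, int redund, int num_stream,
                 cufftComplex **d_data)
{
    cudaStream_t *streams = new cudaStream_t[num_stream];
    cufftHandle  *plans   = new cufftHandle[num_stream];
    int n[3] = { size, size, size };

    for (int i = 0; i < num_stream; ++i) {
        cudaStreamCreate(&streams[i]);
        cufftPlanMany(&plans[i], 3, n,
                      NULL, 1, 0,   // default (contiguous) input layout
                      NULL, 1, 0,   // default (contiguous) output layout
                      CUFFT_C2C, num);
        cufftSetStream(plans[i], streams[i]);
    }

    // Issue all work asynchronously; each stream cycles the
    // forward/inverse pair REDUND times on its own buffer.
    for (int r = 0; r < redund; ++r)
        for (int i = 0; i < num_stream; ++i) {
            cufftExecC2C(plans[i], d_data[i], d_data[i], CUFFT_FORWARD);
            cufftExecC2C(plans[i], d_data[i], d_data[i], CUFFT_INVERSE);
        }
    cudaDeviceSynchronize();

    for (int i = 0; i < num_stream; ++i) {
        cufftDestroy(plans[i]);
        cudaStreamDestroy(streams[i]);
    }
    delete[] streams;
    delete[] plans;
}
```

Since each stream has its own plan, buffer, and stream handle, there should be no dependencies between streams, and I would expect the executions to overlap.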
./cuffttest 32 10 2000 1
./cuffttest 32 10 2000 2
./cuffttest 32 10 2000 4
./cuffttest 32 10 2000 8
./cuffttest 32 10 2000 16
./cuffttest 32 10 2000 32
Timing these runs shows that, aside from a roughly constant overhead (~4 s), the runtime increases linearly with the amount of work, even though the number of streams is increasing as well.
Any help would be greatly appreciated.