I have been benchmarking various FFT libraries, and I keep reading that CUFFT should perform better as I increase the size of my batch, and that by batching the FFTs I should see a marked speedup, but I have yet to see one.
My FFTs are 12288 elements long, and I need to do 540 of them.
If I do them one at a time (foolish implementation) I get about 0.14 ms per FFT.
If I batch all 540 together (correct implementation) the whole run takes about 77 ms, which works out to roughly the same 0.14 ms per FFT.
The problem is that I get essentially no speedup by running them at the same time. Does anyone know why that might be? I can provide the source code if anyone wants it, but it seems to me to be trivial.
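For reference, here is roughly what I mean by the batched version: a single plan created with `cufftPlan1d` and a batch count, executed in one call. This is a minimal sketch, not my actual code; it assumes complex-to-complex transforms, in-place execution, and that `d_data` already holds all 540 signals contiguously on the device (names like `run_batched_fft` are just for illustration):

```c
#include <cufft.h>

/* Sketch: one plan that executes all 540 transforms of length
   12288 in a single cufftExecC2C call. Assumes d_data points to
   a device buffer of 540 * 12288 cufftComplex values, with the
   signals laid out back-to-back. Error checking omitted. */
void run_batched_fft(cufftComplex *d_data)
{
    cufftHandle plan;
    cufftPlan1d(&plan, 12288, CUFFT_C2C, 540);   /* batch = 540 */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```

My timings are for the `cufftExecC2C` call itself, not the host-to-device copies.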
I'm not entirely sure about the batch end of things, but I would recommend zero-padding the FFT out to the next power of two (16384 in this case). I believe the CUFFT guide mentions that there is a specially optimized routine for 1D FFTs whose lengths are powers of 2, and that CUFFT will not make this adjustment for you. I noticed a significant speedup when I did this for a very large convolution routine.