This is more of an observation than a question, but I noticed that the first call into the cuFFT library in an application (in my case a call to cufftPlanMany()) always takes about 210 ms.
This appears to be a one-time cost at initialization, but I wanted to verify that this is the case.
Subsequent calls to cufftPlanMany() take less than a millisecond, which suggests the overhead is paid only once.
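One way to check this kind of behavior is to time the first call separately from the calls that follow. The sketch below is in plain Python and does not call cuFFT. `LazyLibrary`, `make_plan`, and the 50 ms sleep are hypothetical stand-ins for a library that does its setup lazily on first use. In a real application the timed function would wrap the actual cufftPlanMany() call, and part of the first-call cost may come from CUDA context creation rather than from cuFFT itself.

```python
import time


def time_calls(fn, repeats=5):
    """Time the first call to fn separately from the following calls.

    Returns (first_call_seconds, list_of_later_call_seconds).
    """
    t0 = time.perf_counter()
    fn()
    first = time.perf_counter() - t0

    rest = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        rest.append(time.perf_counter() - t0)
    return first, rest


class LazyLibrary:
    """Hypothetical stand-in for a library with one-time lazy initialization."""

    def __init__(self, init_seconds=0.05):
        self._ready = False
        self._init_seconds = init_seconds  # assumed setup cost, not a measured value
        self.init_count = 0

    def make_plan(self):
        if not self._ready:
            time.sleep(self._init_seconds)  # simulated one-time setup work
            self._ready = True
            self.init_count += 1
        return object()  # placeholder for a plan handle


lib = LazyLibrary()
first, rest = time_calls(lib.make_plan)
print(f"first call:        {first * 1e3:.1f} ms")
print(f"later calls (min): {min(rest) * 1e3:.4f} ms")
```

If the first call is consistently far slower than the rest, a throwaway warm-up call before the timed region keeps the setup cost out of the benchmark.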
I was comparing three complex convolution implementations (my custom code, MATLAB, and a cuFFT-based version). For a small data set I was surprised to find that my custom code, which uses my own DFT rather than cuFFT for the forward and inverse transforms, was the fastest. Once I isolated the individual steps, though, I found that cuFFT is indeed faster, and slightly more accurate when compared to the equivalent 64-bit process in MATLAB. Since my implementation is a direct DFT, O(N^2), I would expect cuFFT, O(N log N), to be faster, which is why the initial results surprised me.
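To make the N^2 versus N*log(N) gap concrete, here is a minimal pure-Python sketch of a direct DFT and a textbook recursive radix-2 FFT. This is neither my custom code nor how cuFFT works internally. The function names and the test signal are made up for illustration. It checks that the two transforms agree and prints the rough operation counts for a few sizes.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


# Arbitrary illustrative complex signal of length 64.
signal = [complex(i % 5, -(i % 3)) for i in range(64)]
max_err = max(abs(a - b) for a, b in zip(dft(signal), fft(signal)))
print(f"max |DFT - FFT| = {max_err:.2e}")

# Rough operation counts: N^2 for the direct DFT, N*log2(N) for the FFT.
for n in (64, 1024, 1 << 20):
    print(f"N={n:>8}  N^2={n * n:>14}  N*log2(N)={round(n * math.log2(n)):>10}")
```

At N = 64 the counts differ by only about a factor of ten, so a fixed start-up cost of a couple hundred milliseconds can easily hide the FFT's advantage. At a million points the gap is several orders of magnitude.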
So the takeaway appears to be that one should always profile each step in detail, keep one-time initialization out of the timings, and test with large data sets before drawing a performance conclusion.