Our research group recently acquired a TITAN Xp GPU, and I have been trying to set it up for signal processing in a research project that involves performing FFTs on very large 1-D arrays of input data (typically N = 10^7 to 10^8 elements, or even larger). We’re using double precision because single-precision floats don’t provide enough accuracy for the application (even though double precision is considerably slower than single precision on GeForce GPUs).
Still, the performance is good and gives a satisfactory speedup over the CPU implementation for data sizes N <= 10^7. However, whenever my 1-D data size exceeds a certain threshold (~10^8 elements for double precision), the FFT kernel fails to launch. I checked the cuFFT documentation, and it seems that cuFFT has a size limit on 1-D transforms, namely 64 million elements for single precision and 128 million for double precision.
I’m pretty new to cuFFT and still learning the library (so please forgive me if this sounds like a silly question), but is there any way around the cuFFT size limit? If not, is it possible to perform the FFT in parts? Or would it be possible to map it to, say, a 2-D FFT (which is essentially batches of smaller 1-D FFTs) to get around the dimension limit?
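To make the "batches of smaller 1-D FFTs" idea concrete, here is a minimal sketch of the decomposition I have in mind, which I believe is the standard Cooley-Tukey factorization N = N1 * N2 (sometimes called the four-step FFT). It is plain Python, not cuFFT code: the naive `dft` helper only stands in for one small transform, and the names `dft`, `four_step_fft`, `n1` and `n2` are my own placeholders. On the GPU I assume each of the two loops over small transforms would become one batched cuFFT call, with the twiddle multiplication in a small separate kernel.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in for one small 1-D transform."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT built only from length-n1 and length-n2 DFTs."""
    n = n1 * n2
    assert len(x) == n

    # Step 1: n2 transforms of length n1 over the strided inputs
    # x[n2*j1 + j2], j1 = 0..n1-1 (one batch of small transforms).
    cols = [dft([x[n2 * j1 + j2] for j1 in range(n1)]) for j2 in range(n2)]

    # Step 2: multiply element (j2, k1) by the twiddle exp(-2*pi*i*j2*k1/n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Step 3: n1 transforms of length n2 along j2 (second batch).
    # Step 4: write results transposed, so output index is k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


# Small self-check: compare against the direct transform for N = 4 * 6.
x = [complex(i % 7, -(i % 3)) for i in range(24)]
max_err = max(abs(a - b) for a, b in zip(dft(x), four_step_fft(x, 4, 6)))
```

On this small input the two results agree to within floating-point rounding (`max_err` is tiny), so the math seems to work. What I don't know is whether this is a sensible thing to do on top of cuFFT for N > 10^8, or whether the strided access and extra transpose would eat up the speedup.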
(Also, another general cuFFT question: when benchmarking, my first call to CUDA always takes longer than later calls. Is this normal, and is it due to initialization of the CUDA pipeline? If so, is there some way to pre-initialize, i.e. warm up, the card so that each cuFFT call takes the same time? I’ve tried cudaSetDevice and cudaFree, but unfortunately neither makes any difference.)
Thank you very much for your help!