I am running 1D CUDA FFTs (cufftExecC2C) with large batch sizes (4000 to 5000). When I set up the FFT plan, I noticed that it takes approximately 512 KB of global memory per FFT in the plan, so creating a plan (cufftPlan1d) for a batch size of 4K consumes roughly 2 GB of global memory on the card. This impacts other kernels I would like to run, and I am trying to understand whether there is a way to operate in batch mode with lower global memory use. (My FFT size is 32K elements.)
Batch mode seems to run faster than calling cufftExecC2C N times in a for loop with a batch size of 1.
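One middle ground I am considering is planning for a smaller sub-batch and looping over chunks, reusing a single plan so the work area is sized for the chunk rather than the full batch. A sketch of what I mean (error checking omitted; the chunk size is a placeholder, not a tuned value, and this assumes the batch is contiguous in memory and a multiple of the chunk size):

```cpp
#include <cufft.h>

// Process `batch` C2C transforms of length `nx`, in sub-batches of
// `chunk`, reusing one plan so cuFFT's work area scales with `chunk`
// instead of `batch`.
void fft_in_chunks(cufftComplex *d_data, int nx, int batch, int chunk)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, chunk);  // work area ~ chunk transforms

    for (int b = 0; b < batch; b += chunk) {
        // In-place forward transform on this slice of the batch
        cufftExecC2C(plan,
                     d_data + (size_t)b * nx,
                     d_data + (size_t)b * nx,
                     CUFFT_FORWARD);
    }
    cufftDestroy(plan);
}
```

This should cap the plan's memory at roughly chunk/batch of the original while keeping most of the batching benefit, but I have not measured where the sweet spot between chunk size and throughput is.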
Has anyone on the forum noticed the same overhead? Is there a better way to use cuFFT than the approach above?
I would appreciate any pointers on an optimal way to use cuFFT in this scenario.