cuFFT batch mode overhead question: 1D C2C plan overhead of 512 KB per FFT?

I am running 1D CUDA FFTs (cufftExecC2C) with large batch sizes (4000 to 5000). When I set up the FFT plan, I noticed that it takes approximately 512 KB of global memory per FFT in the plan, so setting up a plan (cufftPlan1d) for a batch size of 4K consumes roughly 2 GB of global memory on the card. This impacts other kernels I would like to run, and I am trying to understand whether there is a way to operate batch mode with lower global memory use. (My FFT size is 32K elements.)

Batch mode seems to run faster than using a for loop and calling cufftExecC2C N times with a batch size of 1.

Has anyone on the forum noticed the same overhead? Is there a better way to use cuFFT compared to the above approach?

I would appreciate any pointers on an optimal way to use cuFFT in these scenarios.

Thank You.

I believe CUFFT does the batched FFTs in parallel, hence the large memory usage. You probably want to reduce the batch size to a few hundred and make repeated calls.
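A minimal sketch of that approach: build one plan for a fixed chunk size and reuse it across the full batch. `NX`, `TOTAL`, and `CHUNK` are illustrative values chosen to match the question, not tuned recommendations, and the code assumes the data is already resident on the device. (Untested sketch; requires a CUDA-capable GPU and the cuFFT library.)

```c
#include <cufft.h>
#include <cuda_runtime.h>

#define NX    32768   /* FFT length (32K elements, as in the question)      */
#define TOTAL 4096    /* total number of transforms                         */
#define CHUNK 256     /* transforms per plan; tune to your memory budget.
                         Assumes TOTAL is divisible by CHUNK; otherwise
                         handle the remainder with a second, smaller plan.  */

void run_chunked(cufftComplex *d_data /* TOTAL * NX elements on device */)
{
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, CHUNK);   /* one reusable plan */

    for (size_t off = 0; off < TOTAL; off += CHUNK) {
        /* In-place forward C2C transform on one chunk of the batch. */
        cufftExecC2C(plan, d_data + off * NX,
                           d_data + off * NX, CUFFT_FORWARD);
    }
    cudaDeviceSynchronize();
    cufftDestroy(plan);
}
```

With CHUNK = 256 the plan's working memory drops to roughly 1/16 of the full-batch figure, at the cost of 16 kernel launches instead of 1, which is still far fewer launches than a batch-size-1 loop.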

Personally I would like to be able to do batched 1D FFTs entirely in-place without needing any additional memory.

Thank you for your response.

Do you know if there is a way to determine the amount of global memory cuFFT reserves when you set up a plan? That way I could size the batch to fit the available memory rather than guessing or working with a fixed batch size. Having this information about a CUFFT plan would be very helpful in my opinion. At the moment I am either underutilizing global memory or running over, depending on the FFT batch size I choose.
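cuFFT does expose this: `cufftEstimate1d()` reports an estimated work-area size for a given configuration without building a plan, so you can scan batch sizes against your memory budget before calling cufftPlan1d. A sketch of such a scan (untested; requires the cuFFT library, and the estimate is an upper bound, not an exact figure):

```c
#include <cufft.h>
#include <stdio.h>

int main(void)
{
    size_t ws = 0;

    /* Probe the estimated workspace for a 32K-point C2C transform
       at increasing batch sizes, without allocating anything. */
    for (int batch = 256; batch <= 4096; batch *= 2) {
        if (cufftEstimate1d(32768, CUFFT_C2C, batch, &ws) == CUFFT_SUCCESS)
            printf("batch %4d: ~%zu MB estimated workspace\n",
                   batch, ws >> 20);
    }
    return 0;
}
```

For an exact size tied to an actual plan, `cufftGetSize1d()` (on a handle from `cufftCreate()`) gives the refined value for that configuration.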