Internal details/limitations of cuFFT, general questions

Good morning, all.

I wrote code which uses cuFFT for 1D operations and it works as it should, but I have some questions about how it works internally. Maybe some of you know the answers?

  • The second argument of cufftPlan1d() is “int nx”, the length of the transform. Is there any reason why it is int, and not unsigned int or size_t?

  • Do you manage to get any transform bigger than 2^28 (268435456) to run? That is the biggest I can get to run successfully on a 1080 Ti. Counting bytes for an R2C transform of that length, the float input array is 1 GB and the cufftComplex output array is 2 GB. When I try a length of 2^29 (536870912), for which the arrays total 6 GB, the operation fails at the allocation step. Is there an internal limit on 1D transforms, or is it something else? I have run another operation that uses almost 10 GB of the 11 GB on the 1080 Ti without issues.

  • Section 2.2.1 of the cuFFT documentation, https://docs.nvidia.com/cuda/pdf/CUFFT_Library.pdf, suggests first creating a plan and THEN allocating the memory, which seems to be the opposite of, for example, FFTW. Do you know of any drawback to doing it the other way around? And what about cleanup: destroy the plan first and then cudaFree the arrays? My program currently allocates memory and then creates the plan, and on cleanup destroys the plan and then deallocates the memory.

  • We don’t launch cuFFT with explicit kernel launch syntax, so how does it do its parallelization? I have the same question about cuRAND, which we also call without a kernel launch configuration.

If any of you know the answers to these, I’d like to hear from you.
Thanks a lot for your time and for the assistance you’ve provided many times.

There are other planning functions that can be used for larger array sizes. Read the docs. For instance:

https://docs.nvidia.com/cuda/cufft/index.html#unique_1980839421

Ordinary CUFFT usage will involve the planning step doing an underlying allocation. I don't think the size of that allocation is published, but you can break out the memory allocation step separately and manage it yourself. In doing so you can get an idea of how much temporary memory CUFFT requires for various operations. Read the docs. If your array sizes total 6GB, I would assume you are running out of memory due to the temporary allocations that CUFFT makes/requires.
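To illustrate, here is a sketch of breaking out the allocation using the cuFFT extensible-plan API (cufftCreate / cufftSetAutoAllocation / cufftMakePlan1d / cufftSetWorkArea); error checking is omitted for brevity, and it needs a CUDA-capable GPU to run:

```c
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cufftHandle plan;
    size_t workSize = 0;

    cufftCreate(&plan);
    cufftSetAutoAllocation(plan, 0);            /* we will supply the work area */
    cufftMakePlan1d(plan, 1 << 28, CUFFT_R2C, 1, &workSize);
    printf("cuFFT temporary work area: %zu bytes\n", workSize);

    void *work = NULL;
    cudaMalloc(&work, workSize);                /* allocate it ourselves ... */
    cufftSetWorkArea(plan, work);               /* ... and hand it to the plan */

    /* ... allocate input/output, call cufftExecR2C(plan, in, out), etc. ... */

    cufftDestroy(plan);
    cudaFree(work);
    return 0;
}
```

Printing workSize this way is one direct answer to the "how much temporary memory does it need" question.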

If you are managing the memory allocation yourself, it should not matter whether you do the planning before or after the memory allocation, as long as you allocate sufficient memory.

CUFFT calls are calls to functions in a C library (the cufft library, i.e. libcufft). The library functions generally make kernel calls, and may do other CUDA runtime activity as well. You can use a profiler to get an idea of what is happening under the hood.
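For example, running your binary under nvprof (the command-line profiler of that CUDA era) will list the kernels that libcufft launches on your behalf; your executable name here is a placeholder:

```shell
# Show every kernel launch and memcpy the cuFFT/cuRAND calls trigger
nvprof --print-gpu-trace ./my_fft_app
```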

Thanks for these clarifications, txbob.

Following the document you linked, it looks like I can use the cufftMakePlanMany64() function, passing NULL for inembed and onembed so it behaves like an ordinary FFT call, except that it handles the larger 64-bit size types.
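A minimal sketch of that call for a single 1D R2C transform, assuming NULL embed arrays, unit stride, and a plan created beforehand with cufftCreate() (the helper name is my own, and error checks are omitted):

```c
#include <cufft.h>

/* Build a 64-bit 1D R2C plan of length n; writes the work-area size. */
cufftResult make_big_r2c_plan(cufftHandle plan, long long n, size_t *workSize)
{
    long long dims[1] = { n };
    /* NULL inembed/onembed => contiguous layout, like a basic plan,
       but with 64-bit sizes so lengths beyond INT_MAX are accepted. */
    return cufftMakePlanMany64(plan, 1, dims,
                               NULL, 1, n,          /* inembed, istride, idist */
                               NULL, 1, n / 2 + 1,  /* onembed, ostride, odist */
                               CUFFT_R2C, 1, workSize);
}
```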

As for the other answers, thanks again. They give me enough to dig a bit further, such as checking the extra return values and running the profiler.