cuFFT allocating and deallocating memory at every step

It looks like cuFFT is allocating and deallocating device memory every time cufftExecC2C is called. I read the documentation and didn't find any explanation for why this happens. I'm using CUDA 11.8 with callbacks enabled.

For some reason, this doesn't happen when cufftExecC2C is called in in-place mode (with the input and output pointers being the same). This behavior is reproducible with this NVIDIA code-sample. A minimal sketch of the two modes I'm comparing is below.
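(The transform size and plan parameters here are illustrative, not the exact ones from the code sample.)

```cpp
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;            // illustrative size, not the sample's
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    cufftComplex *a, *b;
    cudaMalloc(&a, sizeof(cufftComplex) * N);
    cudaMalloc(&b, sizeof(cufftComplex) * N);

    // Out-of-place: distinct input/output pointers.
    // This is the case where I see an alloc/free pair on every call.
    cufftExecC2C(plan, a, b, CUFFT_FORWARD);

    // In-place: same pointer for input and output.
    // No per-call allocation shows up in this case.
    cufftExecC2C(plan, a, a, CUFFT_FORWARD);

    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```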

P.S.: Another point I can't understand is why the callbacks are being called from a separate kernel. Isn't the whole point of callbacks to load data from memory only once?

The cuFFT docs describe in several places how memory is used. You may wish to read up on workspaces and how to allocate your own workspace. In particular, the interaction between workspaces and the output buffer, which cuFFT may use for temporary scratch storage, is called out here.
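As a sketch (the size and names here are illustrative), managing the workspace yourself looks roughly like this:

```cpp
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;                 // illustrative size
    cufftHandle plan;
    cufftCreate(&plan);
    cufftSetAutoAllocation(plan, 0);       // tell cuFFT not to allocate its own workspace

    size_t workSize = 0;
    cufftMakePlan1d(plan, N, CUFFT_C2C, 1, &workSize);

    void *workArea = nullptr;
    cudaMalloc(&workArea, workSize);       // allocate once, up front
    cufftSetWorkArea(plan, workArea);      // hand the buffer to the plan

    // Subsequent executions reuse workArea rather than allocating per call;
    // note the docs' caveat that cuFFT may still use the output buffer as
    // scratch in some out-of-place cases.
    // cufftExecC2C(plan, in, out, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(workArea);
    return 0;
}
```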

cuFFT transforms don't necessarily imply a single kernel; many transforms involve a sequence of kernel launches. It is therefore hopefully self-evident that the load (input) callback and the store (output) callback will not necessarily be called by the same kernel. cuFFT is not instituting a separate kernel merely to call a callback. Rather, based on the work you requested, it has chosen to implement a sequence of kernels and is calling the callbacks from the appropriate places.
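For context, attaching a load callback follows the pattern below. The callback body and names like scale_loader are hypothetical; callbacks require relocatable device code (nvcc -dc) and, on CUDA 11.x, linking against the static cuFFT library.

```cpp
#include <cufft.h>
#include <cufftXt.h>
#include <cuda_runtime.h>

// Hypothetical load callback: scales each input element before the FFT.
__device__ cufftComplex scale_loader(void *dataIn, size_t offset,
                                     void *callerInfo, void *sharedPtr) {
    cufftComplex v = ((cufftComplex *)dataIn)[offset];
    v.x *= 0.5f;   // arbitrary per-element preprocessing
    v.y *= 0.5f;
    return v;
}
__device__ cufftCallbackLoadC d_loader = scale_loader;

int main() {
    cufftHandle plan;
    cufftPlan1d(&plan, 1 << 20, CUFFT_C2C, 1);   // illustrative size

    // Fetch the device function pointer on the host and attach it to the plan.
    cufftCallbackLoadC h_loader;
    cudaMemcpyFromSymbol(&h_loader, d_loader, sizeof(h_loader));
    cufftXtSetCallback(plan, (void **)&h_loader, CUFFT_CB_LD_COMPLEX, nullptr);

    // The callback now runs when cuFFT loads input elements:
    // cufftExecC2C(plan, in, out, CUFFT_FORWARD);

    cufftDestroy(plan);
    return 0;
}
```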

Hi, thanks for your reply!

I made some progress in understanding the behavior of cuFFT. These problems were due to recent changes introduced in CUDA 11.8. I tested the same code on CUDA 11.2 and everything ran as I expected: no memory allocation during plan execution (cufftExecC2C) calls and a single kernel launch. Performance was also much better.

I’m now waiting for the next version of cuFFT to address those regressions.

Yes, I know; I took that into consideration. In this test case, cuFFT launches only a single kernel when callbacks are disabled.

When a load callback is attached to the plan, cuFFT launches a separate kernel called "separate_callback_loader". In both cases the actual FFT kernel has an equivalent runtime, which implies that cuFFT is launching a separate kernel for the callback load operation, negating the performance uplift from using callbacks. Note that this only happens in out-of-place executions, as the CUDA 11.8 regression notice called out.

