cuFFT load callback causes cufftExecC2C to allocate memory?

I’ve configured a batched FFT that uses a load callback. The load callback is pretty simple: it applies a window and zero-pads. However, when I execute cufftExecC2C, it performs a cudaMalloc and a cudaFree. The cudaFree takes longer than the FFT itself, so it delays my next kernel. This only happens when I set a load callback. If I remove the callback, the memory allocation does not occur. It also does not happen when I use only a store callback, and I cannot reproduce it with the simpleCUFFT_callback sample code. Under what circumstances does a load callback cause this memory allocation?
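
For readers unfamiliar with this kind of callback, here is a minimal host-side C++ sketch of the per-element logic a windowing, zero-padding load might perform. It is an illustration under assumptions, not the callback from the post: the LoadParams struct, its fields, the windowed_padded_load name, and the assumption that the input is stored unpadded (n_valid samples per batch) while the FFT indexes a padded layout (n_fft samples per batch) are all hypothetical. In a real cuFFT load callback the same logic would sit in a __device__ function matching the cufftCallbackLoadC signature, with the parameters passed through the caller-info pointer.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical layout description (not from the original post).
struct LoadParams {
    std::size_t n_valid;  // samples actually stored per batch (before padding)
    std::size_t n_fft;    // FFT length per batch (after zero padding)
    const float* window;  // n_valid real window coefficients
};

// Returns the value the FFT should see at flat element index `offset`
// of the padded, batched layout. Assumes n_valid <= n_fft.
inline std::complex<float> windowed_padded_load(
    const std::complex<float>* data_in, std::size_t offset, const LoadParams& p)
{
    const std::size_t batch = offset / p.n_fft;  // which transform in the batch
    const std::size_t i = offset % p.n_fft;      // position inside that transform

    // Positions past the stored samples are the zero padding.
    if (i >= p.n_valid) {
        return std::complex<float>(0.0f, 0.0f);
    }

    // Read from the unpadded input and apply the window coefficient.
    return data_in[batch * p.n_valid + i] * p.window[i];
}
```

Because the padded samples are synthesized inside the load, the input buffer can stay at its unpadded size, and the callback reads from a different index than the one the FFT requests.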

Hi,

Could you match the in-place/out-of-place type of your transform to the one used in simpleCUFFT_callback?

We recently added an LTO version of callbacks in an EA program. It does not rely on in-place/out-of-place behavior and offers better performance (especially for non-power-of-2 FFTs): NVIDIA cuFFT LTO EA Preview. We’re looking for feedback on the usability of the LTO API.

Best regards,
Łukasz Ligowski

I observe the very same thing. In my case, a load callback on out-of-place Z2D and D2Z transforms causes a cudaMalloc and a cudaFree, leading to almost 2x the runtime. (4090, CUDA 12.1, Linux; typical FFT size is 64 x 64 x 64 with a batch of 20.)
As I recall, there was no such issue on CUDA 11.x.

Could you match the in-place/out-of-place type of your transform to the one used in simpleCUFFT_callback?

I am using the store callback to shift, so I cannot do the transform in place.

We recently added an LTO version of callbacks in an EA program. It does not rely on in-place/out-of-place behavior and offers better performance (especially for non-power-of-2 FFTs): NVIDIA cuFFT LTO EA Preview. We’re looking for feedback on the usability of the LTO API.

This sounds like what I need, but unfortunately preview code is a non-starter. Do you know when this is expected to be released?