Performance degradation from CUDA 11.7 to 12.2 and newer with cuFFT

Using cuFFT with "legacy" callbacks (the only kind of callback supported before CUDA 12.6), we observe a performance decrease when moving from CUDA 11.7 to CUDA 12.2 or 12.4 (12.6 is not available on the HPC platform in question).

The decrease is approximately 20%. The application, whose runtime is dominated by FFTs, now spends the majority of its time in cuMemFree_v2 (as reported by Nsight Systems). Where Nsight Systems gives a backtrace, that call originates inside cufftExecC2R or cufftExecR2C. So far this has only been observed for FFTs with callbacks attached.

Is this known behaviour, and is there a workaround? Ideally cuFFT would allocate all the memory it needs at plan creation time, which is what the documentation heavily implies.

You may get assistance on the GPU Accelerated Libraries forum or, failing that, by filing a bug.

Thanks. Cross-posted to GPU Accelerated Libraries for the moment. I’ll file a proper bug when I can reproduce this without our full HPC code.