cufftPlan creation deadly slow on CUDA 11+

Creating any cuFFT plan (through functions such as cufftPlanMany or cufftPlan2d) has become very slow in the latest versions of CUDA, taking about 0.15s. This is fairly significant when my old i7-8700K does the same FFT in 0.0013s.
Our workflow typically involves 2D and 3D FFTs with sizes of about 256 and roughly 1024 batches.

Unfortunately, both the batch size and the matrix size change during our workload, so simply planning once is not really an option. Before CUDA 11, the FFTs and their planning were not a significant bottleneck.

The issue has been confirmed on Fedora 33 and 34 with CUDA 11.3 and 11.4, using the official NVIDIA repositories. The GPUs tested were a GeForce RTX 3090 and a GTX 1060.

You may wish to file a bug.

This performance is expected with the latest versions of cuFFT and is due to some underlying changes in the CUDA Toolkit.
We hope to improve this performance in the next major release of the CUDA Toolkit.

If you investigate with Nsight Systems, you should notice that the bottleneck is different between 11.3 and 11.4.

Thanks for your reply. We will try to work around the issue for now.

I also published the test code, in case other people find it useful.

I’ve submitted a PR with a workaround. The issue is caused by repeated cuModuleLoadData calls, which happen on first plan creation. By calling cufftDestroy inside the for loop, you are forcing two new cuModuleLoadData calls.

Simply store all cuFFT plans in a vector and destroy them at the end of your application.
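Since the batch size and matrix size change during the workload, one way to apply this is a small cache keyed by shape, so each distinct shape is planned once and nothing is destroyed until the very end. The following is only a rough sketch, not the code from the PR: PlanCache is a hypothetical helper (not part of cuFFT), the key covers only the 2D case (nx, ny, batch), and the create/destroy callbacks are passed in so the class itself compiles without the CUDA headers. Error checking is omitted.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical helper (not part of cuFFT): keeps one plan handle per
// (nx, ny, batch) shape and destroys them only when the cache is destroyed.
template <typename Handle>
class PlanCache {
public:
    using Key = std::tuple<int, int, int>;  // nx, ny, batch
    using Create = std::function<Handle(int, int, int)>;
    using Destroy = std::function<void(Handle)>;

    PlanCache(Create create, Destroy destroy)
        : create_(std::move(create)), destroy_(std::move(destroy)) {}

    // Non-copyable, so every handle is destroyed exactly once.
    PlanCache(const PlanCache&) = delete;
    PlanCache& operator=(const PlanCache&) = delete;

    // All plans are released here, i.e. at the end of the cache's lifetime.
    ~PlanCache() {
        for (auto& entry : plans_) destroy_(entry.second);
    }

    // Returns the cached plan for this shape, creating it on first use.
    Handle get(int nx, int ny, int batch) {
        Key key{nx, ny, batch};
        auto it = plans_.find(key);
        if (it == plans_.end()) {
            it = plans_.emplace(key, create_(nx, ny, batch)).first;
        }
        return it->second;
    }

    std::size_t size() const { return plans_.size(); }

private:
    Create create_;
    Destroy destroy_;
    std::map<Key, Handle> plans_;
};

// Possible wiring with cuFFT (assumes <cufft.h>; CUFFT_C2C is just an example type):
//   PlanCache<cufftHandle> cache(
//       [](int nx, int ny, int batch) {
//           cufftHandle plan;
//           int n[2] = {nx, ny};
//           cufftPlanMany(&plan, 2, n, nullptr, 1, 0, nullptr, 1, 0, CUFFT_C2C, batch);
//           return plan;
//       },
//       [](cufftHandle plan) { cufftDestroy(plan); });
//   cufftHandle plan = cache.get(256, 256, 1024);
```

Keep in mind that every cached plan keeps its work area allocated, so this trades GPU memory for planning time. If the workload cycles through many distinct shapes, some eviction policy would probably be needed.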

Before fix

Hello, world!
Time per FFT 0.158185s
Time per FFT 0.0456382s

After workaround

Hello, world!
Time per FFT 0.0261476s
Time per FFT 0.0485883s

This is possibly related.