Creating any cuFFT plan (through functions such as cufftPlanMany or cufftPlan2d) has become very slow in the latest versions of CUDA, taking about 0.15 s. This is fairly significant given that my old i7-8700K computes the same FFT in 0.0013 s.
Our workflow typically involves 2D and 3D FFTs with sizes of about 256 and roughly 1024 batches.
Unfortunately, both the batch size and the matrix size change during our workload, so creating a single plan up front and reusing it is not really an option. Before CUDA 11, neither the FFTs nor their planning were a significant bottleneck.
The issue has been confirmed on Fedora 33 and 34 with CUDA 11.3 and 11.4, installed from the official NVIDIA repositories. The GPUs tested were a GeForce RTX 3090 and a GTX 1060.
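
For anyone who wants to reproduce this, here is a minimal sketch of one way the plan creation time could be measured. It is not the exact code from our application: `seconds_for` and `time_plan_many_2d` are hypothetical helper names, and the complex-to-complex transform type and the default contiguous data layout are assumptions, since our real workload uses varying configurations.

```cpp
#include <chrono>
#include <utility>

// The cuFFT part is only compiled when the CUDA toolkit headers are found,
// so the timing helper itself builds on any machine.
#if defined(__has_include)
#  if __has_include(<cufft.h>)
#    include <cufft.h>
#    define HAVE_CUFFT 1
#  endif
#endif

// Wall-clock seconds taken by a single call of f().
template <typename F>
double seconds_for(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

#ifdef HAVE_CUFFT
// Hypothetical helper: times the creation of one batched 2D plan.
// Assumes a complex-to-complex transform (CUFFT_C2C).
// Returns a negative value if cufftPlanMany reports an error.
double time_plan_many_2d(int nx, int ny, int batch) {
    int n[2] = {nx, ny};
    cufftHandle plan;
    cufftResult status = CUFFT_SUCCESS;
    const double elapsed = seconds_for([&] {
        // Passing NULL for inembed/onembed selects the default contiguous
        // layout; the stride and distance arguments are then ignored.
        status = cufftPlanMany(&plan, 2, n,
                               nullptr, 1, nx * ny,
                               nullptr, 1, nx * ny,
                               CUFFT_C2C, batch);
    });
    if (status != CUFFT_SUCCESS) return -1.0;
    cufftDestroy(plan);  // release the plan so repeated calls do not leak
    return elapsed;
}
#endif
```

Calling `time_plan_many_2d(256, 256, 1024)` several times in the same process should help separate one-time CUDA and cuFFT initialization from the per-plan cost. The sketch needs the CUDA include path at build time and has to be linked against cuFFT (`-lcufft`).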