cufftPlan creation extremely slow on CUDA 11+

Creating any cuFFT plan (through functions such as cufftPlanMany or cufftPlan2d) has become very slow in the latest versions of CUDA, taking roughly 0.15 s per plan. This is fairly significant when my old i7-8700K computes the same FFT in 0.0013 s.
Our workflow typically involves 2D and 3D FFTs with sizes of about 256 and roughly 1024 batches.

Unfortunately, both the batch size and the matrix size change during our workload, so simply planning once is not really an option. Before CUDA 11, the FFTs and their planning were not a significant bottleneck.

The issue has been confirmed on Fedora 33 and 34 with CUDA 11.3 and 11.4, installed from the official NVIDIA repositories. The GPUs tested were a GeForce RTX 3090 as well as a GTX 1060.

You may wish to file a bug.

This performance is expected with the latest versions of cuFFT and is due to underlying changes in the CUDA Toolkit.
We hope to improve this performance in the next major release of the CUDA Toolkit.

If you investigate with Nsight Systems, you should notice that the bottleneck is different between 11.3 and 11.4.
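For example, a capture along the lines of "nsys profile -o cufft_plan_trace ./your_benchmark" (exact flags depend on your Nsight Systems version; the output and binary names here are just placeholders) is enough to see where plan creation spends its time.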

Thanks for your reply. We will try to work around the issue for now.

I also published the test code, in case other people find it useful.
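For anyone who just wants the gist, a minimal sketch of this kind of plan-creation benchmark looks roughly like the following (the sizes and structure are illustrative, not the actual published code; link against -lcufft):

// Times repeated cuFFT plan creation; destroying the plan inside the loop
// is what exposes the slow path discussed in this thread.
#include <cufft.h>
#include <chrono>
#include <cstdio>

int main() {
    const int nx = 256, ny = 256, batch = 1024;
    int n[2] = {nx, ny};

    for (int i = 0; i < 2; ++i) {
        auto t0 = std::chrono::steady_clock::now();

        cufftHandle plan;
        // 2D C2C plan over `batch` transforms, similar to the workload described above.
        cufftPlanMany(&plan, 2, n,
                      nullptr, 1, nx * ny,   // default (contiguous) input layout
                      nullptr, 1, nx * ny,   // default (contiguous) output layout
                      CUFFT_C2C, batch);

        auto t1 = std::chrono::steady_clock::now();
        std::printf("Plan creation %d took %f s\n",
                    i, std::chrono::duration<double>(t1 - t0).count());

        cufftDestroy(plan);  // destroying here forces the expensive path on the next iteration
    }
    return 0;
}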

I’ve submitted a PR with a workaround. The issue is caused by repeated cuModuleLoadData calls, which happen on first plan creation. By calling cufftDestroy inside the for loop you are forcing two new cuModuleLoadData calls on every iteration.

Simply store all cuFFT plans in a vector and destroy them at the end of your application.
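In other words, something along these lines (a sketch of the idea, not the exact PR; the helper names are made up):

#include <cufft.h>
#include <vector>

// Keep every plan alive for the lifetime of the application.
static std::vector<cufftHandle> g_plans;

cufftHandle makePlan2d(int nx, int ny) {           // hypothetical helper
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);
    g_plans.push_back(plan);                       // store instead of destroying per iteration
    return plan;
}

void destroyAllPlans() {                           // call once, at application exit
    for (cufftHandle p : g_plans) cufftDestroy(p);
    g_plans.clear();
}

If the same sizes recur in your workload, keying the cache by (nx, ny, batch) instead of using a flat vector also lets you reuse plans rather than re-creating them.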

Before fix

Hello, world!
Time per FFT 0.158185s
Time per FFT 0.0456382s

After workaround

Hello, world!
Time per FFT 0.0261476s
Time per FFT 0.0485883s

This is possibly related.

But even if so, it would be unacceptable because you need to destroy a plan before you can make a new one. When you do this often enough, it cuts into your performance; in my case significantly: I measure around 350 ms on version 11.4 versus only 4.5 ms on version 11.2 for any call to cufftPlanMany. Whatever the cause, there is something horribly wrong. Is there already a ticket open for this? I have not updated to the latest 11.5 yet; perhaps it has already been fixed there?

This is being tracked, but there is no ETA on fix.

unacceptable because you need to destroy a plan before you can make a new one

This is not true. You can create an endless number of plans before you need to destroy the first one (assuming you have enough host memory).

I still see this happening on our A100 server running CentOS Linux release 7.9.2009 when executing code that uses CUDA 11.7. We also have a system running Ubuntu 20.04 LTS with an RTX 5000 and CUDA 11.7 where we don’t see this issue (using the same code).

So is this something specific to CentOS or the A100? Or did we somehow get lucky with the RTX 5000 system?

This issue is strictly related to the cuFFT library, regardless of the system.

I meant is it related to using the cuFFT library on CentOS and/or the A100, as that is the only system we have seen it on.

Hi all,
I had the same problem: with every 11.x CUDA Toolkit version, FFT computation was incredibly slow (GPU usage around 1-3% on an A100). Today I installed the new 12.0.1 version and performance improved a lot (GPU usage around 35%).

NVIDIA took almost 3 years and 25 (!!) 11.x releases to solve this bug… :-/

Bye, S.