I previously used CUDA 11.1 with VS2019, and the performance of CUFFT basically met my requirements. A while ago I updated to CUDA 12.6 and VS2022, and I found that plan creation has become significantly slower. I cannot create all plans up front when the program starts, because the image size depends on dynamic input parameters. I have the following questions:
Why did the plan creation performance of CUFFT change?
Can we create a plan for the largest image size and reuse it for images of all sizes?
Can you provide some actual numbers, so discussion does not have to take place in a vacuum without any data? Please confirm that the “before” and “after” numbers you will provide were generated with the exact same physical system, with only CUDA and MSVS updates applied, and utilize FFTs with the exact same configuration settings.
It is unfortunate that both MSVS and CUDA were updated at the same time, which means this change is not really a controlled experiment, where only one variable is changed in any one step.
I have read this sentence multiple times now and it is not clear (to me at least) how this observation ties in with the decreased speed of plan generation mentioned earlier. Can you clarify how these two points are related?
Generally speaking, CUFFT plan generation is an activity that takes place on the host system. Changes to host hardware, changes to host software, and load on the host system can therefore impact the performance of plan generation. The first order of business is to track down which change actually caused a negative impact on CUFFT plan generation time. While it seems plausible that this was caused by an update of the CUDA software stack, there is not enough information provided here to refute or confirm this hypothesis.
I am saying the impact of CUDA is plausible because the CUFFT plan generation is driven by heuristics from what I understand, and newer versions of CUFFT may use more complex heuristics or require more input data than older versions, making plan generation slower. If you can demonstrate significant slowdown in an apples-to-apples comparison (i.e. with fixed FFT configuration parameters) between two CUDA versions, you would probably want to file a performance bug with NVIDIA.
If there is a slowdown in plan generation due to more complex heuristics being used, it may be unavoidable. In that case you would want to use a faster host system, in particular paying attention to baseline single-thread performance. My long-standing recommendation for systems designated to run CUDA-accelerated applications is to select host systems whose CPUs operate with >= 3.5 GHz baseline frequency.
Questions about CUFFT usage belong on this forum. This possibly related topic discusses that the CUFFT team is/was aware of issues and changes in CUFFT plan creation. It seems evident from the description there that CUFFT plan creation (now) may also cause module loading. This is consistent with a general trend in CUDA towards lazy loading, which has a variety of reasons supporting it, but is not without some associated issues.
Certainly plan reuse is a good option. Also, as described there, cufftDestroy can cause a situation where module reloading takes place at the next plan creation, therefore as suggested there, another option to consider is storing all your plans in a vector and not destroying them until performance is no longer a concern. Obviously that will have some limits as well, from a workaround perspective. You would not want to store a vector of trillions of plans.
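The "keep the plans around" idea could be sketched roughly as follows. This is only an illustration, not code from the thread: `PlanCache` and its `create`/`destroy` callbacks are hypothetical names, and the cache is deliberately library-agnostic so it compiles without the CUDA toolkit. With CUFFT, `Plan` would presumably be `cufftHandle`, `create` would wrap a call such as `cufftPlan2d`, and `destroy` would wrap `cufftDestroy`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical helper: keeps one plan per (rows, cols) and does not destroy
// any plan until the cache itself is destroyed.
template <typename Plan>
class PlanCache {
public:
    using Key = std::pair<int, int>;  // (rows, cols)

    PlanCache(std::function<Plan(int, int)> create,
              std::function<void(Plan)> destroy)
        : create_(std::move(create)), destroy_(std::move(destroy)) {}

    // The cache owns the plans, so copying it is disabled.
    PlanCache(const PlanCache&) = delete;
    PlanCache& operator=(const PlanCache&) = delete;

    ~PlanCache() {
        for (auto& entry : plans_) destroy_(entry.second);
    }

    // Returns the cached plan for this size, creating it on first use only.
    Plan get(int rows, int cols) {
        assert(rows > 0 && cols > 0);
        const Key key{rows, cols};
        auto it = plans_.find(key);
        if (it == plans_.end()) {
            it = plans_.emplace(key, create_(rows, cols)).first;
        }
        return it->second;
    }

    std::size_t size() const { return plans_.size(); }

private:
    std::function<Plan(int, int)> create_;
    std::function<void(Plan)> destroy_;
    std::map<Key, Plan> plans_;
};
```

In a real application you would likely cap the number of cached plans, since each plan can also hold device memory for its work area.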
The plan expects a certain size. You can reuse a plan on a smaller size if the data set is padded to the size the plan expects. Padding of FFT data is a common scenario (in my view) but may not fit your needs. It will require you to pad the data and it will also affect the output numerically.
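A minimal host-side sketch of the padding step, assuming a row-major complex image that is zero-padded into the top-left corner of a buffer of the size the plan expects. The function name `padToPlanSize` is made up for illustration, and `std::complex<float>` is used here only as a stand-in for the interleaved complex type the FFT would consume.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Copies a rows x cols row-major image into the top-left corner of a
// zero-filled planRows x planCols buffer (hypothetical helper).
std::vector<std::complex<float>> padToPlanSize(
    const std::vector<std::complex<float>>& image,
    std::size_t rows, std::size_t cols,
    std::size_t planRows, std::size_t planCols) {
    assert(image.size() == rows * cols);
    assert(rows <= planRows && cols <= planCols);

    // Elements are value-initialized, i.e. every entry starts as (0, 0).
    std::vector<std::complex<float>> padded(planRows * planCols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            padded[r * planCols + c] = image[r * cols + c];
        }
    }
    return padded;
}
```

Note that the transform of the padded buffer has planRows x planCols frequency bins, so its output samples the spectrum on a different (finer) frequency grid than a transform of the original size would. That is the numerical difference mentioned above.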
Unless there is some objection, I’ll plan to move this topic over to the other forum I referenced, shortly.
What do you want to do with the Fourier transformed data? Is it used for convolution or correlation, where the FFT size is flexible, or do you need an FFT of a specific size?
Do you have a small set of sizes or is any size possible?
As those are images (at least or exactly 2D), is the number of rows or columns constant?
Are there parameters to create a possibly slightly slower, but simpler plan in cuFFT?
I used VS2019 and tested CUDA versions 11.1 and 12.6. Creating a CUFFT plan takes about 0.002 ms with CUDA 11.1, but about 1.1 ms with CUDA 12.6. The driver is 580.97 and the GPU is an RTX A4000.
Using your benchmark code, I see very similar CUFFT “create” times (a bit over 1 millisecond) using CUDA 12.8.
You might want to follow up on the point raised by Robert Crovella above: CUFFT apparently now defaults to lazy module loading, such that the first plan creation now also includes the time for that. To confirm or refute that this explains your observations, you would want to change your benchmark from a create-destroy cycle to a create, create, ..., create, destroy, ... configuration, and measure the time it takes for each create step separately. If lazy module loading is the issue, we would expect the first plan creation to be slow and all following plan creations to be very fast.
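A rough sketch of how each create step could be timed separately. This is not the benchmark used in this thread: `timeEachCreate` and its `createPlan` callback are hypothetical, and the helper is kept library-agnostic so it compiles without CUDA. With CUFFT, the callback would presumably call something like `cufftPlan2d` on a fresh handle and store that handle in a vector, and all handles would be released with `cufftDestroy` only after the loop has finished. A host timer is used on the assumption that plan creation is a host-side activity.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Calls createPlan() `count` times with nothing destroyed in between and
// returns the wall-clock duration of each call in milliseconds.
std::vector<double> timeEachCreate(int count,
                                   const std::function<void()>& createPlan) {
    assert(count > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        const auto start = std::chrono::steady_clock::now();
        createPlan();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

Printing the returned values per iteration makes it easy to see whether only the first create is slow or whether every create carries the extra cost.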
I do not know (1) whether there is an environment variable that allows a user to configure CUFFT module loading behavior analogous to how this can be controlled for the CUDA runtime (2) whether there is an innocuous CUFFT function one can call to trigger lazy module loading at a time that is more convenient to the programmer (for the CUDA runtime, cudaFree(0) used to do that; now it is supposedly a call to cudaSetDevice()).
If the issue is verified to be module loading time, my expectation is that this cost is always there, it’s just that the time is now accounted for at a different place in the overall software execution flow.
If you are unable to resolve the issue to your satisfaction, you may want to file a performance-regression bug with NVIDIA to have them sort it out.
I removed the cufftDestroy call so that the benchmark only creates CUFFT plans. The results are as follows.
vs2019+cuda11.1: 0.16ms
vs2019+cuda12.6: 1.4ms
By the way, I did not set the lazy-loading environment variable during testing, so it defaults to EAGER. So I think this has nothing to do with lazy loading.
To summarize: CUFFT plan creation (not just the first creation) may have slowed down due to a change in the heuristics used to create the plan? I can report a bug to NVIDIA and ask them to help resolve it.
Unless I misunderstand the data posted by Robert Crovella above, it indicates that CUFFT got faster for both eager and non-eager initialization between CUDA 11.1 and CUDA 12.6.
If so, that would contradict your own observations. In any event you are always free to submit a bug report to NVIDIA. The first step in the handling of bug reports is the attempt by NVIDIA’s engineers to reproduce the reported behavior in house. From what I have seen, this may take multiple iterations depending on the supporting materials included with the bug report.
I retested the performance with VS2019 and CUDA 11.1, CUDA 12.6, and CUDA 12.8. The OS is Windows 10 x64, the CPU is an Intel 4215R, and the GPU is an A4000. Below is the test data:
VS2019+cuda11.1:
iteration: 0 milliseconds: 530.928
iteration: 1 milliseconds: 0.188
iteration: 2 milliseconds: 0.18
iteration: 3 milliseconds: 0.157
iteration: 4 milliseconds: 0.171
iteration: 5 milliseconds: 0.148
iteration: 6 milliseconds: 0.169
iteration: 7 milliseconds: 0.161
iteration: 8 milliseconds: 0.144
iteration: 9 milliseconds: 0.146
VS2019+cuda12.6:
iteration: 0 milliseconds: 190.786
iteration: 1 milliseconds: 1.211
iteration: 2 milliseconds: 1.128
iteration: 3 milliseconds: 1.13
iteration: 4 milliseconds: 1.118
iteration: 5 milliseconds: 1.135
iteration: 6 milliseconds: 1.184
iteration: 7 milliseconds: 1.062
iteration: 8 milliseconds: 1.079
iteration: 9 milliseconds: 1.115
VS2019+cuda12.8:
iteration: 0 milliseconds: 185.072
iteration: 1 milliseconds: 1.143
iteration: 2 milliseconds: 1.088
iteration: 3 milliseconds: 1.069
iteration: 4 milliseconds: 1.087
iteration: 5 milliseconds: 1.075
iteration: 6 milliseconds: 1.069
iteration: 7 milliseconds: 1.127
iteration: 8 milliseconds: 1.022
iteration: 9 milliseconds: 1.024
From the test data above, it can be seen that the performance of cuda12.6 and cuda12.8 is similar, but significantly worse than that of cuda11.1. This is the test code