Is it normal for cuFFT C2C to be slower than cuFFT C2R? I’m trying to do image upsampling with FFT. The R2C forward transform takes about 0.5 ms, but the C2C inverse after upsampling (doubled width and height) takes 7.7 ms. I know the larger size means more work, but that seems like too big an increase?
I further tested just doing an FFT on the input image and transforming it back. C2C seems to be approximately 2x slower than R2C and C2R; I guess that’s because R2C and C2R exploit conjugate symmetry? But I still don’t understand why it’s so much slower in my upsampling code (naively, maybe x4 for the size increase and x2 for not using conjugate symmetry, so it should be around 4 ms, right?). Below is part of my code:
result = cufftPlan2d(&inverseImagePlan, upsamplingHeight, upsamplingWidth, CUFFT_C2C);
if (result != CUFFT_SUCCESS)
{
    std::cerr << "Failed to create inverse FFT plan with code: " << result << std::endl;
    return EXIT_FAILURE;
}

result = cufftPlan2d(&forwardImagePlan, imageHeight, imageWidth, CUFFT_R2C);
if (result != CUFFT_SUCCESS)
{
    std::cerr << "Failed to create forward FFT plan with code: " << result << std::endl;
    return EXIT_FAILURE;
}

cudaDeviceSynchronize();
auto pin1 = chrono::high_resolution_clock::now();

result = cufftExecR2C(forwardImagePlan, d_image, d_freqImage);
if (result != CUFFT_SUCCESS)
{
    std::cerr << "forward FFT execution for image failed with code: " << result << std::endl;
    return EXIT_FAILURE;
}

cudaDeviceSynchronize();
auto pin2 = chrono::high_resolution_clock::now();

Upsampling<<<upsamplingGridSize, upsamplingBlockSize>>>(d_freqImage, imageWidth, imageHeight, d_freqUpsampling, upsamplingWidth);

cudaDeviceSynchronize();
auto pin3 = chrono::high_resolution_clock::now();

result = cufftExecC2C(inverseImagePlan, d_freqUpsampling, d_upsamplingInverse, CUFFT_INVERSE);
if (result != CUFFT_SUCCESS)
{
    std::cerr << "inverse FFT execution for image failed with code: " << result << std::endl;
    return EXIT_FAILURE;
}

cudaDeviceSynchronize();
auto pin4 = chrono::high_resolution_clock::now();
It’s normal for CUFFT C2C to be slower than C2R or R2C, for the reasons you indicate (not the conjugate symmetry per se, just the sizes: C2R or R2C vs. C2C also implies a roughly 2x size difference for one of the steps). No, I don’t have an explanation for 0.5 ms vs. 7.7 ms, but this is not a complete code, and you’ve left out details such as which GPU you are running on and other data that would be relevant for a performance analysis. You can use a GPU profiler such as Nsight Systems to look at contributors to the performance difference.
Hi Robert, thank you so much for your reply! I’m using an RTX A4000 GPU. I tried profiling with Nsight Compute, and it seems the kernels “regular_bluestein_fft” and “multi_bluestein_fft” take the majority of the time. The profiler flags uncoalesced global memory loads and stores in them. However, aren’t those part of the FFT library? I’m not sure how I can improve their runtime from my code.
Here is my code: upsamplingFFT.cu - Google Drive
Here is the report created by nsight compute: fft.profout.ncu-rep - Google Drive
Sorry to keep bothering you, and thank you for your time!
I don’t know if you are also the one asking that question, but I have updated my answer there to show how you can do the image upscaling with R2C and C2R.
You won’t be able to affect the behavior of cuFFT kernels. The purpose of suggesting the profiler was to get a general view of the things that contribute to the difference between 0.5 ms and 7.7 ms. Ultimately CUFFT is closed source, so you can’t modify it, affect its behavior, or “optimize” the code itself. However, the profiling may give you an understanding of what causes the difference at a macro level. (And if it were me, the profiler I would start with is Nsight Systems, not Nsight Compute.)
If you are interested in optimizing CUFFT behavior, you can study the cuFFT guide for instructions on taking over management of data allocations (the “workspace”), although that may or may not be an issue here — the profiler results might shed light on that.
The other option I can think of for those who want to optimize CUFFT “behavior” would be to try cuFFTDx. I’m not really suggesting you can do better than cuFFT on reasonably sized transforms, but it removes the closed-source barrier, to some degree.
Yes, I posted that question too! And thank you so much for the updated answer!
I tried your updated split function and used C2R for the inverse transform, but somehow cufftExecC2R is also slow in my code, taking around 9.3 ms. I’m a little confused now; it seems the problem is not C2C after all.
Edit: I timed your code and the runtimes seem reasonable: R2C took 0.4 ms, and C2R after 2x upsampling took 2 ms. I guess the slowdown is caused by some other problem in my code.