CUFFT_INVERSE A4500 vs A40

L-Willis · April 1, 2025, 2:49am

I am performing a 2D convolution by taking the Fourier transform, multiplying by a mathematical kernel, and then taking the inverse Fourier transform. I am currently comparing the same GPU-accelerated code on A4500 GPUs and A40 GPUs. After the convolution, the A4500 result matches the CPU convolution. However, the A40 result matches only in k-space. After the k-space manipulations, the inverse FFT output diverges from both the A4500 and CPU results.

Additionally, Fourier transforming and then immediately inverse transforming the variable, with no k-space operations in between, yields correct results. Hence, this seems to be an issue isolated to the k-space operations.
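For readers following along, the pipeline described above (forward transform, pointwise multiply by a kernel, inverse transform) can be sketched on the host in 1D. This is only a minimal illustration, not the code from this thread: it uses a naive O(N^2) DFT instead of cuFFT, and the names `dft`, `convolve_via_dft`, and `direct_convolve` are hypothetical. Like cuFFT, the sketch's transforms are unnormalized, so a forward transform followed by an inverse returns the input scaled by N and the 1/N factor has to be applied explicitly.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(N^2) DFT. sign = -1 is the forward transform, sign = +1 the inverse.
// Neither direction applies a 1/N factor (the same convention cuFFT uses).
std::vector<cplx> dft(const std::vector<cplx>& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * 2.0 * pi *
                static_cast<double>((j * k) % n) / static_cast<double>(n);
            sum += x[j] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution through k-space. Assumes both inputs have the same length.
std::vector<cplx> convolve_via_dft(const std::vector<cplx>& signal,
                                   const std::vector<cplx>& kernel) {
    const std::size_t n = signal.size();
    std::vector<cplx> s_hat = dft(signal, -1);        // forward transform
    const std::vector<cplx> k_hat = dft(kernel, -1);  // kernel in k-space
    for (std::size_t i = 0; i < n; ++i) {
        s_hat[i] *= k_hat[i];                         // k-space manipulation
    }
    std::vector<cplx> result = dft(s_hat, +1);        // unnormalized inverse
    for (cplx& v : result) {
        v /= static_cast<double>(n);                  // explicit 1/N scaling
    }
    return result;
}

// Direct circular convolution, used as a CPU reference.
std::vector<cplx> direct_convolve(const std::vector<cplx>& signal,
                                  const std::vector<cplx>& kernel) {
    const std::size_t n = signal.size();
    std::vector<cplx> out(n, cplx(0.0, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            out[i] += signal[j] * kernel[(i + n - j) % n];
        }
    }
    return out;
}
```

Comparing `convolve_via_dft` against `direct_convolve` on a small input is the same kind of check as comparing the GPU result against the CPU convolution.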

What could be the problem here? How could it be fixed?

Robert_Crovella · April 1, 2025, 1:06pm

I would generally expect CUFFT to produce the same output, whether the transform is performed on an A40 or on an A4500, to within a small error epsilon, for C2C transforms.
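One way to make "within a small error epsilon" concrete is a relative L2 error between two result buffers after they have been copied back to the host. The sketch below is an illustration only: the helper names `relative_l2_error` and `results_agree` are hypothetical, and the default tolerance of 1e-5 is an assumed value for single-precision data, not a documented cuFFT bound.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Relative L2 error ||result - reference|| / ||reference||.
// Accumulates in double to avoid adding rounding error of its own.
// Assumes both vectors have the same length.
double relative_l2_error(const std::vector<std::complex<float>>& result,
                         const std::vector<std::complex<float>>& reference) {
    double diff_sq = 0.0;
    double ref_sq = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const std::complex<double> r(result[i]);
        const std::complex<double> e(reference[i]);
        diff_sq += std::norm(r - e);  // std::norm is the squared magnitude
        ref_sq += std::norm(e);
    }
    // Fall back to the absolute error if the reference is identically zero.
    return ref_sq > 0.0 ? std::sqrt(diff_sq / ref_sq) : std::sqrt(diff_sq);
}

// True if the two buffers match to within the given relative tolerance.
// The default tolerance is an assumption chosen for single precision.
bool results_agree(const std::vector<std::complex<float>>& result,
                   const std::vector<std::complex<float>>& reference,
                   double tolerance = 1e-5) {
    if (result.size() != reference.size()) {
        return false;
    }
    return relative_l2_error(result, reference) <= tolerance;
}
```

A discrepancy of the kind described in this thread (reordered output) would give an error far above any reasonable tolerance, whereas ordinary floating-point differences between GPUs would not.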

If I witnessed a significant discrepancy, the first thing I would do would be to switch to the latest CUDA version (12.8.1, currently) if you are not already on that, to see if the behavior was the same.
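To confirm which CUDA version each machine is actually using, the integer reported by `cudaRuntimeGetVersion` or `cudaDriverGetVersion` can be decoded. CUDA encodes it as 1000 * major + 10 * minor, so 12030 means 12.3. The sketch below is host-only and does not call the CUDA API itself; `CudaVersion`, `decode_cuda_version`, and `same_toolkit` are hypothetical names used for illustration. cuFFT reports its own library version separately through `cufftGetVersion`, which is not decoded here.

```cpp
#include <string>

// Decoded CUDA toolkit version.
struct CudaVersion {
    int major_version;
    int minor_version;
};

// Decode the integer form used by CUDA (1000 * major + 10 * minor),
// for example the value written by cudaRuntimeGetVersion.
CudaVersion decode_cuda_version(int encoded) {
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// Human-readable form such as "12.3".
std::string to_string(const CudaVersion& v) {
    return std::to_string(v.major_version) + "." + std::to_string(v.minor_version);
}

// True if two encoded versions refer to the same major.minor toolkit.
bool same_toolkit(int encoded_a, int encoded_b) {
    const CudaVersion a = decode_cuda_version(encoded_a);
    const CudaVersion b = decode_cuda_version(encoded_b);
    return a.major_version == b.major_version && a.minor_version == b.minor_version;
}
```

Printing the decoded version at program start on every system being compared makes a toolkit mismatch visible before any numerical results are examined.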

If it is, then it is probably best to create the shortest reproducer code that shows the variation.

If you are doing R2C/C2R transforms, and your “k-space manipulations” have violated the requirement for Hermitian symmetry in the C2R input, then all bets are off; that is a defect in your code.
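As a rough illustration of that requirement, a full-length 1D spectrum of a real signal satisfies X[k] = conj(X[(N - k) mod N]), and a k-space modification that is not applied symmetrically breaks it. The check below is a host-side sketch under that assumption; `is_hermitian` and its tolerance parameter are hypothetical, and it examines a full N-point spectrum rather than the half-length layout that C2R input uses.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Returns true if spectrum[k] == conj(spectrum[(N - k) % N]) for every k,
// to within an absolute tolerance. For k = 0 (and k = N/2 when N is even)
// this reduces to requiring a purely real value.
bool is_hermitian(const std::vector<std::complex<double>>& spectrum,
                  double tolerance = 1e-9) {
    const std::size_t n = spectrum.size();
    for (std::size_t k = 0; k < n; ++k) {
        const std::complex<double> mirrored = std::conj(spectrum[(n - k) % n]);
        if (std::abs(spectrum[k] - mirrored) > tolerance) {
            return false;
        }
    }
    return true;
}
```

Running a check like this on the modified spectrum, before the inverse transform, separates a symmetry defect in the k-space step from a problem in the transform itself.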

L-Willis · April 1, 2025, 1:14pm

To clarify, these are C2C in-place transforms. Additionally, all operations have been verified on the A4500 GPUs against CPU code that uses R2C/C2R transforms.

After inspecting the inverse transform further, it appears that the inverse transforms on the A40/A100 GPUs are somehow rearranging the output order in an unpredictable way. The rearrangement on the A4500 GPUs was reliably reverted from permuted form using CUFFT_COPY_DEVICE_TO_DEVICE. Is this caused by A40/A100 optimizations? Can these options be changed?

Robert_Crovella · April 1, 2025, 1:24pm

Yes, a transform may involve multiple steps (e.g. multiple kernel calls “under the hood”) and the exact sequence of steps may vary by GPU (type).

It’s not obvious to me that this should present a concern unless you are intruding on those steps. Do you have appropriate synchronization between the CUFFT steps and the k-space modification steps?

L-Willis · April 1, 2025, 2:24pm

Your suggestion regarding the version identified the problem.
On the A4500s, I was using nvhpc/24.1 (CUDA/12.3.0), but on the A40/A100 I was using nvhpc/22.11 (CUDA/11.8.0). After switching the A40/A100 to nvhpc/24.1 (CUDA/12.3.0), the code now verifies correctly. Thank you for your help!
