CUFFT_INVERSE A4500 vs A40

I am performing a 2D convolution by taking the Fourier transform, multiplying by a mathematical kernel, and then taking the inverse Fourier transform. I am currently comparing the same GPU-accelerated code on A4500 GPUs and A40 GPUs. After the convolution, the A4500 result agrees with the CPU convolution. However, the A40 result agrees only in k-space. After the inverse transform that follows the k-space manipulations, the A40 output diverges from both the A4500 and CPU results.

Additionally, Fourier transforming and immediately inverse transforming the variable with no k-space operations yields correct results. Hence, this seems to be an issue isolated to the k-space operations.
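For reference, the pipeline described above (forward transform, k-space multiplication, inverse transform) can be sketched on the CPU as follows. This is only a minimal 1D illustration with a naive O(N^2) DFT, not the poster's 2D cuFFT code; the names `naive_dft`, `fft_convolve` and `direct_convolve` are hypothetical. It follows the cuFFT convention that the forward transform uses sign -1, the inverse uses sign +1, and neither applies a 1/N factor, so the caller has to normalize after the round trip.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(N^2) DFT. sign = -1 is the forward transform, sign = +1 the
// unnormalized inverse (no 1/N factor, matching the cuFFT convention).
std::vector<cplx> naive_dft(const std::vector<cplx>& x, int sign) {
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * two_pi *
                static_cast<double>((j * k) % n) / static_cast<double>(n);
            sum += x[j] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution via k-space: forward transform both inputs,
// multiply pointwise, inverse transform, then divide by N.
std::vector<cplx> fft_convolve(const std::vector<cplx>& signal,
                               const std::vector<cplx>& kernel) {
    assert(signal.size() == kernel.size());
    const std::size_t n = signal.size();
    std::vector<cplx> s_hat = naive_dft(signal, -1);
    const std::vector<cplx> k_hat = naive_dft(kernel, -1);
    for (std::size_t i = 0; i < n; ++i) {
        s_hat[i] *= k_hat[i];  // the "k-space manipulation"
    }
    std::vector<cplx> result = naive_dft(s_hat, +1);
    for (cplx& v : result) {
        v /= static_cast<double>(n);  // normalization the library does not do
    }
    return result;
}

// Direct circular convolution, used as the reference to compare against.
std::vector<cplx> direct_convolve(const std::vector<cplx>& signal,
                                  const std::vector<cplx>& kernel) {
    assert(signal.size() == kernel.size());
    const std::size_t n = signal.size();
    std::vector<cplx> out(n, cplx(0.0, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            out[i] += signal[j] * kernel[(i + n - j) % n];
        }
    }
    return out;
}
```

Comparing `fft_convolve` against `direct_convolve` on a small input is one way to build a CPU reference that is independent of the GPU library version.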

What could be the problem here? How could it be fixed?

I would generally expect CUFFT to produce the same output, whether the transform is performed on an A40 or on an A4500, to within a small error epsilon, for C2C transforms.
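One way to make "to within a small error epsilon" concrete is a relative L2 error between two result buffers after they have been copied back to the host. The sketch below is an assumption about how such a check might look, not something CUFFT provides: `relative_l2_error` and `agrees_within` are hypothetical helper names, and a suitable tolerance depends on the transform size and precision.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Single-precision complex values, as a C2C transform would produce them.
using cplxf = std::complex<float>;

// Relative L2 error ||result - reference|| / ||reference||, accumulated in
// double so the metric itself adds little rounding error.
double relative_l2_error(const std::vector<cplxf>& result,
                         const std::vector<cplxf>& reference) {
    assert(result.size() == reference.size());
    double diff2 = 0.0;
    double ref2 = 0.0;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const std::complex<double> r(result[i].real(), result[i].imag());
        const std::complex<double> e(reference[i].real(), reference[i].imag());
        diff2 += std::norm(r - e);  // std::norm is the squared magnitude
        ref2 += std::norm(e);
    }
    if (ref2 == 0.0) {
        return std::sqrt(diff2);  // fall back to absolute error
    }
    return std::sqrt(diff2 / ref2);
}

// True when the two buffers agree to within the caller-chosen tolerance.
bool agrees_within(const std::vector<cplxf>& result,
                   const std::vector<cplxf>& reference,
                   double eps) {
    return relative_l2_error(result, reference) <= eps;
}
```

A single scalar like this makes it easier to tell rounding-level differences between GPUs apart from a genuinely wrong result, such as a permuted or unscaled output.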

If I witnessed a significant discrepancy, the first thing I would do would be to switch to the latest CUDA version (12.8.1, currently) if you are not already on that, to see if the behavior was the same.

If it is, then it is probably best to create the shortest reproducer code that shows the variation.

If you are doing R2C/C2R transforms, and your “k-space manipulations” have violated the requirement for Hermitian symmetry in the C2R input, then all bets are off; that is a defect in your code.
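To illustrate what the Hermitian symmetry requirement means: the spectrum of a real signal satisfies X[k] = conj(X[(N - k) mod N]), and a k-space modification has to preserve that property if the inverse is expected to be real. The check below is a rough 1D sketch over a full-length spectrum; `is_hermitian` is a hypothetical helper, and an actual R2C/C2R layout stores only the non-redundant half of the spectrum, so this is not a drop-in test for that layout.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Returns true when a full-length 1D spectrum satisfies
// X[k] == conj(X[(N - k) % N]) for every k, to within tol.
// For k = 0 (and k = N/2 when N is even) this forces a zero imaginary part.
bool is_hermitian(const std::vector<cplx>& spectrum, double tol) {
    assert(tol >= 0.0);
    const std::size_t n = spectrum.size();
    for (std::size_t k = 0; k < n; ++k) {
        const cplx mirrored = std::conj(spectrum[(n - k) % n]);
        if (std::abs(spectrum[k] - mirrored) > tol) {
            return false;
        }
    }
    return true;
}
```

For example, the spectrum of the real signal {1, 2, 3, 4} is {10, -2+2i, -2, -2-2i}, which passes the check. Multiplying only the k = 1 entry by i gives -2-2i there, which no longer mirrors the k = 3 entry, so the check fails.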


To clarify, these are C2C in-place transforms. Additionally, all operations have been verified on the A4500 GPUs against CPU code that uses R2C/C2R transforms.

After inspecting the inverse transform further, it appears that the inverse transforms on the A40/A100 GPUs are somehow rearranging the output order in an unpredictable way. On the A4500 GPUs, the rearrangement was reliably reverted from permuted form using CUFFT_COPY_DEVICE_TO_DEVICE. Is this caused by A40/A100 optimizations? Can these options be changed?

Yes, a transform may involve multiple steps (e.g. multiple kernel calls “under the hood”) and the exact sequence of steps may vary by GPU (type).

It’s not obvious to me that this should present a concern unless you are intruding on those steps. Do you have appropriate synchronization between the CUFFT steps and the k-space modification steps?


Your suggestion regarding the version pointed to the problem.
On the A4500s, I was using nvhpc/24.1 (CUDA/12.3.0), but on the A40/A100 I was using nvhpc/22.11 (CUDA/11.8.0). After switching the A40/A100 to nvhpc/24.1 (CUDA/12.3.0), the code now verifies correctly. Thank you for your help!
