Why permuted order for cufftMp multi-GPU 2D FFT?

emanuele.derubeis2 · May 4, 2023, 7:51am

Hi,

I would like to know if it will ever be possible to perform a multi-GPU 2D FFT with the cufftMp library without a permutation of the order of the input data. This is mainly because, for my personal application, passing through a “DEVICE_TO_DEVICE” Memcpy is expensive in terms of computational time.

Thank you for your answers/opinions.
Emanuele

dejvbayer · June 14, 2023, 7:33am

Hi Emanuele,

I don’t think it will ever be possible, because of the way a parallel FFT is computed. A multidimensional FFT is simply a 1D FFT applied along each dimension in turn.

Take a forward 2D FFT as an example. The data is initially distributed along the y axis across all GPUs, so each GPU holds complete rows and can compute a batched 1D FFT over its rows in the x dimension. To compute the 1D FFT in the y dimension, a transposition has to be performed so that the data is distributed along the x axis; the batched computation can then be done locally again.

After that it would be expensive to perform another transposition to return the data back to the natural distribution, so usually the last transposition is skipped.
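To make the steps above concrete, here is a minimal single-process sketch in plain Python (no GPUs, no cuFFTMp calls). It uses a naive O(n^2) DFT, and all function names are made up for illustration. It only shows that "row FFTs, one transpose, row FFTs" gives the 2D transform with its axes swapped, which is why skipping the last transposition leaves the output in permuted order.

```python
import cmath

def dft(seq):
    """Naive O(n^2) forward 1D DFT of a sequence of numbers."""
    n = len(seq)
    return [sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def fft2_transposed(a):
    """Row DFTs, one transpose, row DFTs again. Result is indexed [kx][ky]."""
    step1 = [dft(row) for row in a]      # 1D DFT along x for every row (fixed y)
    step2 = transpose(step1)             # each row now holds one kx and all y
    return [dft(row) for row in step2]   # 1D DFT along y

def dft2_direct(a):
    """2D DFT straight from the definition, indexed [ky][kx] (natural order)."""
    ny, nx = len(a), len(a[0])
    return [[sum(a[y][x] * cmath.exp(-2j * cmath.pi * (kx * x / nx + ky * y / ny))
                 for y in range(ny) for x in range(nx))
             for kx in range(nx)]
            for ky in range(ny)]
```

For an input with ny rows and nx columns, `fft2_transposed` returns nx rows of ny values. Applying `transpose` once more would restore the natural `[ky][kx]` order, and in the multi-GPU case that extra step is the communication that usually gets skipped.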

However, what you can do is adjust the distribution of the other data used in the spectral domain so that it is also distributed along the x axis; then there is no need to perform the last transposition. Example:

[Figures: natural.png (natural distribution along y); shuffled.png (shuffled distribution along x)]
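As a rough sketch of what "distributing the other data along x" could look like, the plain-Python snippet below builds each rank's local block of a spectral multiplier directly in the shuffled layout. Everything here is an assumption for illustration: the helper names, the choice of multiplier (-(kx^2 + ky^2), as in a spectral Laplacian), and the rule for splitting indices across ranks, which may not match how cuFFTMp splits sizes that do not divide evenly.

```python
def slab(n, nranks, rank):
    """Half-open range [start, stop) of the n indices owned by `rank`.

    Assumed rule: indices are split as evenly as possible and the first
    (n % nranks) ranks get one extra index.
    """
    base, extra = divmod(n, nranks)
    start = rank * base + min(rank, extra)
    return start, start + base + (1 if rank < extra else 0)

def wavenumber(i, n):
    """Integer wavenumber for DFT index i, in the usual order 0, 1, ..., -2, -1."""
    return i if i < (n + 1) // 2 else i - n

def local_multiplier_shuffled(nx, ny, nranks, rank):
    """Local block of -(kx^2 + ky^2) for a rank that owns a slab of kx.

    Indexed [local kx][ky], matching the layout left by the forward
    transform when the last transposition is skipped.
    """
    x0, x1 = slab(nx, nranks, rank)
    return [[-(wavenumber(ix, nx) ** 2 + wavenumber(iy, ny) ** 2)
             for iy in range(ny)]
            for ix in range(x0, x1)]
```

Each rank can then multiply its local spectrum element by element with this block and run the inverse transform from the shuffled layout, so the multiplier never has to be built in natural order and transposed.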

David

aaryan_kaushik · October 15, 2024, 4:29pm

Hi
Is there any way to “return the data back to the natural distribution”? Even if it is expensive, it’s fine with me. But it’s very important for my code.

Thank you very much
Aaryan

nv2009 · February 21, 2025, 1:05am

Hi
What you asked is possible if Nvidia adopts the algorithm described here.

Please let me know if you have questions.

nv2009 · February 21, 2025, 5:07am

The real costs for 2D and 3D FFTs are in data permutations and inter-processor communications. A new breakthrough algorithm in multidimensional FFT (LinkedIn link) eliminates such overheads. Ask Nvidia to adopt this new algorithm in cuFFT and cuFFTMp for the benefit of all the users.
