Why permuted order for cuFFTMp multi-GPU 2D FFT?

Hi,

I would like to know whether it will ever be possible to perform a multi-GPU 2D FFT with the cuFFTMp library without a permutation of the order of the input data. I ask mainly because, in my application, going through a “DEVICE_TO_DEVICE” memcpy is expensive in terms of run time.

Thank you for your answers/opinions.
Emanuele

Hi Emanuele,

I don’t think it will ever be possible, because of the way a parallel FFT is computed. A multidimensional FFT is simply a 1D FFT applied along each dimension in turn.

For example, take a forward 2D FFT. The data is initially distributed along the y axis across all GPUs, so each GPU holds complete rows and can compute a batched 1D FFT along x for its own rows. To compute the 1D FFTs along y, a transposition has to be performed so that the data is distributed along the x axis; after that, the batched computation along y can be done locally on each GPU.

After that it would be expensive to perform another transposition to return the data back to the natural distribution, so usually the last transposition is skipped.
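To make the steps above concrete, here is a minimal single-process sketch in plain Python. It is not cuFFTMp code: the function names are made up for illustration, and a naive DFT stands in for the batched 1D FFTs. It only shows that "row transforms, one transposition, row transforms" leaves the spectrum in transposed order, and that one more transposition would restore the natural order.

```python
import cmath

def dft_1d(row):
    """Naive O(n^2) forward DFT of one row (stand-in for a batched 1D FFT)."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(a):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def fft2_skip_last_transpose(a):
    """Input a[y][x]; output in shuffled order out[kx][ky]."""
    step1 = [dft_1d(r) for r in a]       # 1D transforms along x, one per row
    step2 = transpose(step1)             # now indexed [kx][y]
    return [dft_1d(r) for r in step2]    # 1D transforms along y -> [kx][ky]

def dft_2d_direct(a):
    """Reference 2D DFT straight from the definition, natural order out[ky][kx]."""
    ny, nx = len(a), len(a[0])
    return [[sum(a[y][x] * cmath.exp(-2j * cmath.pi * (ky * y / ny + kx * x / nx))
                 for y in range(ny) for x in range(nx))
             for kx in range(nx)] for ky in range(ny)]

def max_abs_diff(a, b):
    """Largest element-wise difference between two equally shaped matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Hypothetical 4 x 3 input (ny = 4 rows, nx = 3 columns).
data = [[1, 2, 3], [4, 5, 6], [7, 8, 10], [0, -1, 2]]
shuffled = fft2_skip_last_transpose(data)   # shape nx x ny
natural = transpose(shuffled)               # the extra transposition, shape ny x nx
error = max_abs_diff(natural, dft_2d_direct(data))
```

In a real multi-GPU run each transposition is an all-to-all exchange between devices, which is why skipping the last one saves time.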

However, what you can do is adjust the distribution of the other data used in the spectral domain and distribute it along the x axis as well, so there is no need to perform the last transposition. Example:

[Figures: natural distribution along y; shuffled distribution along x]
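As a rough illustration of that idea, the following plain-Python sketch (again not cuFFTMp code; all names and the filter are hypothetical) transposes a spectral-domain filter once up front, applies it directly to the shuffled spectrum, and then runs the inverse transform starting from the shuffled layout, so the result comes back in natural order without any extra transposition of the spectrum.

```python
import cmath

def dft_1d(row, sign=-1):
    """Naive 1D DFT; sign=-1 is forward, sign=+1 is the unnormalised inverse."""
    n = len(row)
    return [sum(row[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(a):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def forward_shuffled(a):
    """a[y][x] -> spectrum in shuffled order [kx][ky] (last transposition skipped)."""
    return [dft_1d(r) for r in transpose([dft_1d(r) for r in a])]

def inverse_from_shuffled(s):
    """Shuffled spectrum s[kx][ky] -> data in natural order [y][x], normalised."""
    nx, ny = len(s), len(s[0])
    step = transpose([dft_1d(r, +1) for r in s])   # inverse along y, then [y][kx]
    return [[v / (nx * ny) for v in dft_1d(r, +1)] for r in step]

ny, nx = 2, 4
data = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Hypothetical filter given in natural order h[ky][kx]: a phase ramp in kx,
# which corresponds to a circular shift by one sample along x.
h_natural = [[cmath.exp(-2j * cmath.pi * kx / nx) for kx in range(nx)]
             for ky in range(ny)]
h_shuffled = transpose(h_natural)   # done once, outside the time-critical loop

spectrum = forward_shuffled(data)   # indexed [kx][ky]
filtered = [[s * h for s, h in zip(s_row, h_row)]
            for s_row, h_row in zip(spectrum, h_shuffled)]
result = inverse_from_shuffled(filtered)
shifted = [[round(v.real, 6) for v in row] for row in result]
```

Here `shifted` should equal the input with every row rotated by one position along x. The point of the design is that only the filter, which is usually constant, pays for a transposition, and it pays once.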

David

Hi
Is there any way to “return the data back to the natural distribution”? Even if it is expensive, it’s fine with me. But it’s very important for my code.

Thank you very much
Aaryan

Hi
What you asked is possible if Nvidia adopts the algorithm described here.

Please let me know if you have questions.

The real costs of 2D and 3D FFTs are in data permutations and inter-processor communication. A new breakthrough algorithm for multidimensional FFT (link: LinkedIn) eliminates such overheads. Ask Nvidia to adopt this new algorithm in cuFFT and cuFFTMp for the benefit of all users.