Why permuted order for cufftMp multi-GPU 2D FFT?


I would like to know if it will ever be possible to perform a multi-GPU 2D FFT with the cufftMp library without a permutation of the order of the input data. This is mainly because, for my personal application, passing through a “DEVICE_TO_DEVICE” Memcpy is expensive in terms of computational time.

Thank you for your answers/opinions.

Hi Emanuele,

I don’t think it will ever be possible, because of the way a parallel FFT is computed. A multidimensional FFT is simply a 1D FFT applied along each dimension in turn.

For example, take a forward 2D FFT where the data is initially distributed along the y axis across all GPUs. Each GPU then holds complete rows, so it can compute a batched 1D FFT along the x dimension locally. To compute the 1D FFTs along the y dimension, a transposition has to be performed so that the data is distributed along the x axis; each GPU then holds complete columns and the batched computation can again be done locally.

After that, it would be expensive to perform another transposition to return the data to the natural distribution, so the last transposition is usually skipped. That is why the output arrives in permuted order.
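To make the steps above concrete, here is a minimal single-process sketch in NumPy. It is not cuFFTMp code: the "GPUs" are just list entries, the all-to-all transposition is simulated with a concatenate and a re-split, and the names `slab_fft2_permuted` and `n_gpus` are made up for illustration.

```python
import numpy as np

def slab_fft2_permuted(a, n_gpus):
    """Simulate a slab-decomposed forward 2D FFT.

    Input is split along y (rows); the returned slabs are split along x
    (columns), i.e. the final transposition back is skipped.
    """
    # Natural distribution: each simulated GPU owns a block of rows.
    slabs = np.array_split(a, n_gpus, axis=0)
    # Step 1: batched 1D FFT along x, fully local to each GPU.
    slabs = [np.fft.fft(s, axis=1) for s in slabs]
    # Step 2: global transposition (an all-to-all on real hardware).
    # Afterwards each simulated GPU owns a block of columns.
    full = np.concatenate(slabs, axis=0)
    slabs = np.array_split(full, n_gpus, axis=1)
    # Step 3: batched 1D FFT along y, again fully local.
    slabs = [np.fft.fft(s, axis=0) for s in slabs]
    return slabs

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 12))      # ny = 8 rows, nx = 12 columns
slabs = slab_fft2_permuted(a, n_gpus=4)

# Gluing the column blocks back together reproduces the ordinary 2D FFT.
print(np.allclose(np.concatenate(slabs, axis=1), np.fft.fft2(a)))
```

The values are the same as those of a regular 2D FFT; only the ownership of the data has changed from row blocks to column blocks.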

However, what you can do is adjust the distribution of the other data used in the spectral domain and distribute it along the x axis as well, so there is no need to perform the last transposition. Example:

[Figure: natural distribution along y; shuffled distribution along x]
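As a rough illustration of that suggestion, the NumPy sketch below simulates it in a single process. It is not cuFFTMp code, and the spectral multiplier `h` is a hypothetical Gaussian filter chosen only for the example. The point is that `h` is split along x, the same way as the shuffled FFT output, so the multiplication is purely local, and the inverse transform brings the data back to the natural distribution with a single transposition.

```python
import numpy as np

n_gpus = 4
ny, nx = 8, 12
rng = np.random.default_rng(0)
a = rng.standard_normal((ny, nx))

# Hypothetical spectral-domain data: a Gaussian multiplier on the (ky, kx) grid.
ky = np.fft.fftfreq(ny)[:, None]
kx = np.fft.fftfreq(nx)[None, :]
h = np.exp(-(kx**2 + ky**2))

# Forward transform: natural (split along y) -> shuffled (split along x).
rows = np.array_split(a, n_gpus, axis=0)
rows = [np.fft.fft(r, axis=1) for r in rows]              # local FFT along x
cols = np.array_split(np.concatenate(rows, axis=0), n_gpus, axis=1)  # transposition
cols = [np.fft.fft(c, axis=0) for c in cols]              # local FFT along y

# Distribute the multiplier along x too, so each GPU multiplies locally.
h_cols = np.array_split(h, n_gpus, axis=1)
cols = [c * hc for c, hc in zip(cols, h_cols)]

# Inverse transform: shuffled (split along x) -> natural (split along y).
cols = [np.fft.ifft(c, axis=0) for c in cols]             # local inverse FFT along y
rows = np.array_split(np.concatenate(cols, axis=1), n_gpus, axis=0)  # transposition
rows = [np.fft.ifft(r, axis=1) for r in rows]             # local inverse FFT along x
result = np.concatenate(rows, axis=0)

# Same answer as the undistributed computation.
reference = np.fft.ifft2(np.fft.fft2(a) * h)
print(np.allclose(result, reference))
```

A forward plus inverse round trip therefore costs two transpositions in total instead of four, and no extra copy is needed to reorder the spectral data.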