I would like to know whether it will ever be possible to perform a multi-GPU 2D FFT with the cuFFTMp library without permuting the order of the input data. I ask mainly because, in my application, going through a “DEVICE_TO_DEVICE” memcpy is expensive in terms of compute time.
I don’t think it will ever be possible, because of the way a parallel FFT is computed. A multidimensional FFT is simply a 1D FFT applied along each dimension.
For example, take a forward 2D FFT where the data is initially distributed along the y axis across all GPUs. Each GPU can then compute a batched 1D FFT over its own rows in the x dimension. To compute the 1D FFTs in the y dimension, a transposition has to be performed so that the data is distributed along the x axis, and then the batched computation can be done.
After that, it would be expensive to perform another transposition to return the data back to the natural distribution, so the last transposition is usually skipped.
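To make the steps above concrete, here is a minimal single-process sketch in plain Python. It is not cuFFTMp code: the naive `dft` stands in for a batched 1D FFT, `transpose` stands in for the inter-GPU exchange, and all function names are made up for illustration. It shows that skipping the last transposition leaves the spectrum indexed as [kx][ky], and that one more transpose recovers the natural [ky][kx] order.

```python
import cmath

def dft(seq):
    """Naive O(n^2) 1D DFT; a stand-in for one batched 1D FFT."""
    n = len(seq)
    return [sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(a):
    """Swap the two axes; on multiple GPUs this is the all-to-all exchange."""
    return [list(col) for col in zip(*a)]

def dft2_reference(a):
    """2D DFT straight from the definition, in natural [ky][kx] order."""
    ny, nx = len(a), len(a[0])
    return [[sum(a[y][x] * cmath.exp(-2j * cmath.pi * (ky * y / ny + kx * x / nx))
                 for y in range(ny) for x in range(nx))
             for kx in range(nx)]
            for ky in range(ny)]

def dft2_permuted(a):
    """Row DFTs, one transpose, row DFTs again. Output is indexed [kx][ky]."""
    rows_done = [dft(row) for row in a]    # transforms along x: [y][kx]
    swapped = transpose(rows_done)         # now [kx][y]
    return [dft(row) for row in swapped]   # transforms along y: [kx][ky]

def close(a, b, tol=1e-9):
    """Element-wise comparison of two equally shaped 2D lists."""
    return all(abs(p - q) < tol for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# 3 rows (y) by 4 columns (x) of arbitrary complex test data.
data = [[complex(3 * y + x, y - x) for x in range(4)] for y in range(3)]

permuted = dft2_permuted(data)    # last transposition skipped
natural = transpose(permuted)     # the extra, optional transposition
reference = dft2_reference(data)

print(len(permuted), len(permuted[0]))   # shape of the permuted result
print(close(natural, reference))         # natural order matches the definition
```

The values are the same in both layouts; only the indexing differs, which is why the second transpose is pure data movement.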
However, what you can do is adjust the distribution of the other data used in the spectral domain so that it is also distributed along the x axis; then there is no need to perform the last transposition. Example:
Hi
Is there any way to “return the data back to the natural distribution”? Even if it is expensive, it’s fine with me. But it’s very important for my code.
The real costs of 2D and 3D FFTs are in data permutations and inter-processor communication. A new breakthrough algorithm for multidimensional FFTs (linked on LinkedIn) eliminates such overheads. Ask NVIDIA to adopt this new algorithm in cuFFT and cuFFTMp for the benefit of all users.