The CUFFT documentation states that “Only C2C and Z2Z transform types are supported” on multiple GPUs. (Also, only in-place transforms.) For heavy use of complex-to-real and real-to-complex transforms, one therefore has to choose between
1. Copying data as needed to a complex array and using CUFFT’s multi-GPU routines (and accounting for the permuted order of the results).
2. Writing one’s own routine which, e.g., slab-decomposes a 3D array across GPUs, manually uses CUFFT’s single-GPU 2D and 1D transforms, transposes the data, and performs the remaining dimension(s) of the transform (and possibly redistributes the data).
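For reference, the first option with CUFFT’s multi-GPU (cufftXt) API looks roughly like the following. This is an untested sketch with error checking omitted; the function name, GPU count, and the assumption that the real input has already been widened into the complex host array `h_signal` are mine. My understanding from the docs is that the in-place result is left in permuted order on the GPUs, and that copying back to the host restores natural order:

```cpp
#include <cufft.h>
#include <cufftXt.h>

// Sketch of option 1: in-place multi-GPU C2C on data that was copied
// from a real array into the real parts of a complex array.
void multi_gpu_c2c(cufftComplex *h_signal, int nx, int ny, int nz)
{
    cufftHandle plan;
    cufftCreate(&plan);

    int gpus[2] = {0, 1};               // GPUs to spread the transform over
    cufftXtSetGPUs(plan, 2, gpus);

    size_t work_sizes[2];               // one entry per GPU
    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, work_sizes);

    cudaLibXtDesc *d_signal;            // descriptor for the distributed data
    cufftXtMalloc(plan, &d_signal, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, d_signal, h_signal, CUFFT_COPY_HOST_TO_DEVICE);

    // Executes in place; result is left on the GPUs in permuted order
    cufftXtExecDescriptorC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    // Copying back to the host reorders the result to natural order
    cufftXtMemcpy(plan, h_signal, d_signal, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(d_signal);
    cufftDestroy(plan);
}
```

The extra `cufftXtMemcpy` calls here are exactly the memory-copy cost I am worried about on top of the real-to-complex widening copy.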
The first option is much simpler to program, but I am wary of the cost of the extra memory copies. The second option likely mirrors what CUFFT itself does internally in the C2C case, just with R2C or C2R transforms substituted for the first stage, and I would imagine CUFFT’s implementation of those steps is faster than anything I could write.
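To make the comparison concrete, the per-GPU first stage of option 2 might look like the batched 2D R2C below. This is a sketch only; the function name and arguments are mine, and the inter-GPU transpose and final 1D C2C stage along z are not shown:

```cpp
#include <cufft.h>

// Sketch of the first stage of option 2: on one GPU, perform a batch of
// 2D R2C transforms over the local slab (nz_local planes of size ny x nx),
// out of place. A transpose across GPUs and 1D C2C transforms along z
// would follow (not shown).
void slab_stage1_r2c(cufftReal *d_real, cufftComplex *d_cplx,
                     int nx, int ny, int nz_local)
{
    cufftHandle plan;
    int n[2] = {ny, nx};                       // 2D transform size
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, ny * nx,            // contiguous real input
                  NULL, 1, ny * (nx / 2 + 1),  // Hermitian-packed output
                  CUFFT_R2C, nz_local);        // one transform per plane
    cufftExecR2C(plan, d_real, d_cplx);
    cufftDestroy(plan);
}
```

Note the R2C output is Hermitian-packed (last dimension of length nx/2 + 1), which roughly halves the data that has to cross GPUs in the transpose, versus transforming a full complex array.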
My question is: is it worthwhile to attempt option 2) in order to mitigate the memory-copy cost (and relative inflexibility) of option 1)? Any other experience or insight would be greatly appreciated. (For concreteness, I am writing a code whose current single-GPU version spends ~50% of its runtime in FFTs. I also plan to eventually scale to an MPI (multi-node) implementation, with one or more GPUs per node.)