I am using the cuFFT library on several GPUs.
It looks like the copies between the host and the devices are synchronous, so the host-to-device and device-to-host transfers take up a large share of the run time.
Is it possible to run cufftXtMemcpy asynchronously on multiple devices?
Not at this time. Could you provide more insight into your use case?
As of CUDA 11.2 (cuFFT 10.4.0), cufftSetStream() is supported in multiple GPU cases. However, calls to cufftXtMemcpy() are still synchronous across multiple GPUs when using streams. In previous versions of cuFFT, cufftSetStream() returns an error in the multiple GPU case. Likewise, calling certain multi-GPU functions such as cufftXtSetCallback() after setting a stream with cufftSetStream() will result in an error (see API functions for more details).
Also, you might try pinning the host memory to speed up the transfers.