I have a very large batched R2C -> convolve -> C2R pipeline which I define and configure with cuFFT using cufftPlanMany().
This currently works well on 1 GPU, but I wanted to split the work across 2 GPUs, so I tried setting devices and streams to get concurrent execution, using the same general approach I would use for my own custom kernels.
Looking at the profiler output, I see that the cuFFT library calls are serialized between the 2 GPUs, even though I use streams and cudaMemcpyAsync(). By that I mean GPU 0 finishes all of its work before GPU 1 starts, even though neither depends on the other's results.
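For reference, this is roughly the pattern I am using (a simplified sketch with error checking omitted; FFT_SIZE, batchPerGPU, chunkElems, and h_in are placeholders for my actual sizes and host buffer):

```cpp
// Sketch of the per-GPU setup: one plan, one stream, and one set of
// device buffers per GPU. Error checks omitted for brevity.
const int nGPU = 2;
cufftHandle   plan[nGPU];
cudaStream_t  stream[nGPU];
cufftReal    *d_in[nGPU];
cufftComplex *d_out[nGPU];

int n[1] = { FFT_SIZE };  // 1-D transform length
for (int g = 0; g < nGPU; ++g) {
    cudaSetDevice(g);
    cudaStreamCreate(&stream[g]);
    cudaMalloc(&d_in[g],  sizeof(cufftReal)    *  FFT_SIZE        * batchPerGPU);
    cudaMalloc(&d_out[g], sizeof(cufftComplex) * (FFT_SIZE/2 + 1) * batchPerGPU);
    cufftPlanMany(&plan[g], 1, n,
                  NULL, 1, FFT_SIZE,         // input layout (contiguous)
                  NULL, 1, FFT_SIZE/2 + 1,   // output layout (contiguous)
                  CUFFT_R2C, batchPerGPU);
    cufftSetStream(plan[g], stream[g]);      // bind each plan to its stream
}

// Launch phase: issue copies and transforms asynchronously per device.
// h_in is pinned host memory so the H2D copies are truly async.
for (int g = 0; g < nGPU; ++g) {
    cudaSetDevice(g);
    cudaMemcpyAsync(d_in[g], h_in + g * chunkElems,
                    sizeof(cufftReal) * chunkElems,
                    cudaMemcpyHostToDevice, stream[g]);
    cufftExecR2C(plan[g], d_in[g], d_out[g]);  // should run in stream[g]
    // ... pointwise convolve kernel in stream[g], then C2R + D2H copy ...
}
```

Despite binding each plan to its own stream via cufftSetStream(), the profiler shows the two devices' work running back-to-back rather than overlapping.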
Before anyone jumps down my throat: I am very familiar with how to get concurrent kernel execution across multiple GPUs with my own kernels, but this time I would prefer to use cuFFT. It appears there is some host-side equivalent of cudaDeviceSynchronize() inside the black-box cuFFT calls that is causing this serialization between the two GPUs.
I did look over the cufftXt documentation and the example in the CUDA 8.0 SDK. It is not easy reading, and before I try to make it work for my use case I wanted to make sure there is no other way to do this without cufftXt.
I understand that cufftXt provides multi-GPU functionality, but is it the only way to get concurrent execution across two GPUs when using cuFFT?