Just to be sure, the *only* way to split batched 1D R2C/C2R forward/inverse cufft ftt execution between 2 GPUs is via 'cufftXt' library?

I have a very large batched R2C->convolve->C2R cufft process which I define and configure using cufft and ‘cufftPlanMany’.

Currently this works well using 1 GPU, but I wanted to split among 2 GPUs and tried to set devices and streams to get concurrent execution using cuFFT using the same general approach I would use for my own custom kernels.

Looking at the profile output I see that the cufft library calls are serialized between the 2 GPUs, even though I use streams and memcpyAsync(). I mean that GPU 0 finishes its work then GPU 1 starts, even though they are not dependent on each other’s result.

Before anyone jumps down my throat I am very familiar with how to get concurrent kernel execution across multiple GPUs for my own kernels, but this time I would prefer to use cuFFT. It appears that there is some host side equivalent of ‘cudaDeviceSynchronize()’ in the black box cufft calls causing this serialization between the two GPUs.

Did look over the cufftXT documentation and the example in the CUDA 8.0 SDK. Not easy reading, and before I try to make this work for my use case wanted to make sure there is no other way to do this without using cufftXT.

I get that cufftXt provides multi-GPU functionality, but is that the only way to get concurrent execution using two GPUs when using cufft?

I didn’t have any trouble with it. I suppose there are any number of possibilities as to why you may be having trouble with no code and no description of your system.

I tried it on Ubuntu 14.04, on a system with CUDA 8.0.61 and two Titan X (Pascal) GPUs. I used this code:

$ cat t98.cu
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

const size_t sig_size = 1<<26;
typedef cufftComplex ctype;
typedef cufftReal rtype;

int main(){

  cufftResult res;
  rtype *d_idata0, *d_idata1;
  ctype *d_odata0, *d_odata1;
  cufftHandle p0, p1;
  cudaStream_t s0, s1;
  res = cufftPlan1d(&p0, sig_size, CUFFT_R2C, 1);
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(p0, s0);
  assert(res == CUFFT_SUCCESS);
  cudaMalloc(&d_idata0, sizeof(rtype)*sig_size);
  cudaMalloc(&d_odata0, sizeof(ctype)*(sig_size*2+1));
  res = cufftPlan1d(&p1, sig_size, CUFFT_R2C, 1);
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(p1, s1);
  assert(res == CUFFT_SUCCESS);
  cudaMalloc(&d_idata1, sizeof(rtype)*sig_size);
  cudaMalloc(&d_odata1, sizeof(ctype)*(sig_size*2+1));
  res = cufftExecR2C(p1, d_idata1, d_odata1);
  assert(res == CUFFT_SUCCESS);
  res = cufftExecR2C(p0, d_idata0, d_odata0);
  assert(res == CUFFT_SUCCESS);
  return 0;

$ nvcc -arch=sm_61 -o t98 t98.cu -lcufft

And this is what I observed using the visual profiler (zoomed into the relevant portion of the timeline):

Note that the above image is part of a “temporary” imageshack account which will likely expire/disappear in about 30 days. For future readers, the image shows nearly exact overlap between the 4 kernels launched by the cufft exec call on device 0 with the 4 kernels launched by the cufft exec call on device 1.

Ok, this helps because I wanted to eliminate the possibility that I made some kind of error which appears to be the case.

My case is a bit more complicated but I wanted to see if I can avoid cufftXt and this shows me that it is possible. My test case was very similar to what you posted, so interesting that it did not work as expected.

I am mixing Pascal and Maxwell GPUs, but in the past this did not cause any issues with my multi-GPU applications.


Just to verify your code sample on my Windows system generates the same NVVP output as you posted above. Thanks

Also by making some modifications to the code I was able to get mostly concurrent multi-GPU execution with cuFFT