Just to be sure, the only way to split batched 1D R2C/C2R forward/inverse cufft ftt execution between 2 GPUs is via 'cufftXt' library?

CudaaduC · April 14, 2017, 2:02am

I have a very large batched R2C->convolve->C2R cufft process which I define and configure using cufft and ‘cufftPlanMany’.

Currently this works well using 1 GPU, but I wanted to split among 2 GPUs and tried to set devices and streams to get concurrent execution using cuFFT using the same general approach I would use for my own custom kernels.

Looking at the profile output I see that the cufft library calls are serialized between the 2 GPUs, even though I use streams and memcpyAsync(). I mean that GPU 0 finishes its work then GPU 1 starts, even though they are not dependent on each other’s result.

Before anyone jumps down my throat I am very familiar with how to get concurrent kernel execution across multiple GPUs for my own kernels, but this time I would prefer to use cuFFT. It appears that there is some host side equivalent of ‘cudaDeviceSynchronize()’ in the black box cufft calls causing this serialization between the two GPUs.

Did look over the cufftXT documentation and the example in the CUDA 8.0 SDK. Not easy reading, and before I try to make this work for my use case wanted to make sure there is no other way to do this without using cufftXT.

I get that cufftXt provides multi-GPU functionality, but is that the only way to get concurrent execution using two GPUs when using cufft?

Robert_Crovella · April 14, 2017, 2:56am

I didn’t have any trouble with it. I suppose there are any number of possibilities as to why you may be having trouble with no code and no description of your system.

I tried it on Ubuntu 14.04, on a system with CUDA 8.0.61 and two Titan X (Pascal) GPUs. I used this code:

$ cat t98.cu
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

const size_t sig_size = 1<<26;
typedef cufftComplex ctype;
typedef cufftReal rtype;

int main(){

  cufftResult res;
  rtype *d_idata0, *d_idata1;
  ctype *d_odata0, *d_odata1;
  cufftHandle p0, p1;
  cudaStream_t s0, s1;
  cudaSetDevice(0);
  res = cufftPlan1d(&p0, sig_size, CUFFT_R2C, 1);
  assert(res == CUFFT_SUCCESS);
  cudaStreamCreate(&s0);
  res = cufftSetStream(p0, s0);
  assert(res == CUFFT_SUCCESS);
  cudaMalloc(&d_idata0, sizeof(rtype)*sig_size);
  cudaMalloc(&d_odata0, sizeof(ctype)*(sig_size*2+1));
  cudaSetDevice(1);
  res = cufftPlan1d(&p1, sig_size, CUFFT_R2C, 1);
  assert(res == CUFFT_SUCCESS);
  cudaStreamCreate(&s1);
  res = cufftSetStream(p1, s1);
  assert(res == CUFFT_SUCCESS);
  cudaMalloc(&d_idata1, sizeof(rtype)*sig_size);
  cudaMalloc(&d_odata1, sizeof(ctype)*(sig_size*2+1));
  res = cufftExecR2C(p1, d_idata1, d_odata1);
  assert(res == CUFFT_SUCCESS);
  cudaSetDevice(0);
  res = cufftExecR2C(p0, d_idata0, d_odata0);
  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  cudaSetDevice(1);
  cudaDeviceSynchronize();
  return 0;

}
$ nvcc -arch=sm_61 -o t98 t98.cu -lcufft
$

And this is what I observed using the visual profiler (zoomed into the relevant portion of the timeline):

Note that the above image is part of a “temporary” imageshack account which will likely expire/disappear in about 30 days. For future readers, the image shows nearly exact overlap between the 4 kernels launched by the cufft exec call on device 0 with the 4 kernels launched by the cufft exec call on device 1.

CudaaduC · April 14, 2017, 3:16am

Ok, this helps because I wanted to eliminate the possibility that I made some kind of error which appears to be the case.

My case is a bit more complicated but I wanted to see if I can avoid cufftXt and this shows me that it is possible. My test case was very similar to what you posted, so interesting that it did not work as expected.

I am mixing Pascal and Maxwell GPUs, but in the past this did not cause any issues with my multi-GPU applications.

Thanks!

CudaaduC · April 14, 2017, 3:52am

Just to verify your code sample on my Windows system generates the same NVVP output as you posted above. Thanks

Also by making some modifications to the code I was able to get mostly concurrent multi-GPU execution with cuFFT

Topic		Replies	Views
cufft concurrent streams CUDA Programming and Performance	2	1889	August 20, 2014
Complex-to-Complex Convolution using Two GPUs Example GPU-Accelerated Libraries	1	1619	June 15, 2015
cuFFT cudaFuncSetCacheConfig GPU-Accelerated Libraries	3	1674	November 9, 2012
Cufft_R2C and Cufft_C2R are inaccurate GPU-Accelerated Libraries	2	1732	April 11, 2014
cuFFT + streams CUDA Programming and Performance	8	5081	May 18, 2018
cuFFT & OpenMP Asynchronous multi-GPU 3D FFT transformation CUDA Programming and Performance	2	2166	September 14, 2010
The multi-gpu fft 3D R2C problem GPU-Accelerated Libraries	5	661	September 9, 2019
cufftXt batch 1D GPU-Accelerated Libraries	12	2181	October 15, 2019
Implementing cuFFT with streams problem GPU-Accelerated Libraries cufft	3	881	October 12, 2021
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	5125	October 29, 2019

Just to be sure, the *only* way to split batched 1D R2C/C2R forward/inverse cufft ftt execution between 2 GPUs is via 'cufftXt' library?

Related topics

Just to be sure, the only way to split batched 1D R2C/C2R forward/inverse cufft ftt execution between 2 GPUs is via 'cufftXt' library?