Batched FFTs not launching concurrently on multiple GPUs!!

I am trying to get into CUDA and I’m playing around with some data.

I’m currently trying to run batched cuFFTs on 4 K80 GPUs, where each host thread creates a batched cufftPlan and executes it on a set of data. After that I have a kernel that calculates the magnitude of the FFT. The data is read from a global host buffer and copied with cudaMemcpy to each device after cudaSetDevice() is called within the thread. The code looks something like this:

// Global vars
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cufft.h>

int NFFT = 131072;
int NUM_CHANS_GPU = 360;
cufftComplex* globalHostInputBuffer;

// Main
int main() {
FILE* fid = fopen("complexInputData.bin", "rb");
globalHostInputBuffer = (cufftComplex *) calloc(NFFT * NUM_CHANS_GPU, sizeof(cufftComplex));
fread(globalHostInputBuffer, sizeof(cufftComplex), NFFT * NUM_CHANS_GPU, fid);

int rs;
int threadIDs[4];
pthread_t threads[4];
for (int i = 0; i < 4; i++) {
    threadIDs[i] = i;  // pass a stable per-thread copy, not &i (which races)
    rs = pthread_create(&threads[i], NULL, threadFunc, (void *) &threadIDs[i]);
}
for (int i = 0; i < 4; i++)
    pthread_join(threads[i], NULL);

return 0;
}

// Thread function
void* threadFunc(void* threadInput) {
int threadID = *((int *) threadInput);
cudaSetDevice(threadID);

cufftComplex* data;
cudaMalloc((void **) &data, NFFT * NUM_CHANS_GPU * sizeof(cufftComplex));
cudaMemcpy(data, globalHostInputBuffer, NFFT * NUM_CHANS_GPU * sizeof(cufftComplex), cudaMemcpyHostToDevice);

cufftHandle fftPlan;
cufftPlanMany(&fftPlan, 1, &NFFT, 0, 1, NFFT, 0, 1, NFFT, CUFFT_C2C, NUM_CHANS_GPU);
cufftExecC2C(fftPlan, data, data, CUFFT_FORWARD);
return NULL;
}
When I run this code and look at the profiler, I expect to see the cudaMemcpy calls to the device buffers launch simultaneously. However, I see them launch at different times. Watching nvidia-smi -lms, I see the GPUs all spin up at different times. If I remove everything regarding the FFT from the program and keep it threaded, the cudaMemcpy calls happen at the same time.

Is there any reason why the plans would influence the memcpys? I want all of these batched FFTs to run simultaneously. The program is pretty simple and I am at a loss as to why this is occurring. Any help is appreciated.

If all 4 GPUs are hanging off the same PCIe port of the CPU (even if there are intervening switches), the cudaMemcpy operations will serialize.
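You can check how the GPUs are attached with nvidia-smi’s topology matrix:

```shell
# Show the interconnect between each pair of GPUs and the CPU
nvidia-smi topo -m
```

If pairs of GPUs report PIX or PXB (connected through a shared PCIe switch or host bridge) rather than separate root complexes, their host-to-device copies share upstream bandwidth. Note that a K80 board in particular is two GPUs behind one on-board PCIe switch, so copies to the two halves of a single K80 always share one x16 link.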

Furthermore, operations like cudaMalloc may be synchronizing.

In general, for concurrency, my recommendation would be: don’t perform anything in the timing-sensitive or concurrency-sensitive region that isn’t absolutely necessary.

Do all of your cudaMalloc/cudaMemcpy up front, or, if you need to do the copy in the concurrency-sensitive region, then use a proper cudaMemcpyAsync (with pinned host memory, so the copy can actually overlap) and acknowledge that it may still serialize due to your system design characteristics.
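As a sketch of what that would look like in your thread function (error checking omitted; `threadID`, `data`, and `fftPlan` are the names from your code):

```cuda
// Up front, in main: allocate the shared host buffer as pinned memory
// instead of calloc, so cudaMemcpyAsync can run asynchronously:
cudaHostAlloc((void **) &globalHostInputBuffer,
              NFFT * NUM_CHANS_GPU * sizeof(cufftComplex),
              cudaHostAllocDefault);

// Per thread, after cudaSetDevice(threadID):
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(data, globalHostInputBuffer,
                NFFT * NUM_CHANS_GPU * sizeof(cufftComplex),
                cudaMemcpyHostToDevice, stream);
cufftSetStream(fftPlan, stream);     // run the FFT in the same stream
cufftExecC2C(fftPlan, data, data, CUFFT_FORWARD);
cudaStreamSynchronize(stream);       // wait only when you need the result
```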

If you launch threads that do nothing but launch the cufft exec operations, you will have the best chance of witnessing concurrency of the cufft execution.
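Structurally, that means doing all setup sequentially in main and leaving only the exec in the threads. A sketch, where `plans[]` and `devBufs[]` are assumed globals filled in during setup:

```cuda
// Phase 1 (main thread, before pthread_create): for each device i,
// cudaSetDevice(i), cudaMalloc devBufs[i], cudaMemcpy the input,
// and create plans[i] with cufftPlanMany.

// Phase 2 (one thread per device): nothing but the exec.
void* threadFunc(void* threadInput) {
    int threadID = *((int *) threadInput);
    cudaSetDevice(threadID);
    cufftExecC2C(plans[threadID], devBufs[threadID], devBufs[threadID],
                 CUFFT_FORWARD);
    cudaDeviceSynchronize();   // wait for this device's FFT to finish
    return NULL;
}
```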