Batched FFTs not launching concurrently on multiple GPUs

I am trying to get into CUDA and I’m playing around with some data.

I’m currently trying to run batched cuFFTs on 4 K80 GPUs. Each host thread calls cudaSetDevice(), copies its data from a global host buffer to its device with cudaMemcpy, creates a batched cuFFT plan with cufftPlanMany, and executes it. After that, a kernel calculates the magnitude of the FFT output. The code looks something like this:

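// Includes
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cufft.h>   // cufftComplex, cufftHandle, cufftPlanMany, cufftExecC2C

// Global vars
int NFFT = 131072;
int NUM_CHANS_GPU = 360;
cufftComplex* globalHostInputBuffer;

// Forward declarations (the calcFFTmag kernel body is not shown in this post)
void *threadFunc(void *threadInput);
__global__ void calcFFTmag(cufftComplex *data, float *magData);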

// Main
int main() {
    FILE* fid = fopen("complexInputData.bin", "rb");  // string literals, binary mode
    globalHostInputBuffer = (cufftComplex *) calloc(NFFT*NUM_CHANS_GPU, sizeof(cufftComplex));
    fread(globalHostInputBuffer, sizeof(cufftComplex), NFFT*NUM_CHANS_GPU, fid);

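    int rs;
    int threadIDs[4];  // one slot per thread, so each thread reads a stable value
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
    {
        threadIDs[i] = i;
        rs = pthread_create(&threads[i], NULL, threadFunc, (void *) &threadIDs[i]);
        if (rs != 0) { fprintf(stderr, "pthread_create failed: %d\n", rs); return 1; }
    }

    // Wait for the workers; returning from main early would end the process.
    for (int i = 0; i < 4; i++)
    {
        pthread_join(threads[i], NULL);
    }

    fclose(fid);
    free(globalHostInputBuffer);
    return 0;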
}

// Thread function
void *threadFunc(void *threadInput) {
    int threadID = *((int *) threadInput);
    cudaSetDevice(threadID);
    cufftComplex* data;
    float* magData;
    cudaMalloc((void**) &data, NFFT*NUM_CHANS_GPU*sizeof(cufftComplex));
    cudaMalloc((void**) &magData, NFFT*NUM_CHANS_GPU*sizeof(float));
    cudaMemcpy(data, globalHostInputBuffer, NFFT*NUM_CHANS_GPU*sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftHandle fftPlan;
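    // NULL inembed/onembed selects the default contiguous layout
    cufftPlanMany(&fftPlan, 1, &NFFT, NULL, 1, NFFT, NULL, 1, NFFT, CUFFT_C2C, NUM_CHANS_GPU);
    cufftExecC2C(fftPlan, data, data, CUFFT_FORWARD);  // in-place forward transform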
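    // A block is limited to 1024 threads and no launch dimension may be 0, so
    // NFFT threads per block cannot work. This launch assumes calcFFTmag handles
    // one element per thread via a global index (the element count divides by 256).
    int totalElems = NFFT * NUM_CHANS_GPU;
    calcFFTmag<<<totalElems / 256, 256>>>(data, magData);
    cudaDeviceSynchronize();
    cufftDestroy(fftPlan);
    cudaFree(data);
    cudaFree(magData);
    return NULL;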
}
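
A side note on the magnitude kernel's launch configuration: a CUDA block can hold at most 1024 threads, and every grid and block dimension must be at least 1, so a 131072-point FFT cannot map to a single block. Here is a minimal sketch of the usual ceil-division arithmetic in plain host C. The names blocksFor and MAX_THREADS_PER_BLOCK are hypothetical, and it assumes the kernel processes one element per thread.

```c
#include <stdio.h>

#define MAX_THREADS_PER_BLOCK 1024  /* per-block thread limit on current CUDA GPUs */

/* Number of blocks needed so that blocks * threadsPerBlock >= totalElems.
 * Returns -1 if the requested block size is not launchable. */
long long blocksFor(long long totalElems, int threadsPerBlock) {
    if (threadsPerBlock < 1 || threadsPerBlock > MAX_THREADS_PER_BLOCK) {
        return -1;
    }
    return (totalElems + threadsPerBlock - 1) / threadsPerBlock;
}
```

With NFFT = 131072 and NUM_CHANS_GPU = 360, blocksFor(131072LL * 360, 256) gives the grid size for 256-thread blocks. The kernel would then compute its element index as blockIdx.x * blockDim.x + threadIdx.x.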

When I run this code and look at the profiler, I expect to see the cudaMemcpy calls to the device buffers start simultaneously. Instead, they start at different times. Watching nvidia-smi -lms, I also see the GPUs all spin up at different times. If I remove everything related to the FFT from the program and keep it threaded, the cudaMemcpy calls happen at the same time.
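
One diagnostic I could try is to have every thread finish its per-device setup and then wait on a barrier just before the copy, so that all copies are issued at about the same moment and any remaining stagger is not caused by uneven setup time. Below is a minimal host-only sketch of that idea using pthread_barrier_t. The names worker, run_aligned_workers, and NUM_WORKERS are hypothetical, and the comments mark where the CUDA and cuFFT calls would go. It is a way to narrow the problem down under those assumptions, not a confirmed fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_WORKERS 4  /* hypothetical: one host thread per GPU */

static pthread_barrier_t startBarrier;
static double passedAt[NUM_WORKERS];  /* time each thread left the barrier */

static double nowSeconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

static void *worker(void *arg) {
    int id = *((int *) arg);
    assert(id >= 0 && id < NUM_WORKERS);
    /* Setup would go here: cudaSetDevice(id), cudaMalloc, cufftPlanMany. */
    pthread_barrier_wait(&startBarrier);  /* blocks until all workers arrive */
    passedAt[id] = nowSeconds();
    /* The cudaMemcpy and cufftExecC2C calls would go here. */
    return NULL;
}

/* Starts the workers and returns how many passed the barrier, or -1 on error.
 * A real program would also have to release already-started threads if a
 * later pthread_create fails; this sketch does not handle that case. */
int run_aligned_workers(void) {
    pthread_t threads[NUM_WORKERS];
    int ids[NUM_WORKERS];
    int passed = 0;

    if (pthread_barrier_init(&startBarrier, NULL, NUM_WORKERS) != 0) {
        return -1;
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        passedAt[i] = 0.0;
        if (pthread_create(&threads[i], NULL, worker, &ids[i]) != 0) {
            return -1;
        }
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
        if (passedAt[i] > 0.0) {
            passed++;
        }
    }
    pthread_barrier_destroy(&startBarrier);
    return passed;
}
```

Calling run_aligned_workers() from main and printing the passedAt values shows how closely the threads are released. If the copies still start at different times in the profiler with the barrier in place, the delay is happening inside the copy or the driver rather than in my setup code.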

Is there any reason why the cuFFT plans would influence the memcpys? I want all of these batched FFTs to run simultaneously. The program is pretty simple and I am at a loss as to why this is occurring. Any help is appreciated.