Batched FFTs not launching concurrently on multiple GPUs

jungj7syw · October 24, 2019, 5:45pm

I am trying to get into CUDA and I’m playing around with some data.

I’m currently trying to run batched cuFFTs on 4 K80 GPUs, where each host thread creates a batched cuFFT plan and executes it on a set of data. After that, a kernel calculates the magnitude of the FFT output. The data is read from a global host buffer and copied to each device with cudaMemcpy after cudaSetDevice() is called within the thread. The code looks something like this:

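// Includes
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cufft.h>  // cufftComplex, cufftPlanMany, cufftExecC2C

// Global vars
int NFFT = 131072;
int NUM_CHANS_GPU = 360;
cufftComplex* globalHostInputBuffer;

// Forward declaration (calcFFTmag kernel definition omitted)
void *threadFunc(void *threadInput);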

// Main
int main() {
    FILE* fid = fopen("complexInputData.bin", "rb");
    globalHostInputBuffer = (cufftComplex *) calloc(NFFT*NUM_CHANS_GPU, sizeof(cufftComplex));
    fread(globalHostInputBuffer, sizeof(cufftComplex), NFFT*NUM_CHANS_GPU, fid);

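    int rs;
    int threadIDs[4];
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
    {
        threadIDs[i] = i;  // give each thread its own ID slot; passing &i races with the loop
        rs = pthread_create(&threads[i], NULL, threadFunc, (void *) &threadIDs[i]);
    }
    for (int i = 0; i < 4; i++)
    {
        pthread_join(threads[i], NULL);  // wait for the workers before main returns
    }

    return 0;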
}

// Thread function
void *threadFunc(void *threadInput) {
    int threadID = *((int *) threadInput);
    cudaSetDevice(threadID);
    cufftComplex* data;
    float* magData;
    cudaMalloc((void**) &data, NFFT*NUM_CHANS_GPU*sizeof(cufftComplex));
    cudaMalloc((void**) &magData, NFFT*NUM_CHANS_GPU*sizeof(float));
    cudaMemcpy(data, globalHostInputBuffer, NFFT*NUM_CHANS_GPU*sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftHandle fftPlan;
    cufftPlanMany(&fftPlan, 1, &NFFT, NULL, 1, NFFT, NULL, 1, NFFT, CUFFT_C2C, NUM_CHANS_GPU);
    cufftExecC2C(fftPlan, data, data, CUFFT_FORWARD);
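    // Every grid/block dimension must be at least 1, and a block holds at most 1024 threads,
    // so each channel is split across NFFT/1024 blocks
    calcFFTmag<<<dim3(NFFT / 1024, NUM_CHANS_GPU, 1), dim3(1024, 1, 1)>>>(data, magData);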
    cufftDestroy(fftPlan);
    cudaFree(data);
    cudaFree(magData);
    return NULL;
}
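
The calcFFTmag kernel itself is not shown. As a point of reference for what it is described as computing, here is a minimal host-side sketch, assuming the magnitude is simply sqrt(re^2 + im^2) for each sample. `Complex32` is a hypothetical stand-in for `cufftComplex` (which stores the real part in `x` and the imaginary part in `y`), and `calcFFTmagHost` is an illustrative name, not part of the program above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cufftComplex: real part in x, imaginary part in y.
struct Complex32 {
    float x;
    float y;
};

// Host-side reference: magnitude |z| = sqrt(x^2 + y^2) for every sample.
// Assumption: the GPU kernel computes this same per-element quantity.
std::vector<float> calcFFTmagHost(const std::vector<Complex32>& data) {
    std::vector<float> mag(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        mag[i] = std::hypot(data[i].x, data[i].y);
    }
    return mag;
}
```

A reference like this could be run on a small slice of the FFT output copied back to the host, to check that each device produces the expected magnitudes regardless of when its copy and FFT were launched.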

When I run this code and look at the profiler, I expect to see the cudaMemcpy calls to the device buffers launch simultaneously. However, they launch at different times. Watching nvidia-smi -lms, I see the GPUs all spin up at different times. If I remove everything related to the FFT from the program and keep it threaded, the cudaMemcpy calls happen at the same time.

Is there any reason why the plans would influence the cudaMemcpy calls? I want all these batched FFTs to run simultaneously. The program is pretty simple and I am at a loss as to why this is occurring. Any help is appreciated.
