I am trying to get into CUDA and I’m playing around with some data.
I’m currently trying to run batched cuFFTs on 4 K80 GPUs, where each host thread creates a batched cuFFT plan and executes it on a set of data. After that, a kernel calculates the magnitude of the FFT output. The data is read from a global host buffer and copied to each device with cudaMemcpy after cudaSetDevice() is called within the thread. The code looks something like this:
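// Headers
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cufft.h>  // cufftComplex, cufftPlanMany, cufftExecC2C
// Global vars
int NFFT = 131072;                    // points per FFT
int NUM_CHANS_GPU = 360;              // FFTs batched per GPU
cufftComplex* globalHostInputBuffer;  // shared by all host threads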
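// Main
void* threadFunc(void* threadInput);  // defined below
int main() {
    FILE* fid = fopen("complexInputData.bin", "rb");
    if (fid == NULL) { perror("fopen"); return 1; }
    globalHostInputBuffer = (cufftComplex*) calloc((size_t) NFFT * NUM_CHANS_GPU, sizeof(cufftComplex));
    fread(globalHostInputBuffer, sizeof(cufftComplex), (size_t) NFFT * NUM_CHANS_GPU, fid);
    fclose(fid);
    pthread_t threads[4];
    int threadIDs[4];  // one slot per thread so each thread reads a stable ID
    for (int i = 0; i < 4; i++)
    {
        threadIDs[i] = i;
        int rs = pthread_create(&threads[i], NULL, threadFunc, (void*) &threadIDs[i]);
        if (rs != 0) fprintf(stderr, "pthread_create failed: %d\n", rs);
    }
    // Wait for the workers; returning from main would end the whole process.
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    free(globalHostInputBuffer);
    return 0;
}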
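// Thread function
__global__ void calcFFTmag(const cufftComplex* in, float* mag);  // kernel defined elsewhere; signature assumed
void* threadFunc(void* threadInput) {
    int threadID = *((int*) threadInput);
    cudaSetDevice(threadID);
    size_t numSamples = (size_t) NFFT * NUM_CHANS_GPU;
    cufftComplex* data;
    float* magData;  // output buffer for the magnitudes (float assumed)
    cudaMalloc((void**) &data, numSamples * sizeof(cufftComplex));
    cudaMalloc((void**) &magData, numSamples * sizeof(float));
    cudaMemcpy(data, globalHostInputBuffer, numSamples * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftHandle fftPlan;
    cufftPlanMany(&fftPlan, 1, &NFFT, NULL, 1, NFFT, NULL, 1, NFFT, CUFFT_C2C, NUM_CHANS_GPU);
    cufftExecC2C(fftPlan, data, data, CUFFT_FORWARD);  // in-place forward transform
    // Assumes one thread per sample; the block size must stay within the 1024 threads-per-block limit.
    int threadsPerBlock = 256;
    int numBlocks = (int) ((numSamples + threadsPerBlock - 1) / threadsPerBlock);
    calcFFTmag<<<numBlocks, threadsPerBlock>>>(data, magData);
    cudaDeviceSynchronize();
    cufftDestroy(fftPlan);
    cudaFree(magData);
    cudaFree(data);
    return NULL;
}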
When I run this code and look at the profiler, I expect to see the cudaMemcpy calls to the device buffers launch simultaneously. Instead, they launch at different times. Watching nvidia-smi -lms, I see the GPUs all spin up at different times. If I remove everything related to the FFT from the program and keep it threaded, the cudaMemcpy calls happen at the same time.
Is there any reason why the plans would influence the memcpys? I want all these batched FFTs to run simultaneously. The program is pretty simple and I am at a loss as to why this is occurring. Any help is appreciated.