My research group is doing optical imaging, where the system generates 2000 1440-element uint16 arrays every 12 ms. This is our raw data, and it is stored in a single 1D array. By modifying the simpleCUFFT.cu example from the CUDA Toolkit, we successfully zero-pad the raw data to 2000 2048-element arrays (converted to the Complex data type), still stored as one 1D array, transfer them to the GPU with cudaMemcpy(..., cudaMemcpyHostToDevice), perform the FFT using cufftPlan1d(&plan, 2048, CUFFT_C2C, 2000), and transfer the results back to the host with cudaMemcpy(..., cudaMemcpyDeviceToHost). All of this works.
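For reference, here is a minimal sketch of that pipeline as I understand it from the description above. The function and variable names are my own assumptions, not the group's actual code, and error checking is omitted; it needs the CUDA Toolkit (cuFFT) to build.

```cpp
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>

// Sketch: zero-pad 2000 x 1440 uint16 records into 2000 x 2048 cufftComplex
// records, run one batched C2C FFT on the GPU, and copy the spectra back.
void process_batch(const unsigned short* raw /* 2000 x 1440 */,
                   cufftComplex* out /* 2000 x 2048 */) {
    const int kBatch = 2000, kIn = 1440, kFft = 2048;
    const size_t kBytes = sizeof(cufftComplex) * kBatch * kFft;

    // Zero-pad on the host: real part = sample, imaginary part = 0.
    // (std::vector value-initializes, so the padding is already zero.)
    std::vector<cufftComplex> padded(static_cast<size_t>(kBatch) * kFft);
    for (int b = 0; b < kBatch; ++b)
        for (int i = 0; i < kIn; ++i)
            padded[b * kFft + i].x = static_cast<float>(raw[b * kIn + i]);

    cufftComplex* d_data = nullptr;
    cudaMalloc(&d_data, kBytes);
    cudaMemcpy(d_data, padded.data(), kBytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, kFft, CUFFT_C2C, kBatch);  // 2000 transforms of length 2048
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaMemcpy(out, d_data, kBytes, cudaMemcpyDeviceToHost);  // the slow step

    cufftDestroy(plan);
    cudaFree(d_data);
}
```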
The host-to-device copy and the FFT are very fast, taking less than 1 ms together. The problem is that the device-to-host copy is very slow, taking 17 ms. (The measured device-to-host bandwidth is 13 GB/s, so I'm sure raw bandwidth is not the bottleneck.) Because of that, we can't do the whole processing in real time.
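A quick sanity check supports the claim that bandwidth alone can't explain 17 ms: the batch is about 32.8 MB, which at 13 GB/s should take roughly 2.5 ms on the wire. (One hedged note: since cudaMemcpy is synchronous, the 17 ms measurement may also be absorbing time spent waiting for the FFT kernels to finish, or pageable-memory staging overhead.)

```cpp
// Ideal wire time in milliseconds for `bytes` over a link of `bw` bytes/s.
double ideal_transfer_ms(double bytes, double bw) {
    return bytes / bw * 1e3;
}

// Batch payload: 2000 arrays x 2048 complex samples x 8 bytes (float2) ≈ 32.8 MB.
const double kBatchBytes = 2000.0 * 2048.0 * 8.0;
```

At 13 GB/s, `ideal_transfer_ms(kBatchBytes, 13e9)` comes out to about 2.5 ms, far below the observed 17 ms.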
We think this is because the GPU is transferring data back while the CPU is very busy, so it takes the CPU some time to respond to the GPU's request. We are therefore thinking of creating a CPU thread dedicated to servicing the GPU. MPI could be a candidate for this, but I don't have any experience with MPI, so while I read its documentation I'd like to ask whether anyone here has an idea of how to do that.
Or, if you know of other ways to overcome this overhead, I would highly appreciate your help.