cuFFT: first cudaMemcpyDeviceToHost call is slow

My research group does optical imaging, where the system generates 2000 arrays of 1440 uint16 elements every 12 ms. This is our raw data, stored in a single 1D array. By modifying example code from the CUDA Toolkit, we managed to zero-pad the raw data to 2000 arrays of 2048 elements (I also changed the data type to Complex here), again stored in a 1D array, transfer them to the GPU with cudaMemcpy (cudaMemcpyHostToDevice), perform the FFT using cufftPlan1d(&plan, 2048, CUFFT_C2C, 2000), and transfer the data back to the host with cudaMemcpy (cudaMemcpyDeviceToHost). All of this works.
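A minimal sketch of this pipeline, for reference. It is not our exact code: the helper names zero_pad and fft_frame are hypothetical, the sizes are the ones quoted above, and I assume each raw sample goes into the real part with the imaginary part left at zero.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

constexpr int kRawLen = 1440;  // samples per raw line
constexpr int kFftLen = 2048;  // zero-padded FFT length
constexpr int kBatch  = 2000;  // lines per 12 ms frame

// Zero-pad every raw uint16 line into a complex line of length kFftLen.
// Assumption: samples go into the real part, imaginary part stays 0.
std::vector<cufftComplex> zero_pad(const std::vector<uint16_t>& raw) {
    std::vector<cufftComplex> padded(static_cast<size_t>(kFftLen) * kBatch,
                                     cufftComplex{0.0f, 0.0f});
    for (int b = 0; b < kBatch; ++b)
        for (int i = 0; i < kRawLen; ++i)
            padded[static_cast<size_t>(b) * kFftLen + i].x =
                static_cast<float>(raw[static_cast<size_t>(b) * kRawLen + i]);
    return padded;
}

// Copy to the GPU, run the batched in-place FFT, copy back.
// Returns true on success; `out` receives kBatch spectra of kFftLen points.
bool fft_frame(const std::vector<uint16_t>& raw, std::vector<cufftComplex>& out) {
    std::vector<cufftComplex> host = zero_pad(raw);
    const size_t bytes = host.size() * sizeof(cufftComplex);

    cufftComplex* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    cufftHandle plan;
    if (cufftPlan1d(&plan, kFftLen, CUFFT_C2C, kBatch) != CUFFT_SUCCESS) {
        cudaFree(dev);
        return false;
    }

    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cufftExecC2C(plan, dev, dev, CUFFT_FORWARD) == CUFFT_SUCCESS;

    out.resize(host.size());
    ok = ok && cudaMemcpy(out.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cufftDestroy(plan);
    cudaFree(dev);
    return ok;
}
```

In a real-time loop the plan and the device buffer would be created once and reused for every frame; they are created inside fft_frame here only to keep the sketch short. It should build with something like nvcc file.cu -lcufft.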

The host-to-device copy (cudaMemcpyHostToDevice) and the FFT are very fast, taking less than 1 ms. The problem is that the device-to-host copy (cudaMemcpyDeviceToHost) is very slow, taking 17 ms. (The transfer bandwidth from device to host is 13 GB/s, so I’m sure that is not the bottleneck.) Because of that, we can’t do the whole processing in real time.

We think this is because the GPU transfers data to the CPU while the CPU is very busy, so it takes the CPU some time to respond to the GPU’s request. We are therefore thinking of creating a CPU thread dedicated to responding to GPU requests. MPI could be a good candidate for this, but I don’t have experience with MPI, so while I read the MPI documentation I would like to ask whether anyone here has an idea of how to do that.

Or, if you know other ways to overcome this overhead, I would highly appreciate your help.

Take a look at the timeline of your code running under the visual profiler (just prepend nvvp to whichever way you are launching your executable). Most likely you’ll find that the actual device-to-host copy is about as fast as the host-to-device copy, but that it is waiting for the cuFFT calculations to finish on the GPU.
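The same effect can be seen with a plain host timer. Below is a minimal sketch, not taken from your code: the function name time_with_host_clock and the HostTimings struct are made up for illustration, the sizes are the ones from your post, and error checking is left out for brevity. It relies on the cuFFT execution call returning to the host once the work is queued, while a cudaMemcpy on the default stream blocks until that work has finished.

```cuda
#include <chrono>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

struct HostTimings { double fft_ms; double copy_ms; };

// Times one batched FFT and the following device-to-host copy with a host clock.
// sync_after_fft == false: the wait for the GPU is charged to the copy.
// sync_after_fft == true:  the wait for the GPU is charged to the FFT.
HostTimings time_with_host_clock(bool sync_after_fft) {
    const int nx = 2048, batch = 2000;  // sizes from the question
    const size_t count = static_cast<size_t>(nx) * batch;
    const size_t bytes = count * sizeof(cufftComplex);
    using clk = std::chrono::steady_clock;
    using ms  = std::chrono::duration<double, std::milli>;

    std::vector<cufftComplex> host(count);
    cufftComplex* dev = nullptr;
    cufftHandle plan;
    cudaMalloc(&dev, bytes);
    cudaMemset(dev, 0, bytes);
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);  // warm-up run, not timed
    cudaDeviceSynchronize();

    auto t0 = clk::now();
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);  // queues GPU work and returns
    if (sync_after_fft) cudaDeviceSynchronize();  // block until the FFT is done
    auto t1 = clk::now();
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);  // waits for the FFT, then copies
    auto t2 = clk::now();

    cufftDestroy(plan);
    cudaFree(dev);
    return { ms(t1 - t0).count(), ms(t2 - t1).count() };
}
```

Calling it once with false and once with true should show the FFT time moving from the copy measurement to the FFT measurement, while the sum stays roughly the same.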

Hi tera,

Thanks for your reply. So you are saying that the 17 ms I measured for the transfer to the host is wrong, and that the 1 ms I measured for the FFT is also wrong, with the true time being much larger than 1 ms? I’m using the clock() function from time.h to measure the elapsed time.

I tried to use the NVIDIA Visual Profiler. First it gave me error code 4087:35, saying “The user does not have permission to profile on the target device.” Following the link it gave, I found that I have to install the NVIDIA Control Panel so that I can grant all users the permission. I installed the driver for my GPU, a Quadro P1000. I first downloaded and installed the standard driver, but the NVIDIA Control Panel only works with the DCH driver, and my system can’t use the DCH driver. So ultimately I can’t use the visual profiler. Do you have other methods to check the timeline?

If the FFT really is slow, switching to a beefier GPU should improve it. But it seems strange to me, since performing 2000 FFTs should be fairly simple, even for my current GPU.

I have solved the problem. Basically, it was because I used the clock() function from time.h to measure GPU time. clock() is not appropriate for this: the FFT is launched asynchronously, so a host timer around the launch does not include the time the GPU spends executing it, and that time shows up in the next blocking call instead. After switching to cudaEvent timing, the new results are 20 ms for the kernel function and 1 ms for the data transfer back to the host. So the real problem is that I need a better GPU. The transfer rates in both directions are the same and are not really my problem, which agrees with @tera’s answer.
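For anyone who finds this later, here is a minimal sketch of the cudaEvent timing. It is a simplified stand-in rather than my actual code: the function name fft_time_ms is made up, the input buffer is just zeros, and most error checks are omitted.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <cufft.h>

// Returns the GPU time in milliseconds for one batched forward C2C FFT,
// measured with CUDA events, or -1 if setup fails.
float fft_time_ms(int nx, int batch) {
    const size_t bytes = sizeof(cufftComplex) * nx * batch;
    cufftComplex* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(dev, 0, bytes);  // dummy input: all zeros

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        cudaFree(dev);
        return -1.0f;
    }
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);  // warm-up run, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                       // marker queued before the FFT
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);
    cudaEventRecord(stop);                        // marker queued after the FFT
    cudaEventSynchronize(stop);                   // wait until the GPU reaches `stop`

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // GPU time between the two markers

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(dev);
    return ms;
}
```

For my case the call would be fft_time_ms(2048, 2000). The events are recorded on the GPU's own timeline, so the result covers the FFT execution itself rather than just the launch.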