CUFFT cudaMemCpyDeviceToHost first call is slow

shuaibin · June 27, 2019, 6:24pm

My research group does optical imaging, and our system generates 2000 arrays of 1440 uint16 elements every 12 ms. This raw data is stored in a single 1D array. By modifying the simpleCUFFT.cu example from the CUDA Toolkit, we managed to zero-pad the raw data to 2000 arrays of 2048 elements (I also changed the Complex data type here), again stored in a single 1D array, transfer them to the GPU with cudaMemcpyHostToDevice, perform the FFT using cufftPlan1d(&plan, 2048, CUFFT_C2C, 2000), and transfer the data back to the host with cudaMemcpyDeviceToHost. All of this works.
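For readers following along, the pipeline described above might look roughly like the sketch below. It is not the poster's actual code: the helper name zeroPad, the placeholder input data, the use of cufftComplex instead of the modified Complex type, and the in-place transform are all assumptions made for illustration.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

constexpr int kBatch  = 2000;  // arrays per acquisition (from the post)
constexpr int kRawLen = 1440;  // samples per raw array (from the post)
constexpr int kFftLen = 2048;  // zero-padded FFT length (from the post)

// Hypothetical helper: copy each 1440-sample row into a 2048-element complex
// row. std::vector value-initializes its elements, so the padded tail and all
// imaginary parts are already zero.
std::vector<cufftComplex> zeroPad(const std::vector<uint16_t>& raw) {
    std::vector<cufftComplex> padded(static_cast<size_t>(kBatch) * kFftLen);
    for (int b = 0; b < kBatch; ++b) {
        for (int i = 0; i < kRawLen; ++i) {
            padded[static_cast<size_t>(b) * kFftLen + i].x =
                static_cast<float>(raw[static_cast<size_t>(b) * kRawLen + i]);
        }
    }
    return padded;
}

int main() {
    // Placeholder input (assumption): every raw sample is 1.
    std::vector<uint16_t> raw(static_cast<size_t>(kBatch) * kRawLen, 1);
    std::vector<cufftComplex> host = zeroPad(raw);
    const size_t bytes = host.size() * sizeof(cufftComplex);

    // Error checking is omitted for brevity; real code should check every call.
    cufftComplex* dev = nullptr;
    cudaMalloc(&dev, bytes);

    // One plan for 2000 transforms of length 2048, stored back to back.
    cufftHandle plan;
    cufftPlan1d(&plan, kFftLen, CUFFT_C2C, kBatch);

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);  // in-place forward FFT
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    std::printf("DC bin of first row: %f\n", host[0].x);

    cufftDestroy(plan);
    cudaFree(dev);
    return 0;
}
```

Built with something like nvcc and linked against cuFFT (-lcufft), this performs the same sequence of steps as the post: pad, copy in, batched FFT, copy out.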

The cudaMemcpyHostToDevice copy and the FFT are very fast, taking less than 1 ms. The problem is that cudaMemcpyDeviceToHost is very slow, taking 17 ms. (The device-to-host transfer bandwidth is 13 GB/s, so I’m sure this is not the bottleneck.) Because of that, we can’t do the whole processing in real time.

We think this is because the GPU is transferring data to the CPU while the CPU is very busy, so it takes the CPU some time to respond to the request from the GPU. We are therefore thinking of creating a CPU thread dedicated to responding to GPU requests. MPI could be a good candidate for this, but I don’t have experience with MPI. While I read the MPI documentation, I would like to ask whether anyone here has an idea of how to do that.

Or, if you know of other ways to overcome this overhead, I would highly appreciate your help.

tera · June 27, 2019, 7:14pm

Take a look at the timeline of your code running under the visual profiler (just prepend nvvp to whichever way you are launching your executable). Most likely you’ll find that the actual device-to-host copy is about as fast as the host-to-device copy, but it is waiting for the cuFFT calculations to finish on the GPU.
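The effect described here can be reproduced without cuFFT. The sketch below is only an illustration under stated assumptions: busyKernel is a made-up stand-in for the FFT, and the sizes and iteration count are arbitrary. It times the same kernel and copy twice with a host clock, first without and then with an explicit cudaDeviceSynchronize() before the clock is read.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the FFT: a kernel that just burns GPU time.
__global__ void busyKernel(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0000001f + 1e-7f;
        data[i] = v;
    }
}

using Clock = std::chrono::steady_clock;

static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const int n = 1 << 20;      // arbitrary size (assumption)
    const int iters = 20000;    // arbitrary amount of work (assumption)
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Error checking is omitted for brevity.
    float* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemset(dev, 0, bytes);
    std::vector<float> host(n);

    // Warm-up so one-time initialization does not distort the numbers.
    busyKernel<<<blocks, threads>>>(dev, n, 1);
    cudaDeviceSynchronize();

    // 1) Naive host timing: the launch returns before the kernel finishes,
    //    so the blocking copy that follows absorbs the kernel's run time.
    auto t0 = Clock::now();
    busyKernel<<<blocks, threads>>>(dev, n, iters);
    double naiveKernelMs = msSince(t0);
    t0 = Clock::now();
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    double naiveCopyMs = msSince(t0);

    // 2) Same work, but wait for the GPU before reading the host clock.
    t0 = Clock::now();
    busyKernel<<<blocks, threads>>>(dev, n, iters);
    cudaDeviceSynchronize();
    double syncedKernelMs = msSince(t0);
    t0 = Clock::now();
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    double syncedCopyMs = msSince(t0);

    std::printf("naive : kernel %.3f ms, copy %.3f ms\n", naiveKernelMs, naiveCopyMs);
    std::printf("synced: kernel %.3f ms, copy %.3f ms\n", syncedKernelMs, syncedCopyMs);

    cudaFree(dev);
    return 0;
}
```

In the naive measurement the kernel should look almost free and the copy expensive; with the synchronization in place the time should move back to the kernel, which is the pattern suspected in this thread.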

shuaibin · June 27, 2019, 8:41pm

Hi tera,

Thanks for your reply. So you are saying that the 17 ms I measured for the transfer to the host is wrong, and that the 1 ms I measured for the FFT is also wrong, with the true FFT time being much larger than 1 ms. I’m using the clock() function from time.h to measure the elapsed time.

I tried to use the NVIDIA Visual Profiler. At first it gave me error code 4087:35, saying “The user does not have permission to profile on the target device.” Following the link it gave, http://developer.nvidia.com/ERR_NVGPUCTRPERM, I found that I have to install the NVIDIA Control Panel so that I can grant all users the permission. I installed the driver for my GPU, a Quadro P1000. I first downloaded and installed the standard driver, but the NVIDIA Control Panel only works with the DCH driver, and my system can’t use the DCH driver. So ultimately I can’t use the Visual Profiler. Do you know of other methods to check the timeline?

If the FFT really is slow, switching to a beefier GPU should improve it. But it seems strange to me, since performing 2000 FFTs should be fairly simple, even for my current GPU.

shuaibin · July 1, 2019, 8:34pm

I have solved the problem. Basically, it was because I used the clock() function from time.h to measure the GPU time, and clock() is not appropriate for that. After switching to cudaEvent timing, the new results are 20 ms for the kernel function and 1 ms for the data transfer back to the host. So the real problem is that I need a better GPU. The transfer rates in both directions are the same and are not really my problem, which confirms @tera’s answer.
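A minimal sketch of cudaEvent-based timing for the FFT and the device-to-host copy could look like this. It is not the poster's code: the variable names, the zero-filled input, and the warm-up call are assumptions, and everything runs on the default stream.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int fftLen = 2048, batch = 2000;  // sizes from the original post
    const size_t count = static_cast<size_t>(fftLen) * batch;
    const size_t bytes = count * sizeof(cufftComplex);

    // Error checking is omitted for brevity.
    cufftComplex* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemset(dev, 0, bytes);              // placeholder input (assumption)
    std::vector<cufftComplex> host(count);

    cufftHandle plan;
    cufftPlan1d(&plan, fftLen, CUFFT_C2C, batch);

    // Warm-up run (assumption): keeps one-time setup cost out of the timing.
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cudaEvent_t start, afterFft, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterFft);
    cudaEventCreate(&afterCopy);

    // Events are recorded in stream order on the GPU, so the intervals
    // reflect when the work actually ran, not when the host issued it.
    cudaEventRecord(start);
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);
    cudaEventRecord(afterFft);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);
    cudaEventSynchronize(afterCopy);        // wait until the last event has occurred

    float fftMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&fftMs, start, afterFft);
    cudaEventElapsedTime(&copyMs, afterFft, afterCopy);
    std::printf("FFT: %.3f ms, device-to-host copy: %.3f ms\n", fftMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterFft);
    cudaEventDestroy(afterCopy);
    cufftDestroy(plan);
    cudaFree(dev);
    return 0;
}
```

Because the elapsed times are taken between GPU-side events, the FFT and the copy are each charged with their own duration, unlike a host clock read right after an asynchronous call.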
