cudaMemcpy host->device and device->host speed

Sengin · January 13, 2011, 9:10pm

Hello everyone,
While running my CUDA application, I’ve noticed that it takes almost no time at all to copy memory to the device (less than 1 ms for 356KB), but copying memory back to the host is very slow (~47 ms for only 4KB). This just seems really slow to me. I am using streams that write to the same memory, but at the time of the copy I make sure that no stream will write to that small region of memory. Does memcpy take into account how streams may or may not use the memory it’s copying?

I am not using cudaMemcpyAsync to copy the memory to the host because I need to know as soon as I get it. And yes, when I tested the memcpy, I did not use the async version for timing. I know it’s not just a timing issue because the overall execution of the program is slow, and I have narrowed it down to that section. So whether or not it’s exactly ~47 ms doesn’t matter. I am just wondering why it is so much slower (at least 47 times slower for 1/89 of the data) and whether there’s anything I could do to speed it up (besides waiting a bit longer and copying more data with one memcpy).

Thanks!

tera · January 13, 2011, 10:36pm

How exactly do you measure these times? Are you sure your host->device times include the time for the actual copying, and that the device->host time does not include the time of the previous kernel/host->device copy/whatever?

What OS are you on? Apparently calls get batched under Windows, which could also be an explanation.

Sengin · January 14, 2011, 4:04am

I’m measuring the times using the Windows function GetTickCount(). However, that’s not my only evidence. The device is streaming audio data to the host, where each kernel is 1 block of 512 threads and each thread processes one sample of data, and the streamed audio is really choppy (so choppy that you can’t distinguish anything). I’ve timed all the other sections of the host code, and they all come out at less than 1 ms. The copy is the only statement that gets a number other than 0, every time (usually 47 ms but occasionally 63 ms). As for being sure that it includes the actual copying: I think so. I have a cudaThreadSynchronize(stream) before the memcpy, and I’ve timed the time spent waiting there as well; it came back as 0 ms. (From experiments, because the data each kernel processes is so small, each kernel takes less than a millisecond to process the 512 samples.)

And yeah, I’m on Windows XP. I’ve tried it on Windows 7 as well, same result (though I didn’t really expect much of a difference).
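For context on the measurement question: GetTickCount() is documented as having a resolution of roughly 10 to 16 ms, so readings of 0, 47 and 63 ms are consistent with whole timer ticks rather than precise durations. A finer-grained measurement could be sketched with CUDA events as below. This is only an illustration, not the code used in this thread: the helper name `timeDeviceToHostCopy` is hypothetical, and it uses `cudaDeviceSynchronize()`, the current replacement for `cudaThreadSynchronize()`.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): returns the time in
// milliseconds spent in one device->host cudaMemcpy of `bytes` bytes,
// or a negative value if a CUDA call fails.
float timeDeviceToHostCopy(void* hostDst, const void* devSrc, size_t bytes)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Drain every previously queued kernel and copy first, so that their
    // run time is not attributed to the copy being measured.
    cudaError_t err = cudaDeviceSynchronize();

    float ms = -1.0f;
    if (err == cudaSuccess) {
        cudaEventRecord(start, 0);
        err = cudaMemcpy(hostDst, devSrc, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // wait until the stop event has completed
        if (err == cudaSuccess) {
            cudaEventElapsedTime(&ms, start, stop);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling this with the 4KB result buffer and comparing the value against the GetTickCount() reading would show how much of the ~47 ms is the transfer itself and how much is waiting for earlier work.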

tera · January 14, 2011, 4:41pm

Are you aware that kernel calls and (in certain cases) host->device copies are asynchronous? The device->host copy is the only call which has a visible effect on the host and as such cannot be executed asynchronously. The 0 ms measurements you get seem to indicate to me that you have mostly just been measuring the time for asynchronously scheduling the work for execution, but not the time to actually execute it.
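The difference between scheduling work and executing it can be made visible with a host timer placed on both sides of a synchronization call. The following is a minimal sketch under stated assumptions: `busyKernel`, `LaunchTiming` and `measureLaunchVersusWait` are hypothetical names introduced for illustration, and the kernel is just an arbitrary loop that keeps the GPU busy.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel that only exists to keep the GPU busy for a while.
__global__ void busyKernel(float* data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) {
            v = v * 1.0001f + 0.5f;
        }
        data[i] = v;
    }
}

struct LaunchTiming {
    double launchMs;  // host time spent inside the launch call itself
    double waitMs;    // host time spent waiting for the kernel to finish
};

// Times the kernel launch and the following synchronization separately.
LaunchTiming measureLaunchVersusWait(float* devData, int n, int iters)
{
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    Clock::time_point t0 = Clock::now();
    busyKernel<<<(n + 511) / 512, 512>>>(devData, n, iters);  // returns immediately
    Clock::time_point t1 = Clock::now();
    cudaDeviceSynchronize();  // blocks until the kernel has actually run
    Clock::time_point t2 = Clock::now();

    LaunchTiming result;
    result.launchMs = Ms(t1 - t0).count();
    result.waitMs = Ms(t2 - t1).count();
    return result;
}
```

If the synchronization call is removed, the waiting time does not disappear; it is simply charged to the next blocking call, which is often the device->host `cudaMemcpy`.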

Sengin · January 14, 2011, 6:35pm

I am aware that kernel calls are asynchronous (which is why in my testing I used cudaStreamSynchronize() or cudaThreadSynchronize()), but I was under the impression that cudaMemcpy() was synchronous unless you explicitly call cudaMemcpyAsync() - or at least, that’s what I thought I read. But either way, I’m not concerned how long it takes to copy from device to host or host to device, just that it seems to be taking a long time to get back to the host for a very small data chunk. I just thought it was faster than that, so that’s why I was asking (and if there was anything other than copying more data at once to speed it up).
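One possible explanation, offered here as an assumption rather than a diagnosis of this program: a plain `cudaMemcpy()` is issued on the legacy default stream, which synchronizes with the other (blocking) streams, so the copy may have to wait for all work queued in every stream and not only for the kernel that produced the 4KB being copied. A sketch of an alternative is to issue the copy in the same stream as the producing kernel, into page-locked host memory, and then wait on that one stream. The names `processSamples` and `processAndFetch` are hypothetical, and the kernel body is a placeholder for the real audio processing.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-block audio kernel described above.
__global__ void processSamples(float* samples, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        samples[i] *= 0.5f;  // placeholder for the real per-sample work
    }
}

// Runs the kernel and the device->host copy in `stream`, then blocks the
// host until that stream (and only that stream) has finished.
// `pinnedHost` is assumed to be page-locked memory from cudaMallocHost(),
// which lets cudaMemcpyAsync() run truly asynchronously.
cudaError_t processAndFetch(float* devSamples, float* pinnedHost, int n,
                            cudaStream_t stream)
{
    processSamples<<<(n + 511) / 512, 512, 0, stream>>>(devSamples, n);

    cudaError_t err = cudaMemcpyAsync(pinnedHost, devSamples,
                                      n * sizeof(float),
                                      cudaMemcpyDeviceToHost, stream);
    if (err != cudaSuccess) {
        return err;
    }

    // The host still learns about the data as soon as it arrives, but it
    // does not wait for unrelated work in other streams.
    return cudaStreamSynchronize(stream);
}
```

The host buffer would be allocated once with `cudaMallocHost()` and released with `cudaFreeHost()`; whether this removes the delay depends on what the other streams are doing at the time of the copy.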

Sakthi · March 19, 2014, 1:02pm

I’m having the same problem too. I’m using a 12MB file. The copy to the device is almost instantaneous, while copying back from the device takes 5+ hours.

Jnesp · April 29, 2014, 11:19am

I have opened a new topic, “Low Memory Throughput (D2H)”, because I have a similar problem.

When I transfer data from device to host, I can see in the NVIDIA Visual Profiler that the throughput of the transfer is very low, and I lose all the time improvement I previously gained in my algorithm by using streams.

Any ideas?
