cudaMemcpy host->device and device->host speed

Hello everyone,
While running my CUDA application, I’ve noticed that copying memory to the device takes almost no time at all (less than 1 ms for 356 KB), but copying memory back to the host is very slow (~47 ms for only 4 KB). This seems really slow to me. I am using streams that write to the same memory, but at the time of the copy I make sure that no stream will write to that small region. Does cudaMemcpy take into account how streams may or may not use the memory it’s copying? I am not using cudaMemcpyAsync to copy the memory to the host because I need to know as soon as the data arrives. And yes, when I timed the copy, I used the synchronous version, not the async one. I know it’s not just a measurement artifact, because the overall execution of the program is slow and I have narrowed it down to that section. So whether or not it’s exactly ~47 ms doesn’t matter. I am just wondering why it is so much slower (at least 47 times longer for 1/89 of the data) and whether there’s anything I can do to speed it up, besides waiting a bit longer and copying more data per memcpy.

Thanks!

How exactly do you measure these times? Are you sure your host->device time includes the time for the actual copy, and that the device->host time does not include the time of the preceding kernel, host->device copy, or whatever else was queued before it?

What OS are you on? Apparently calls get batched under Windows, which could also be an explanation.

I’m measuring the times using the Windows function GetTickCount(). However, that’s not the only evidence. The device is streaming audio data to the host: each kernel launch is 1 block of 512 threads, each thread processes one sample, and the streamed audio is so choppy that you can’t distinguish anything. I’ve timed all the other sections of the host code, and they all come out at less than 1 ms. The copy is the only statement that reports something other than 0, every time (usually 47 ms but occasionally 63 ms). As for whether the measurement includes only the actual copying: I think so. I call cudaStreamSynchronize(stream) before the memcpy, and I’ve timed the wait there as well. It came back as 0 ms, which matches my experiments: because each kernel handles so little data, it takes less than a millisecond to process the 512 samples.

And yeah, I’m on Windows XP. I’ve tried it on Windows 7 as well, same result (though I didn’t really expect much of a difference).
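One way to take the host timer out of the picture is to time the copy with CUDA events. GetTickCount() is documented as having a resolution of typically 10 to 16 ms, and readings of 0, 47 and 63 ms are what zero, three and four ticks of roughly 15.6 ms would look like, so the real durations may differ from what the numbers suggest. The following is only a minimal sketch, not the poster's code: the buffer names and the CHECK macro are made up for illustration, and cudaDeviceSynchronize() is used in place of the older cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: abort main() with a message if a CUDA call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            return 1;                                              \
        }                                                          \
    } while (0)

int main() {
    const size_t bytes = 4 * 1024;   // 4 KB, the size from the question
    static char h_buf[4 * 1024];     // ordinary pageable host buffer
    char *d_buf = nullptr;

    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaMemset(d_buf, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Drain everything queued earlier so it is not charged to the copy.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));   // wait until 'stop' has been reached

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("4 KB device->host copy: %.3f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    return 0;
}
```

If the event-based figure is far below 47 ms, the missing time is probably being spent waiting for earlier work rather than in the transfer itself.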

Are you aware that kernel calls and (in certain cases) host->device copies are asynchronous? The device->host copy is the only call which has a visible effect on the host and as such cannot be executed asynchronously. The 0 ms measurements you get seem to indicate to me that you have mostly just been measuring the time for asynchronously scheduling the work for execution, but not the time to actually execute it.
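The effect described here can be shown with a small experiment. This is a sketch under assumptions: busyKernel is a made-up stand-in that simply spins for a number of GPU clock cycles, and the cycle count is arbitrary. A blocking cudaMemcpy is issued to the default stream, and with the legacy default stream it also waits for work queued earlier in other (blocking) streams, so any kernel time that has not been waited for ends up inside the copy's measurement.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for real work: spin for about 'cycles' GPU clock ticks.
__global__ void busyKernel(int *out, long long cycles) {
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
    out[threadIdx.x] = threadIdx.x;
}

// Milliseconds elapsed on the host since t0.
static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n = 512;
    const long long spin = 50000000LL;   // arbitrary amount of busy work
    int h_out[n];
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    // Warm-up launch so one-time setup cost is not measured below.
    busyKernel<<<1, n>>>(d_out, 0);
    cudaDeviceSynchronize();

    // Case 1: no synchronization between the launch and the copy.
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<1, n>>>(d_out, spin);   // returns before the kernel finishes
    double launchMs = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    double copyPlusKernelMs = msSince(t0);   // includes waiting for the kernel

    // Case 2: wait for the kernel first, then time the copy alone.
    t0 = std::chrono::steady_clock::now();
    busyKernel<<<1, n>>>(d_out, spin);
    cudaDeviceSynchronize();
    double kernelMs = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    double copyOnlyMs = msSince(t0);

    std::printf("launch only:          %.3f ms\n", launchMs);
    std::printf("copy without sync:    %.3f ms\n", copyPlusKernelMs);
    std::printf("kernel with sync:     %.3f ms\n", kernelMs);
    std::printf("copy after sync:      %.3f ms\n", copyOnlyMs);

    cudaFree(d_out);
    return 0;
}
```

Comparing the second and fourth numbers shows how much of an apparently slow copy is really kernel execution.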

I am aware that kernel calls are asynchronous (which is why I used cudaStreamSynchronize() or cudaThreadSynchronize() in my testing), but I was under the impression that cudaMemcpy() is synchronous unless you explicitly call cudaMemcpyAsync(), or at least that’s what I thought I read. Either way, I’m not concerned with how the host->device and device->host times compare, just that it seems to take a long time to get a very small chunk of data back to the host. I thought it would be faster than that, which is why I was asking, and whether there is anything other than copying more data at once that would speed it up.
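One option that still lets the host know as soon as the data has arrived is to issue the copy with cudaMemcpyAsync() in the same stream as the kernel that produces the data, into page-locked (pinned) host memory, and then wait on that stream alone. The sketch below assumes a single stream and a placeholder kernel; processSamples, the buffer names and the 512-sample size are illustrative, not taken from the poster's program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder processing: one thread per sample.
__global__ void processSamples(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * in[i];
}

int main() {
    const int n = 512;                       // one block of 512 samples
    const size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr; // pinned host buffers
    float *d_in = nullptr, *d_out = nullptr; // device buffers
    cudaStream_t stream;

    // Pinned host memory is needed for copies to be truly asynchronous.
    if (cudaMallocHost(&h_in, bytes) != cudaSuccess) return 1;
    if (cudaMallocHost(&h_out, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return 1;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    // All three operations are queued in the same stream, so they run in order.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    processSamples<<<1, n, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

    // Blocks only until this stream's work is done, not work in other streams.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "stream failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("first processed sample: %f\n", h_out[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

With several streams in flight, each one can be waited on separately this way, so a finished block of audio does not have to wait behind kernels queued in the other streams.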

I’m having the same problem too. I’m using a 12 MB file. The copy to the device is almost instantaneous, while copying back from the device takes more than 5 hours.

I have opened a new topic, “Low Memory Throughput (D2H)”, because I have a similar problem.

When I transfer data from device to host, the NVIDIA Visual Profiler shows that the throughput of the transfer is very low, and I lose all of the speedup I had previously gained by using streams in my algorithm.

Any ideas?