cudaMemcpyDeviceToHost taking too much time?

Hi All,

Is there any time difference between a cudaMemcpy with cudaMemcpyHostToDevice and one with cudaMemcpyDeviceToHost?

There can be a difference between those two copy directions, but it depends on the motherboard. On the systems I have access to, pageable memory copies are 15% slower in the DeviceToHost direction, compared to HostToDevice.
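If you want to check the two directions on your own board, here is a minimal sketch of how such a comparison could be measured. It uses CUDA events for timing, which is my choice for the sketch and not necessarily how the 15% figure above was obtained. The names timeCopyMs and bandwidthGBps are hypothetical helpers, not part of the CUDA API.

```cuda
#include <cuda_runtime.h>

// Times a single cudaMemcpy of `bytes` bytes with CUDA events.
// Returns the elapsed time in milliseconds, or -1.0f on any CUDA error.
// (Hypothetical helper, written for illustration only.)
float timeCopyMs(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the copy has really finished

    float ms = -1.0f;
    if (err != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Effective bandwidth in GB/s (10^9 bytes per second) for a copy of
// `bytes` bytes that took `ms` milliseconds.
double bandwidthGBps(size_t bytes, float ms) {
    return (double)bytes / (ms * 1.0e6);
}
```

To use it, allocate the host buffer with plain malloc (that gives pageable memory) and the device buffer with cudaMalloc, do one untimed warm-up copy, then call timeCopyMs once with cudaMemcpyHostToDevice and once with cudaMemcpyDeviceToHost on the same size and compare the two results.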

But in my case, cudaMemcpy() takes 0.332 ms to copy data of size 8006008 from host to device, while the same function takes 21.3 ms to copy data of size 240018008 from device to host (which should not be acceptable in CUDA).

Hardware details: Quadro CX, CUDA 2.2 drivers, toolkit and SDK.

Is it because of the large memory size, or what could be the reason?

I am hoping for an answer from someone at NVIDIA.

Make sure you call cudaThreadSynchronize() after you launch your kernel and before you stop the timer for measurement. The kernel launch is asynchronous and will return very quickly, so you are probably only measuring the amount of time to launch the kernel. The memcpy from device to host has an implicit thread sync, so you might be measuring both kernel execution time & memcpy time.
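A minimal sketch of what I mean, assuming a host-side timer around the launch and the copy. The kernel doubleElements and the helper timeKernelAndCopy are hypothetical stand-ins for your own code. The sketch calls cudaDeviceSynchronize(), which is the name used by newer toolkits; on CUDA 2.2 the equivalent call is cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: doubles every element.
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Milliseconds elapsed between two host clock readings.
static double elapsedMs(std::chrono::steady_clock::time_point a,
                        std::chrono::steady_clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

// Runs the kernel on hostData (in place) and reports the kernel time and the
// device-to-host copy time separately. Returns false on any CUDA error.
bool timeKernelAndCopy(float *hostData, int n, double *kernelMs, double *copyMs) {
    size_t bytes = (size_t)n * sizeof(float);
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;

    // Untimed upload, then make sure the device is idle before timing starts.
    bool ok = cudaMemcpy(dev, hostData, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;

    auto t0 = std::chrono::steady_clock::now();
    doubleElements<<<(n + 255) / 256, 256>>>(dev, n);  // asynchronous launch
    // Wait for the kernel to finish BEFORE reading the clock; without this
    // line, t1 - t0 would only cover the launch overhead.
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    auto t1 = std::chrono::steady_clock::now();

    // With the kernel already finished, this interval covers the copy alone.
    ok = ok && cudaMemcpy(hostData, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    auto t2 = std::chrono::steady_clock::now();

    *kernelMs = elapsedMs(t0, t1);
    *copyMs = elapsedMs(t1, t2);
    cudaFree(dev);
    return ok;
}
```

If you remove the synchronize call between the launch and t1, the kernel's run time moves into the copy interval, which would look just like a very slow DeviceToHost copy. The first launch can also include one-time setup cost, so it is worth calling the function twice and looking at the second set of numbers.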