cudaMemcpyDeviceToHost speed: how to improve it?

In my program

cudaMemcpy(host, dev, mem_size, cudaMemcpyDeviceToHost);

runs 100 times slower than the equivalent transfer from host to device.
Moreover, the slowdown grows quadratically as mem_size increases.

Could somebody suggest what the origin of the problem is?

That’s very strange. Copying from device to host should be quite fast. Here are a couple of things to check:

  1. Are you copying large chunks of memory or small chunks? If you’re copying a couple of megabytes with each cudaMemcpy, it should be quite fast. But if you’re copying only a couple of bytes per call, the fixed per-call overhead dominates and the effective throughput will be much lower.

  2. Are the pointers aligned? I’ve never experienced slow copies with CUDA before so I’ve never really looked into it, but you could check and make sure the pointers are aligned. That may or may not make a performance difference.

  3. This is going to sound silly, but make sure the host buffer is actually resident in main memory or cache. If all the programs you have running use more memory than the machine physically has, there is a chance that your data has been temporarily swapped out to the hard drive. If so, when you copy from the device to the host, the computer first has to page that memory back in from the hard drive, resulting in a much slower memory copy. At least 40x slower.
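To check the first point, it can help to time the copies in both directions for a range of sizes. The following is only a rough sketch, not the original poster's code: the helper name timeCopy and the list of sizes are made up for illustration, and error checking is kept minimal. It synchronizes the device before starting the timer so that any device work still pending is not counted as copy time.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: time one blocking cudaMemcpy in milliseconds using CUDA events.
static float timeCopy(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();            // do not count earlier, still-pending device work
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    // Illustrative sizes in bytes, from tiny to fairly large.
    const size_t sizes[] = {64, 64 * 1024, 1 << 20, 16 << 20};

    for (size_t bytes : sizes) {
        void* host = malloc(bytes);     // ordinary pageable host memory
        void* dev = nullptr;
        if (host == nullptr || cudaMalloc(&dev, bytes) != cudaSuccess) {
            fprintf(stderr, "allocation failed for %zu bytes\n", bytes);
            free(host);
            return 1;
        }
        memset(host, 0, bytes);         // touch the pages so they are resident

        float h2d = timeCopy(dev, host, bytes, cudaMemcpyHostToDevice);
        float d2h = timeCopy(host, dev, bytes, cudaMemcpyDeviceToHost);
        printf("%10zu bytes  H2D %8.3f ms  D2H %8.3f ms\n", bytes, h2d, d2h);

        cudaFree(dev);
        free(host);
    }
    return 0;
}
```

If the two directions come out roughly equal here but not in the real program, the difference is probably coming from something around the copy rather than from the copy itself.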

You can try cudaMallocHost to make sure the host buffer is not allocated in pageable memory. Host-to-device and device-to-host memcopies are also faster when you use non-pageable (pinned) memory for storing the data.
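A minimal sketch of that suggestion, under some assumptions: the 32 MiB size and the variable names pageable and pinned are arbitrary choices for illustration, and error handling is reduced to a single check. It copies the same device buffer once into a malloc'ed buffer and once into a cudaMallocHost buffer, and prints the time for each.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t mem_size = 32 << 20;   // 32 MiB, an arbitrary illustrative size

    void* dev = nullptr;
    void* pinned = nullptr;
    void* pageable = malloc(mem_size);  // ordinary pageable host memory

    if (pageable == nullptr ||
        cudaMalloc(&dev, mem_size) != cudaSuccess ||
        cudaMallocHost(&pinned, mem_size) != cudaSuccess) {  // page-locked host memory
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    cudaMemset(dev, 0, mem_size);       // give the device buffer defined contents
    memset(pageable, 0, mem_size);      // touch the pageable pages so they are resident

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    void* targets[2] = {pageable, pinned};
    const char* names[2] = {"pageable", "pinned"};

    for (int i = 0; i < 2; ++i) {
        cudaDeviceSynchronize();        // start timing from an idle device
        cudaEventRecord(start, 0);
        cudaMemcpy(targets[i], dev, mem_size, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // bytes / (ms * 1e6) equals GB/s with 1 GB = 1e9 bytes
        printf("%-8s D2H: %.3f ms (%.2f GB/s)\n", names[i], ms, mem_size / (ms * 1.0e6));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(pinned);               // pinned memory must be released with cudaFreeHost
    cudaFree(dev);
    free(pageable);
    return 0;
}
```

Keep in mind that pinned memory is taken away from what the operating system can page, so allocating too much of it can slow down the rest of the system.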

Read the CUDA programming documentation for further help.