cudaMemcpyDeviceToHost processing time

Why does copying from device to host take longer than copying from host to device?
In my example, copying from host to device takes 13 ms, while copying from device to host takes 213 ms.
Why is there such a difference?

Can you give some more information on your system, as well as the results of bandwidthTest --memory=pinned (it’s an SDK sample)? For reference, I’m getting 4.2 GB/s host to device and 4.56 GB/s device to host on an 8800 GT on an X38 motherboard, so you shouldn’t be seeing that kind of disparity.

I’m working with an FX370 and an Intel Core 2 Quad CPU Q6600.
The code that copies to the device:

cudaMemcpy(ax, hx, 2400 * 1800 * sizeof(short), cudaMemcpyHostToDevice);
cudaMemcpy(ay, hy, 2400 * 1800 * sizeof(short), cudaMemcpyHostToDevice);
cudaMemcpy(az, hz, 2400 * 1800 * sizeof(short), cudaMemcpyHostToDevice);

The code that copies back to the host:

cudaMemcpy(hx, ax, 2400 * 1800 * sizeof(short), cudaMemcpyDeviceToHost);
cudaMemcpy(hy, ay, 2400 * 1800 * sizeof(short), cudaMemcpyDeviceToHost);
cudaMemcpy(hz, az, 2400 * 1800 * sizeof(short), cudaMemcpyDeviceToHost);

I don’t know why there is such a difference between the two times.

I need the specifications for the motherboard, what operating system you’re using, and the results of the bandwidth test before I can tell you anything meaningful.

The motherboard is a “Hewlett-Packard HP xw4600 Workstation” with an “Intel Beachwood X38/X48” chipset.
The operating system is “Microsoft Windows XP Professional”.
The bandwidth test results are 360 MB/s host to device and 487 MB/s device to host.
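
To put these numbers next to the timings quoted earlier, here is a rough sketch (not from the original posts) that converts a transfer time into effective bandwidth and back. It assumes the 13 ms and 213 ms figures each cover all three copies and that 1 MB means 10^6 bytes; the thread confirms neither. The names `kBytesPerArray`, `effective_mb_per_s` and `expected_ms` are made up for illustration.

```cpp
#include <cassert>

// Size of one array from the thread: 2400 x 1800 elements of type short.
constexpr double kBytesPerArray = 2400.0 * 1800.0 * sizeof(short);

// Effective bandwidth in MB/s for `bytes` moved in `ms` milliseconds.
// Assumption: 1 MB = 10^6 bytes.
double effective_mb_per_s(double bytes, double ms) {
    assert(ms > 0.0);
    return (bytes / 1e6) / (ms / 1e3);
}

// Time in milliseconds that moving `bytes` would take at `mb_per_s`.
// Same unit assumption as above.
double expected_ms(double bytes, double mb_per_s) {
    assert(mb_per_s > 0.0);
    return (bytes / 1e6) / mb_per_s * 1e3;
}
```

Under those assumptions the three arrays total 25.92 MB, so `effective_mb_per_s(3 * kBytesPerArray, 13.0)` comes to roughly 1994 MB/s and the 213 ms figure to roughly 122 MB/s. Going the other way, `expected_ms` at 360 MB/s and 487 MB/s gives about 72 ms and 53 ms. The application timings and the bandwidthTest figures do not line up in either direction, so they may not be measuring the same thing.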


Your bandwidth results are way too low. One of the following is likely:

  • not enough power is connected to your card
  • the card is not in a PCI-E x16 slot
  • your system BIOS needs an update

What he said. You should be getting at least eight times that or so on an X38.
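
As a rough illustration of what "eight times that" would mean for the copies in this thread, the sketch below scales the reported bandwidthTest figures and estimates the transfer time for all three arrays. It is not from the original posts. The factor of eight is treated as a loose lower bound, 1 MB is assumed to be 10^6 bytes, and the names `kFactor`, `kTotalBytes` and `transfer_ms` are hypothetical.

```cpp
#include <cassert>

// Bandwidths reported in the thread, in MB/s.
constexpr double kMeasuredH2D = 360.0;
constexpr double kMeasuredD2H = 487.0;

// "At least eight times that or so", taken here as a rough lower bound.
constexpr double kFactor = 8.0;

// Three arrays of 2400 x 1800 shorts, as in the posted cudaMemcpy calls.
constexpr double kTotalBytes = 3.0 * 2400.0 * 1800.0 * sizeof(short);

// Milliseconds needed to move `bytes` at `mb_per_s`.
// Assumption: 1 MB = 10^6 bytes.
double transfer_ms(double bytes, double mb_per_s) {
    assert(mb_per_s > 0.0);
    return bytes / (mb_per_s * 1e6) * 1e3;
}
```

With these inputs, `transfer_ms(kTotalBytes, kFactor * kMeasuredH2D)` is about 9 ms and `transfer_ms(kTotalBytes, kFactor * kMeasuredD2H)` is about 6.7 ms, so on a correctly configured system both directions would be expected to finish within roughly ten milliseconds.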