When I transfer an array of size X from host to device, the transfer is faster than the device-to-host transfer of the same array. Why is this so?
I am using plain cudaMemcpy only, with no pinned memory and no async copies.
How large is the difference? Some difference is normal and varies from motherboard to motherboard, but if it is large, you might have a problem. (I have seen differences of about 20% between host-to-device and device-to-host. Never did figure out why, though.)
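To put a number on it, something like the following rough sketch could be used. It times plain cudaMemcpy in both directions with an ordinary malloc'd (pageable) host buffer, matching the setup described in the question. The 64 MiB buffer size, the 20 repetitions, and the helper name `bandwidthGBs` are arbitrary choices for illustration, not anything from the original post.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical helper: average bandwidth in GB/s over `reps` copies.
static double bandwidthGBs(void *dst, const void *src, size_t bytes,
                           cudaMemcpyKind kind, int reps)
{
    // One untimed copy so first-touch and driver setup costs are excluded.
    CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, kind));
    // Make sure all copies have really finished before stopping the clock.
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * reps / seconds / 1e9;
}

int main()
{
    const size_t bytes = 64u << 20;  // 64 MiB, arbitrary test size
    const int reps = 20;             // arbitrary repetition count

    // Pageable host memory (plain malloc), i.e. no pinning.
    char *h_buf = (char *)malloc(bytes);
    if (h_buf == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    memset(h_buf, 1, bytes);

    char *d_buf = NULL;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    double h2d = bandwidthGBs(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, reps);
    double d2h = bandwidthGBs(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, reps);

    printf("host-to-device: %.2f GB/s\n", h2d);
    printf("device-to-host: %.2f GB/s\n", d2h);
    printf("relative difference: %.1f%%\n", 100.0 * (h2d - d2h) / h2d);

    CHECK(cudaFree(d_buf));
    free(h_buf);
    return 0;
}
```

Build it with nvcc and run it a few times, since the numbers can vary from run to run. Posting the two bandwidth figures along with the GPU and motherboard would make it easier to say whether the gap is in the normal range.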