I just noticed that data transfer from device (plain global memory, no texture cache) to host is about 1.3 times faster than data transfer from host to device.
The transfer of 5.1 MB of data from host to device (global memory) takes about 2.39 ms, i.e. roughly 2.34 GB/s.
The transfer of the same amount of data back from device (global memory) to host takes only about 1.81 ms, i.e. roughly 3.10 GB/s.
I'm pretty sure this depends more on your motherboard, its chipset and its BIOS than on the GPU. Others have reported more symmetric values.
You can try timing copies from and to page-locked (pinned) host memory. These will probably be even more asymmetric in your case, since the CPU is then not involved in the copy at all and you are measuring the chipset's DMA path directly.
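If you want to check this on your own board, something along these lines should do. It is only a rough sketch: the buffer size (5,100,000 bytes, standing in for the 5.1 MB above), the repetition count and the helper names `time_copy` and `report` are my own assumptions, not anything from the original measurement. It times `cudaMemcpy` in both directions with CUDA events, once with an ordinary `malloc` buffer and once with a page-locked buffer from `cudaMallocHost`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical helper: average time in ms of one cudaMemcpy of
// `bytes` bytes in direction `kind`, measured over `reps` copies.
static float time_copy(void* dst, const void* src, size_t bytes,
                       cudaMemcpyKind kind, int reps)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaMemcpy(dst, src, bytes, kind));  // warm-up, not timed

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / reps;
}

// Hypothetical helper: print time and bandwidth (1 GB = 1e9 bytes).
static void report(const char* label, float ms, size_t bytes)
{
    std::printf("%-26s %7.3f ms  %6.2f GB/s\n",
                label, ms, bytes / (ms * 1.0e6));
}

int main()
{
    const size_t bytes = 5100000;  // assumption: about 5.1 MB
    const int reps = 20;           // assumption: arbitrary repeat count

    void* h_pageable = std::malloc(bytes);  // ordinary pageable memory
    void* h_pinned = nullptr;               // page-locked memory
    void* d_buf = nullptr;                  // device global memory
    if (h_pageable == nullptr) return EXIT_FAILURE;
    CHECK(cudaMallocHost(&h_pinned, bytes));
    CHECK(cudaMalloc(&d_buf, bytes));
    std::memset(h_pageable, 0, bytes);
    std::memset(h_pinned, 0, bytes);

    report("pageable host -> device",
           time_copy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice, reps), bytes);
    report("pageable device -> host",
           time_copy(h_pageable, d_buf, bytes, cudaMemcpyDeviceToHost, reps), bytes);
    report("pinned host -> device",
           time_copy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, reps), bytes);
    report("pinned device -> host",
           time_copy(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, reps), bytes);

    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_pinned));
    std::free(h_pageable);
    return 0;
}
```

Build it with `nvcc` and compare the two directions for each kind of host buffer. If the gap stays the same or grows with pinned memory, that points at the chipset rather than at the CPU-side staging copy.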