As far as I remember, PCIe 1.0 x16 can reach 4 GB/s when transferring data between the GPU and the CPU.
But in my experiment, it takes 0.322 ms to transfer a 368x288 image (32 bits per pixel, 423936 bytes) from the GPU to the CPU, which means the speed is only about 1.3 GB/s.
I allocate the GPU memory using the function cudaMalloc, and use the following code to calculate the speed.
int num =1000;
I think this can happen if you are using pageable host memory instead of pinned memory (which is allocated with cudaMallocHost()). Your transfer size is also fairly small, so per-transfer overhead may be limiting the rate you see.
A quick way to check this is to run the bandwidthTest in two modes and look for the memory transfer closest to yours:
bandwidthTest --mode=shmoo --memory=pinned
bandwidthTest --mode=shmoo --memory=pageable
With pageable memory I get 2897.8 MB/s for 512 KB transfers, but about 5000 MB/s with pinned memory, on PCI-E 2.0 x16.
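If you would rather check this inside your own program than with bandwidthTest, here is a minimal sketch that times the same device-to-host copy into a pageable buffer (malloc) and a pinned buffer (cudaMallocHost). The helper names gbPerSec and timeD2H and the CHECK macro are my own, not part of CUDA. The buffer size and repeat count are taken from your post, and I am assuming your 1000 is a repeat count. The numbers you get will depend on your GPU, chipset and driver.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper macro).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Convert a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, double ms) {
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}

// Average time in ms of one device-to-host copy, measured over `reps` copies.
static float timeD2H(void* host, const void* dev, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed copy so that first-use overhead is not measured.
    CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / reps;
}

int main() {
    const size_t bytes = 368u * 288u * 4u;  // 368x288 pixels, 4 bytes each = 423936
    const int reps = 1000;                  // assumed repeat count

    void* dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMemset(dev, 0, bytes));

    // Pageable host buffer: ordinary malloc.
    void* pageable = std::malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;

    // Pinned (page-locked) host buffer.
    void* pinned = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));

    const float msPageable = timeD2H(pageable, dev, bytes, reps);
    const float msPinned = timeD2H(pinned, dev, bytes, reps);

    std::printf("pageable: %.3f ms per copy, %.2f GB/s\n",
                msPageable, gbPerSec(bytes, msPageable));
    std::printf("pinned:   %.3f ms per copy, %.2f GB/s\n",
                msPinned, gbPerSec(bytes, msPinned));

    CHECK(cudaFreeHost(pinned));
    std::free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

Compile it with nvcc. If the pinned figure comes out clearly higher than the pageable one, switching your host buffer to cudaMallocHost should help in your code as well. Keep in mind that pinned memory cannot be paged out, so it is best not to pin more than you need.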