The speed of data transfer between GPU and CPU

As far as I remember, PCIe 1.0 x16 can reach 4 GB/s when transferring data between the GPU and CPU.
But in my experiment, it takes 0.322 ms to transfer a 368 x 288 image (32 bits per pixel, 423936 bytes) between GPU and CPU (the code below copies host to device), which means the speed is only 1.3 GB/s.
I allocate the GPU memory with cudaMalloc and use the following code to measure the speed.
int num = 1000;  // number of copies to average over

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );

// DATA_SIZE = 368 * 288 * sizeof(float)
for (int i = 0; i < num; i++)
    CUDA_SAFE_CALL( cudaMemcpy(d_Data, h_Data, DATA_SIZE, cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutStopTimer(hTimer) );
gpuTime = cutGetTimerValue(hTimer) * 1.0 / num;  // average time per copy, in ms
printf("...data transfer() time: %f msecs; \n", gpuTime);

So I am a little confused.

Could someone give me some advice?

Thank you very much!
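
For reference, the 1.3 GB/s figure in the question can be reproduced from the numbers given there. This is only a worked check of the arithmetic, assuming 1 GB = 10^9 bytes and taking the 4 GB/s figure for PCIe 1.0 x16 as the nominal peak quoted in the question:

```latex
\begin{align*}
\text{size} &= 368 \times 288 \times 4\ \text{bytes} = 423\,936\ \text{bytes} \\
\text{bandwidth} &= \frac{423\,936\ \text{bytes}}{0.322 \times 10^{-3}\ \text{s}}
  \approx 1.32 \times 10^{9}\ \text{bytes/s} \approx 1.3\ \text{GB/s} \\
\frac{1.32\ \text{GB/s}}{4\ \text{GB/s}} &\approx 0.33
\end{align*}
```

So the measured rate is roughly one third of the nominal peak.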


I think this may be because you are using pageable host memory instead of pinned memory (which is allocated with cudaMallocHost()). Also, your transfer size is not optimal; small transfers usually achieve lower bandwidth than large ones.

A quick way to check this is to run the bandwidthTest in two modes and look for the memory transfer closest to yours:
bandwidthTest --mode=shmoo --memory=pinned
bandwidthTest --mode=shmoo --memory=pageable

With pageable memory I get 2897.8 MB/s for 512k bytes, but about 5000 MB/s with pinned memory, on PCI-E 2.0 x16.
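
If you want to see the same effect inside your own program, something like the following might do. It is only a rough sketch, not how bandwidthTest itself is implemented: the CHECK macro, the helper names mbPerSec and avgCopyMs, the list of sizes, and the choice of 1 MB = 10^6 bytes are all my own assumptions. It times host-to-device copies with CUDA events, once from a malloc() buffer and once from a cudaMallocHost() buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the thread): stop on any CUDA runtime error.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Convert bytes moved in `ms` milliseconds to MB/s (assumption: 1 MB = 1e6 bytes).
static double mbPerSec(size_t bytes, double ms) {
    return ((double)bytes / 1.0e6) / (ms / 1.0e3);
}

// Average time in ms of `reps` host-to-device copies, measured with CUDA events.
static float avgCopyMs(void *d_dst, const void *h_src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));  // wait until all copies have finished
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / reps;
}

int main() {
    // First entry is the image size from the question; the others are arbitrary.
    const size_t sizes[] = { 368 * 288 * sizeof(float), 512 * 1024, 4 * 1024 * 1024 };
    const size_t maxBytes = 4 * 1024 * 1024;
    const int reps = 1000;

    void *d_buf = NULL, *h_pinned = NULL;
    void *h_pageable = malloc(maxBytes);            // ordinary pageable memory
    if (h_pageable == NULL) { fprintf(stderr, "malloc failed\n"); return EXIT_FAILURE; }
    CHECK(cudaMalloc(&d_buf, maxBytes));
    CHECK(cudaMallocHost(&h_pinned, maxBytes));     // page-locked (pinned) memory
    memset(h_pageable, 0, maxBytes);
    memset(h_pinned, 0, maxBytes);

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i) {
        float msPageable = avgCopyMs(d_buf, h_pageable, sizes[i], reps);
        float msPinned   = avgCopyMs(d_buf, h_pinned,   sizes[i], reps);
        printf("%8zu bytes: pageable %8.1f MB/s, pinned %8.1f MB/s\n", sizes[i],
               mbPerSec(sizes[i], msPageable), mbPerSec(sizes[i], msPinned));
    }

    CHECK(cudaFreeHost(h_pinned));
    CHECK(cudaFree(d_buf));
    free(h_pageable);
    return 0;
}
```

The actual numbers depend on the board, chipset and driver, so treat the output only as a relative comparison between the two kinds of host memory.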


I get the same results as Demg.

But how can I use pinned memory with cudaMallocHost()?

Just look up the function in the manual. That's how you allocate pinned memory on the host, instead of using malloc(), which allocates pageable memory.
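
In practice the change is small: the host buffer comes from cudaMallocHost() instead of malloc() and is released with cudaFreeHost() instead of free(). Here is a minimal sketch, assuming the image size from the question; the function name copyFromPinned and the zero-fill are made up for illustration and are not from the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical example function: copy `bytes` from a pinned host buffer to the GPU.
// Returns 0 on success and 1 on any CUDA error.
static int copyFromPinned(size_t bytes) {
    float *h_data = NULL;
    float *d_data = NULL;

    // Pinned host allocation; this takes the place of malloc(bytes).
    cudaError_t err = cudaMallocHost((void **)&h_data, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaMalloc((void **)&d_data, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        cudaFreeHost(h_data);
        return 1;
    }

    // The pinned buffer is used like any other host pointer.
    for (size_t i = 0; i < bytes / sizeof(float); ++i)
        h_data[i] = 0.0f;

    err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    cudaFreeHost(h_data);  // pinned memory is released with cudaFreeHost, not free()
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    return copyFromPinned(368 * 288 * sizeof(float));
}
```

Keep in mind that pinned memory is taken away from what the operating system can page, so it is usually better to pin only the buffers that are actually transferred.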

Thanks for your help! :rolleyes: