Hello, everybody!
I find that in my code the speed of transferring data from the CPU to the GPU is only about 300 MB/s, while the PCIe bandwidth should be around 4 GB/s. So I'm puzzled. Can anyone help me?
My code sample is simple and I post it below:
for(udword i=0; i<nb; i++)
{
    PosList[i] = array[i]->GetMin(Axis0);
    Sorted[i] = i;
    // Box centers and extents are packed four floats per box (x, y, z, pad).
    CenList[4*i]   = array[i]->GetCenter(Axis0);
    CenList[4*i+1] = array[i]->GetCenter(Axis1);
    CenList[4*i+2] = array[i]->GetCenter(Axis2);
    CenList[4*i+3] = 0;
    ExtList[4*i]   = array[i]->GetExtents(Axis0);
    ExtList[4*i+1] = array[i]->GetExtents(Axis1);
    ExtList[4*i+2] = array[i]->GetExtents(Axis2);
    ExtList[4*i+3] = 0;
}
Sorted[nb] = nb;
PosList[nb] = MAX_FLOAT;
cutilSafeCall( cudaMemcpy( d_cen, CenList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_ext, ExtList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_pos, PosList, sizeof(float)*(NUM+1), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_Sorted, Sorted, sizeof(unsigned int)*(NUM+1), cudaMemcpyHostToDevice));
nb=32*1024;
It costs about 4 ms, but the data size is only a little more than 1 MB.
So, can anyone explain this and help me improve it? Thank you very much!
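(For reference, with nb = 32*1024 the buffers above come to 4*32768*4 bytes = 512 KB each for CenList and ExtList, plus roughly 128 KB each for PosList and Sorted, about 1.25 MB in total; 1.25 MB in 4 ms works out to roughly 310 MB/s, consistent with the ~300 MB/s figure.)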
One possibility would be to allocate one big array that contains all your data.
Then you need only one memory copy operation.
The other solution would be to use pinned memory.
You can also combine both.
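Something along these lines (just a sketch with made-up buffer names, assuming the sizes from your post, not your actual code): pack all four arrays into one pinned staging buffer and transfer it with a single cudaMemcpy.
size_t cenBytes = sizeof(float) * 4 * nb;
size_t extBytes = sizeof(float) * 4 * nb;
size_t posBytes = sizeof(float) * (nb + 1);
size_t srtBytes = sizeof(unsigned int) * (nb + 1);
size_t totalBytes = cenBytes + extBytes + posBytes + srtBytes;

// One pinned host buffer and one device buffer for everything.
unsigned char* h_staging = 0;
unsigned char* d_staging = 0;
cutilSafeCall(cudaHostAlloc((void**)&h_staging, totalBytes, cudaHostAllocDefault));
cutilSafeCall(cudaMalloc((void**)&d_staging, totalBytes));

// Host-side views into the staging buffer (all offsets stay 4-byte aligned).
float*        CenList = (float*)(h_staging);
float*        ExtList = (float*)(h_staging + cenBytes);
float*        PosList = (float*)(h_staging + cenBytes + extBytes);
unsigned int* Sorted  = (unsigned int*)(h_staging + cenBytes + extBytes + posBytes);

// ... fill CenList / ExtList / PosList / Sorted exactly as in the loop above ...

// A single host-to-device copy instead of four small ones.
cutilSafeCall(cudaMemcpy(d_staging, h_staging, totalBytes, cudaMemcpyHostToDevice));

// Device-side pointers at the same offsets.
float*        d_cen    = (float*)(d_staging);
float*        d_ext    = (float*)(d_staging + cenBytes);
float*        d_pos    = (float*)(d_staging + cenBytes + extBytes);
unsigned int* d_Sorted = (unsigned int*)(d_staging + cenBytes + extBytes + posBytes);
With buffers this small, the fixed per-copy overhead is a noticeable part of the total time, so batching the copies usually helps even before pinning.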
Thanks. I tried using pinned memory, but I find the timing is similar.
unsigned int flags = cudaHostAllocWriteCombined; // cudaHostAllocMapped;
cutilSafeCall(cudaHostAlloc((void **)&CenList, sizeof(float)*(4*mNbBoxes), flags));
cutilSafeCall(cudaHostAlloc((void **)&ExtList, sizeof(float)*(4*mNbBoxes), flags));
cutilSafeCall(cudaHostAlloc((void **)&PosList, sizeof(float)*(mNbBoxes+1), flags));
cutilSafeCall(cudaHostAlloc((void **)&Sorted, sizeof(udword)*(mNbBoxes+1), flags));
…
…
…
cutilSafeCall( cudaMemcpy( d_cen, CenList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_ext, ExtList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_pos, PosList, sizeof(float)*(NUM+1), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_Sorted, Sorted, sizeof(unsigned int)*(NUM+1), cudaMemcpyHostToDevice));
Did I use pinned memory correctly? Thanks again!
I also tried mapped memory, but I couldn't get it to work; it produced a heap error.
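For what it's worth, the allocation itself looks like valid pinned-memory usage. Note that cudaHostAllocWriteCombined memory is very slow to read back from the CPU (writes are fine), so cudaHostAllocDefault is usually the safer choice unless the host only ever writes the buffer. Below is a minimal sketch (illustrative names, not your code) that times a single pinned host-to-device copy with CUDA events, so you can compare the number against bandwidthTest:
// Assumes the usual SDK includes (<cstdio>, <cuda_runtime.h>, cutil_inline.h).
void TimePinnedCopy(size_t bytes)
{
    float* h_buf = 0;
    float* d_buf = 0;
    cutilSafeCall(cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault));
    cutilSafeCall(cudaMalloc((void**)&d_buf, bytes));

    cudaEvent_t start, stop;
    cutilSafeCall(cudaEventCreate(&start));
    cutilSafeCall(cudaEventCreate(&stop));

    // Time only the host-to-device transfer.
    cutilSafeCall(cudaEventRecord(start, 0));
    cutilSafeCall(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaEventRecord(stop, 0));
    cutilSafeCall(cudaEventSynchronize(stop));

    float ms = 0.0f;
    cutilSafeCall(cudaEventElapsedTime(&ms, start, stop));
    double mb = bytes / (1024.0 * 1024.0);
    printf("%.2f MB in %.3f ms -> %.1f MB/s\n", mb, ms, mb / (ms / 1000.0));

    cutilSafeCall(cudaEventDestroy(start));
    cutilSafeCall(cudaEventDestroy(stop));
    cutilSafeCall(cudaFreeHost(h_buf));
    cutilSafeCall(cudaFree(d_buf));
}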
Try running the bandwidth benchmark from the SDK; it's called bandwidthTest.exe. To benchmark with pinned memory, add "--memory=pinned".
Then you will know whether it's a problem with your code or something else.
Here are my results.
bandwidthTest.exe --memory=pinned
Running on...
Device 0: Quadro FX 4800
Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 5780.9
Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 5362.8
Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 61006.9
&&&& Test PASSED
These are my results. But my bandwidth is smaller than yours, although my GPU is a newer one.
The host-to-GPU transfer speed is mostly a property of the host, not the GPU. From the look of it, I would guess you are either using a PCI-e v1.0 motherboard, or you have your GPU in a PCI-e v2.0 slot with only 8 lanes.
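If you want to double-check the link configuration, NVML (which ships with more recent drivers) can report it; this is just a sketch, not something from the thread:
// Prints the current PCIe generation and link width for device 0.
// Build against NVML (nvml.h, link with the NVIDIA management library).
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int gen = 0, width = 0;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);  // e.g. 1 or 2
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);     // e.g. 8 or 16 lanes
    printf("PCIe gen %u, x%u link\n", gen, width);
    nvmlShutdown();
    return 0;
}
(On recent drivers, nvidia-smi -q also prints the same PCIe link information.)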
Thanks so much! I think you're right. (This link gives the speed information for PCIe 2.0: http://forums.nvidia.com/index.php?showtopic=89084)