Data transfer speed between CPU and GPU: how can I increase it?

Hello, everybody!
In my code, the speed of transferring data from the CPU to the GPU is only about 300 MB/s, while the PCI speed is around 4 GB/s, so I'm puzzled. Can anyone help me?
My code sample is simple, and I post it below:
for(udword i=0;i<nb;i++)
{
    PosList[i] = array[i]->GetMin(Axis0);
    Sorted[i] = i;
    CenList[4*i] = array[i]->GetCenter(Axis0);
    CenList[4*i+1] = array[i]->GetCenter(Axis1);
    CenList[4*i+2] = array[i]->GetCenter(Axis2);
    CenList[4*i+3] = 0;
    ExtList[4*i] = array[i]->GetExtents(Axis0);
    ExtList[4*i+1] = array[i]->GetExtents(Axis1);
    ExtList[4*i+2] = array[i]->GetExtents(Axis2);
    ExtList[4*i+3] = 0;
}
Sorted[nb] = nb;
PosList[nb] = MAX_FLOAT;
cutilSafeCall( cudaMemcpy( d_cen, CenList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_ext, ExtList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_pos, PosList, sizeof(float)*(NUM+1), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_Sorted, Sorted, sizeof(unsigned int)*(NUM+1), cudaMemcpyHostToDevice));
nb=32*1024;
It takes 4 ms, but the total data size is only a little over 1 MB.
So, can anyone explain this and help me improve it? Thank you very much!

One possibility would be to allocate one big array that contains all your data.

Then you need only one memory copy operation.

The other solution would be to use pinned memory.

You can also combine both.
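
A minimal sketch of combining both suggestions, assuming the four arrays from the original post can live back to back in one buffer. The names `PackedLayout`, `makeLayout`, `h_buf`, `d_buf` and the `CHECK` macro are hypothetical and not part of the original code, and the fill loop writes placeholder values instead of calling `GetCenter`/`GetExtents`.

```cuda
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Byte offsets of the four arrays inside one packed buffer (hypothetical layout).
struct PackedLayout {
    size_t cen, ext, pos, sorted, total;
};

PackedLayout makeLayout(size_t nb) {
    PackedLayout l;
    l.cen    = 0;                                          // 4*nb floats
    l.ext    = l.cen + sizeof(float) * 4 * nb;             // 4*nb floats
    l.pos    = l.ext + sizeof(float) * 4 * nb;             // nb+1 floats
    l.sorted = l.pos + sizeof(float) * (nb + 1);           // nb+1 unsigned ints
    l.total  = l.sorted + sizeof(unsigned int) * (nb + 1);
    return l;
}

int main() {
    const size_t nb = 32 * 1024;
    const PackedLayout l = makeLayout(nb);

    unsigned char* h_buf = nullptr;  // pinned (page-locked) host buffer
    unsigned char* d_buf = nullptr;  // device buffer with the same layout
    CHECK(cudaMallocHost((void**)&h_buf, l.total));
    CHECK(cudaMalloc((void**)&d_buf, l.total));

    // Host-side views into the single pinned buffer.
    float*        CenList = (float*)(h_buf + l.cen);
    float*        ExtList = (float*)(h_buf + l.ext);
    float*        PosList = (float*)(h_buf + l.pos);
    unsigned int* Sorted  = (unsigned int*)(h_buf + l.sorted);

    // Placeholder fill; the real code would read from the box array here.
    for (size_t i = 0; i < nb; ++i) {
        PosList[i] = (float)i;
        Sorted[i]  = (unsigned int)i;
        for (int k = 0; k < 4; ++k) {
            CenList[4 * i + k] = 0.0f;
            ExtList[4 * i + k] = 0.0f;
        }
    }
    Sorted[nb]  = (unsigned int)nb;
    PosList[nb] = 3.402823466e+38f;  // stands in for MAX_FLOAT

    // One transfer instead of four; kernels would use d_buf + offset as pointers.
    CHECK(cudaMemcpy(d_buf, h_buf, l.total, cudaMemcpyHostToDevice));
    printf("copied %zu bytes in one transfer\n", l.total);

    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return 0;
}
```

On the device, `d_buf + l.cen`, `d_buf + l.ext` and so on would take the place of `d_cen`, `d_ext`, `d_pos` and `d_Sorted`. With this layout the center and extent blocks start at multiples of 16 bytes, so reading them as `float4` should still be possible.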

Thanks. I tried using pinned memory, but the timing is about the same.

unsigned int flags = cudaHostAllocWriteCombined;//cudaHostAllocMapped;
cutilSafeCall(cudaHostAlloc((void **)&CenList, sizeof(float)*(4*mNbBoxes), flags));
cutilSafeCall(cudaHostAlloc((void **)&ExtList, sizeof(float)*(4*mNbBoxes), flags));
cutilSafeCall(cudaHostAlloc((void **)&PosList, sizeof(float)*(mNbBoxes+1), flags));
cutilSafeCall(cudaHostAlloc((void **)&Sorted, sizeof(udword)*(mNbBoxes+1), flags));

cutilSafeCall( cudaMemcpy( d_cen, CenList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_ext, ExtList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_pos, PosList, sizeof(float)*(NUM+1), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_Sorted, Sorted, sizeof(unsigned int)*(NUM+1), cudaMemcpyHostToDevice));

Did I use pinned memory correctly? Thanks again!

I also tried mapped memory, but I couldn't get it to work: it fails with a heap error.

Try running the bandwidth benchmark from the SDK; it's called bandwidthTest.exe. To benchmark with pinned memory, add "--memory=pinned".

Then you will know whether it's a problem with your code or something else.
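
To compare your own transfer against the benchmark numbers, it can also help to time only the copy, separately from the host-side loop that fills the arrays. Below is a rough sketch using CUDA events, assuming a single pinned buffer of about the same size as the data in the original post. `CHECK`, `bandwidthMBps` and the iteration count are hypothetical choices, and 1 MB is taken as 2^20 bytes here.

```cuda
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Convert a byte count and an elapsed time in milliseconds to MB/s.
double bandwidthMBps(size_t bytes, float ms) {
    return (bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}

int main() {
    const size_t bytes = 1310728;  // roughly the total size from the original post
    const int    iters = 100;      // repeat to average out per-call overhead

    void* h_buf = nullptr;
    void* d_buf = nullptr;
    CHECK(cudaMallocHost(&h_buf, bytes));  // pinned host memory
    CHECK(cudaMalloc(&d_buf, bytes));
    memset(h_buf, 0, bytes);

    // Warm-up copy so one-time initialization is not included in the timing.
    CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));  // total time for all copies
    printf("avg copy time: %.3f ms, bandwidth: %.1f MB/s\n",
           ms / iters, bandwidthMBps(bytes * iters, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return 0;
}
```

If this number is close to what bandwidthTest reports for the same transfer size, the remaining time is probably being spent on the host side rather than in the copy itself.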

Here are my results.

bandwidthTest.exe --memory=pinned
Running on......
	  device 0:Quadro FX 4800

Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               5780.9

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               5362.8

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               61006.9

&&&& Test PASSED

These are my results. My bandwidth is lower than yours, even though my GPU is newer.
(Attachment: bandwithTest.jpg)

The host-gpu transfer speeds are mostly a property of the host and not the GPU. From the look of it, I would guess you are either using a PCI-e v1.0 motherboard, or you have your GPU in a PCI-e v2 slot with only 8 lanes.

Thanks so much! I think you're right. (This link gives the speed information for PCIe 2.0: http://forums.nvidia.com/index.php?showtopic=89084)