Hello, everybody!
I find that in my code the speed of transferring data from the CPU to the GPU is only about 300 MB/s, while the PCIe bandwidth should be around 4 GB/s. So I'm puzzled. Can anyone help me?
My code sample is simple and I post it below:
for(udword i=0; i<nb; i++)
{
    PosList[i] = array[i]->GetMin(Axis0);
    Sorted[i] = i;
    // Box centers and extents are packed four floats per box (x, y, z, pad).
    CenList[4*i]   = array[i]->GetCenter(Axis0);
    CenList[4*i+1] = array[i]->GetCenter(Axis1);
    CenList[4*i+2] = array[i]->GetCenter(Axis2);
    CenList[4*i+3] = 0;
    ExtList[4*i]   = array[i]->GetExtents(Axis0);
    ExtList[4*i+1] = array[i]->GetExtents(Axis1);
    ExtList[4*i+2] = array[i]->GetExtents(Axis2);
    ExtList[4*i+3] = 0;
}
Sorted[nb] = nb;
PosList[nb] = MAX_FLOAT;
cutilSafeCall( cudaMemcpy( d_cen, CenList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_ext, ExtList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_pos, PosList, sizeof(float)*(NUM+1), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_Sorted, Sorted, sizeof(unsigned int)*(NUM+1), cudaMemcpyHostToDevice));
nb=32*1024;
It costs about 4 ms, but the data size is only a little more than 1 MB.
So, can anyone explain this and help me improve it? Thank you very much!
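(For reference, with nb = 32*1024 the buffers above come to 4*32768*4 bytes = 512 KB each for CenList and ExtList, plus roughly 128 KB each for PosList and Sorted, about 1.25 MB in total; 1.25 MB in 4 ms works out to roughly 310 MB/s, consistent with the ~300 MB/s figure.)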
One possibility would be to allocate one big array that contains all your data.
Then you need only one memory copy operation.
The other solution would be to use pinned memory.
You can also combine both.
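Something along these lines (just a sketch with made-up buffer names, assuming the sizes from your post, not your actual code): pack all four arrays into one pinned staging buffer and transfer it with a single cudaMemcpy.
size_t cenBytes = sizeof(float) * 4 * nb;
size_t extBytes = sizeof(float) * 4 * nb;
size_t posBytes = sizeof(float) * (nb + 1);
size_t srtBytes = sizeof(unsigned int) * (nb + 1);
size_t totalBytes = cenBytes + extBytes + posBytes + srtBytes;

// One pinned host buffer and one device buffer for everything.
unsigned char* h_staging = 0;
unsigned char* d_staging = 0;
cutilSafeCall(cudaHostAlloc((void**)&h_staging, totalBytes, cudaHostAllocDefault));
cutilSafeCall(cudaMalloc((void**)&d_staging, totalBytes));

// Host-side views into the staging buffer (all offsets stay 4-byte aligned).
float*        CenList = (float*)(h_staging);
float*        ExtList = (float*)(h_staging + cenBytes);
float*        PosList = (float*)(h_staging + cenBytes + extBytes);
unsigned int* Sorted  = (unsigned int*)(h_staging + cenBytes + extBytes + posBytes);

// ... fill CenList / ExtList / PosList / Sorted exactly as in the loop above ...

// A single host-to-device copy instead of four small ones.
cutilSafeCall(cudaMemcpy(d_staging, h_staging, totalBytes, cudaMemcpyHostToDevice));

// Device-side pointers at the same offsets.
float*        d_cen    = (float*)(d_staging);
float*        d_ext    = (float*)(d_staging + cenBytes);
float*        d_pos    = (float*)(d_staging + cenBytes + extBytes);
unsigned int* d_Sorted = (unsigned int*)(d_staging + cenBytes + extBytes + posBytes);
With buffers this small, the fixed per-copy overhead is a noticeable part of the total time, so batching the copies usually helps even before pinning.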
Thanks. I tried using pinned memory, but I find the timing is similar.
unsigned int flags = cudaHostAllocWriteCombined; // cudaHostAllocMapped;
cutilSafeCall(cudaHostAlloc((void **)&CenList, sizeof(float)*(4*mNbBoxes), flags));
cutilSafeCall(cudaHostAlloc((void **)&ExtList, sizeof(float)*(4*mNbBoxes), flags));
cutilSafeCall(cudaHostAlloc((void **)&PosList, sizeof(float)*(mNbBoxes+1), flags));
cutilSafeCall(cudaHostAlloc((void **)&Sorted, sizeof(udword)*(mNbBoxes+1), flags));
…
…
…
cutilSafeCall( cudaMemcpy( d_cen, CenList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_ext, ExtList, sizeof(float)*(NUM*4), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_pos, PosList, sizeof(float)*(NUM+1), cudaMemcpyHostToDevice));
cutilSafeCall( cudaMemcpy( d_Sorted, Sorted, sizeof(unsigned int)*(NUM+1), cudaMemcpyHostToDevice));
Did I use pinned memory correctly? Thanks again!
I also tried mapped memory, but I couldn't get it to work; it produced a heap error.
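For what it's worth, the allocation itself looks like valid pinned-memory usage. Note that cudaHostAllocWriteCombined memory is very slow to read back from the CPU (writes are fine), so cudaHostAllocDefault is usually the safer choice unless the host only ever writes the buffer. Below is a minimal sketch (illustrative names, not your code) that times a single pinned host-to-device copy with CUDA events, so you can compare the number against bandwidthTest:
// Assumes the usual SDK includes (<cstdio>, <cuda_runtime.h>, cutil_inline.h).
void TimePinnedCopy(size_t bytes)
{
    float* h_buf = 0;
    float* d_buf = 0;
    cutilSafeCall(cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault));
    cutilSafeCall(cudaMalloc((void**)&d_buf, bytes));

    cudaEvent_t start, stop;
    cutilSafeCall(cudaEventCreate(&start));
    cutilSafeCall(cudaEventCreate(&stop));

    // Time only the host-to-device transfer.
    cutilSafeCall(cudaEventRecord(start, 0));
    cutilSafeCall(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaEventRecord(stop, 0));
    cutilSafeCall(cudaEventSynchronize(stop));

    float ms = 0.0f;
    cutilSafeCall(cudaEventElapsedTime(&ms, start, stop));
    double mb = bytes / (1024.0 * 1024.0);
    printf("%.2f MB in %.3f ms -> %.1f MB/s\n", mb, ms, mb / (ms / 1000.0));

    cutilSafeCall(cudaEventDestroy(start));
    cutilSafeCall(cudaEventDestroy(stop));
    cutilSafeCall(cudaFreeHost(h_buf));
    cutilSafeCall(cudaFree(d_buf));
}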
Try running the bandwidth benchmark from the SDK; it's called bandwidthTest.exe. To benchmark with pinned memory, add "--memory=pinned".
Then you will know whether it's a problem with your code or something else.
Here are my results.
bandwidthTest.exe --memory=pinned
Running on...
Device 0: Quadro FX 4800
Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 5780.9
Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 5362.8
Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 61006.9
&&&& Test PASSED
These are my results. But my bandwidth is smaller than yours, although my GPU is a newer one.
The host-to-GPU transfer speed is mostly a property of the host, not the GPU. From the look of it, I would guess you are either using a PCI-e v1.0 motherboard, or you have your GPU in a PCI-e v2.0 slot with only 8 lanes.
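If you want to double-check the link configuration, NVML (which ships with more recent drivers) can report it; this is just a sketch, not something from the thread:
// Prints the current PCIe generation and link width for device 0.
// Build against NVML (nvml.h, link with the NVIDIA management library).
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int gen = 0, width = 0;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);  // e.g. 1 or 2
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);     // e.g. 8 or 16 lanes
    printf("PCIe gen %u, x%u link\n", gen, width);
    nvmlShutdown();
    return 0;
}
(On recent drivers, nvidia-smi -q also prints the same PCIe link information.)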
Thanks so much! I think you're right. (This link gives the speed information for PCIe 2.0: http://forums.nvidia.com/index.php?showtopic=89084)