Memory copy improvement?


I’m currently working on real-time image processing algorithms.
I tried a few simple algorithms (like histogram equalization), but I’ve hit a huge bottleneck for my real-time application: the memory copy.
I process 1280*1024-pixel images, and the memory copy with cudaMemcpy takes far too long (about 1 to 2 ms for an array of this size). I tried using cudaHostAlloc and cudaHostGetDevicePointer to optimize the flow of data, but it makes essentially no difference: instead of long memory copies, I get longer kernel executions (maybe that comes from the kernel itself).
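In case it helps, the zero-copy path I tried looks roughly like this (a simplified sketch; buffer names and the kernel are placeholders). As I understand it, with a mapped pointer the kernel reads host memory over PCIe on every access, which might explain the longer kernel times:

```cpp
// Sketch of the mapped ("zero-copy") host memory path.
// NOTE: kernel accesses through dImg go over PCIe, not device DRAM.
#include <cuda_runtime.h>

int main(void)
{
    const size_t nPixels = 1280 * 1024;
    unsigned char *hImg = 0, *dImg = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);              // before any CUDA work
    cudaHostAlloc((void**)&hImg, nPixels, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dImg, hImg, 0);   // device alias of hImg

    // myKernel<<<grid, block>>>(dImg, ...);  // hypothetical kernel: reads host
                                              // memory directly over the bus
    cudaFreeHost(hImg);
    return 0;
}
```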

What is the most efficient way to copy memory from host to device (in terms of processing speed)?
Is there any way to work quickly with host memory?

Thanks in advance for your answers.

Are you using pinned memory (cudaMallocHost)?

You can measure how long it takes to transfer the array; if you get much less than 6 GB/s, something is wrong. Otherwise there is not much that can be done, unless transferring only part of the data is acceptable. Also, you could consider getting a Kepler card - it has PCIe 3.0, which should be about 2x faster.
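A minimal way to measure it yourself, as a sketch (CUDA events around a loop of copies; buffer size matches one of your 1280*1024 8-bit images):

```cpp
// Minimal host-to-device bandwidth check with CUDA events (sketch).
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    const size_t bytes = 1280 * 1024;        // one 8-bit image
    const int    iters = 100;
    unsigned char *hBuf = 0, *dBuf = 0;
    cudaMallocHost((void**)&hBuf, bytes);    // pinned host memory
    cudaMalloc((void**)&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("H2D bandwidth: %.2f GB/s\n",
           (double)iters * bytes / (ms * 1.0e6));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dBuf); cudaFreeHost(hBuf);
    return 0;
}
```

Swapping cudaMallocHost for plain malloc in the same program shows the pinned-vs-pageable difference.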


I tried using pinned memory, but the time I save on the memory copy is lost in the kernel (maybe because of the way it processes the data).
For example, I tried this with my histogram kernel and an NPP "non-arithmetic function" kernel. I measured the processing time, and it shows a huge slowdown for the NPP kernel (from 100 µs to 30 ms…) and a slight slowdown for the histogram kernel (from 6 ms to 7 ms).

I benchmarked the time spent transferring the array:

[bandwidth graph]

The bandwidth program provided in the SDK gave me a bandwidth of 1.4 GB/s (which is indeed very slow…).

I’m using a Tesla S1070.

Thank you for your answer.

Ah, I see. cudaHostAlloc is essentially the newer name for cudaMallocHost (with extra flags).

The S1070 should have PCIe 2.0, so I’d expect 6 GB/s, or at least over 5. Some specs, however, mention that an x8 connection is also possible, which is half as fast.

The graph shows ~1.1 GB/s, about the same as your 1.4 GB/s. Could you run that SDK program in pinned mode, i.e. with the --memory=pinned option?

I don’t entirely understand how pinned memory can slow down the kernel - as long as the data is in GPU memory, it should not matter… Maybe this is a measurement problem; for instance, do you use asynchronous transfers?
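By asynchronous transfers I mean something along these lines (a sketch; the stream, buffer names, and kernel are placeholders, and the host buffer must be pinned for cudaMemcpyAsync to be asynchronous with respect to the host):

```cpp
// Sketch: issuing a host-to-device copy and kernel in a stream, so the
// host is not blocked and timings measured on the host can be misleading.
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1280 * 1024;
    unsigned char *hImg = 0, *dImg = 0;
    cudaMallocHost((void**)&hImg, bytes);   // pinned: required for true async
    cudaMalloc((void**)&dImg, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(dImg, hImg, bytes, cudaMemcpyHostToDevice, stream);
    // myKernel<<<grid, block, 0, stream>>>(dImg, ...);  // hypothetical kernel
    cudaStreamSynchronize(stream);          // wait for copy + kernel to finish

    cudaStreamDestroy(stream);
    cudaFree(dImg);
    cudaFreeHost(hImg);
    return 0;
}
```

If you time the kernel without synchronizing first, a pending transfer can get charged to the kernel's measured time.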

Using the --memory=pinned option, I get the same results (1.5 GB/s)…

By asynchronous transfers, do you mean using cudaMemcpyAsync? If so, I don’t.

Maybe the problem is with the host mainboard (slot operating in x8 or PCIe 1.1 mode).

What is your motherboard?
On both X48 and P45 chipsets I have seen roughly 5.2 GB/s pinned and 3 to 3.2 GB/s pageable. 1.4 GB/s is unbelievably slow.