Memory copy improvement?


I’m currently working on real-time image processing algorithms.
I tried a few simple algorithms (like histogram equalization), but I’ve hit a huge bottleneck for my real-time application: the memory copy.
I process 1280*1024-pixel images, and the memory copy with cudaMemcpy takes far too long (about 1 to 2 ms for an array of this size). I tried using cudaHostAlloc and cudaHostGetDevicePointer to optimize the flow of data, but it makes essentially no difference: instead of long memory copies, I get longer kernel executions (maybe that comes from the kernel itself).
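In case it helps, the zero-copy path I tried looks roughly like this (a simplified sketch; buffer names and the kernel are placeholders). As I understand it, with a mapped pointer the kernel reads host memory over PCIe on every access, which might explain the longer kernel times:

```cpp
// Sketch of the mapped ("zero-copy") host memory path.
// NOTE: kernel accesses through dImg go over PCIe, not device DRAM.
#include <cuda_runtime.h>

int main(void)
{
    const size_t nPixels = 1280 * 1024;
    unsigned char *hImg = 0, *dImg = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);              // before any CUDA work
    cudaHostAlloc((void**)&hImg, nPixels, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dImg, hImg, 0);   // device alias of hImg

    // myKernel<<<grid, block>>>(dImg, ...);  // hypothetical kernel: reads host
                                              // memory directly over the bus
    cudaFreeHost(hImg);
    return 0;
}
```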

What is the most efficient way to copy memory from host to device (in terms of processing speed)?
Is there any way to work quickly with host memory?

Thanks in advance for your answers.

Are you using pinned memory (cudaMallocHost)?

You can measure how long it takes to transfer the array; if you get much less than 6 GB/s, something is wrong. Otherwise there is not much that can be done, unless transferring only part of the data is acceptable. Also, you could consider getting a Kepler card - it has PCIe 3.0, which should be about 2x faster.
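A minimal way to measure it yourself, as a sketch (CUDA events around a loop of copies; buffer size matches one of your 1280*1024 8-bit images):

```cpp
// Minimal host-to-device bandwidth check with CUDA events (sketch).
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    const size_t bytes = 1280 * 1024;        // one 8-bit image
    const int    iters = 100;
    unsigned char *hBuf = 0, *dBuf = 0;
    cudaMallocHost((void**)&hBuf, bytes);    // pinned host memory
    cudaMalloc((void**)&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("H2D bandwidth: %.2f GB/s\n",
           (double)iters * bytes / (ms * 1.0e6));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dBuf); cudaFreeHost(hBuf);
    return 0;
}
```

Swapping cudaMallocHost for plain malloc in the same program shows the pinned-vs-pageable difference.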


I tried using pinned memory, but the time I save on the memory copy is lost in the kernel (maybe because of the way it processes the data).
For example, I tried this with my histogram kernel and an NPP "non-arithmetic function" kernel. I measured the processing time, and it shows a huge slowdown for the NPP kernel (from 100 µs to 30 ms…) and a slight slowdown for the histogram kernel (from 6 ms to 7 ms).

I benchmarked the time spent transferring the array:

[bandwidth graph]

The bandwidth program provided in the SDK gave me a bandwidth of 1.4 GB/s (which is indeed very slow…).

I’m using a Tesla S1070.

Thank you for your answer.

Ah, I see. cudaHostAlloc is essentially the newer name for cudaMallocHost (with extra flags).

The S1070 should have PCIe 2.0, so I’d expect 6 GB/s, or at least over 5. Some specs, however, mention that an x8 connection is also possible, which is half as fast.

The graph shows ~1.1 GB/s, about the same as your 1.4 GB/s. Could you run that SDK program in pinned mode, i.e. with the --memory=pinned option?

I don’t entirely understand how pinned memory can slow down the kernel - as long as the data is in GPU memory, it should not matter… Maybe this is a measurement problem; for instance, do you use asynchronous transfers?
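By asynchronous transfers I mean something along these lines (a sketch; the stream, buffer names, and kernel are placeholders, and the host buffer must be pinned for cudaMemcpyAsync to be asynchronous with respect to the host):

```cpp
// Sketch: issuing a host-to-device copy and kernel in a stream, so the
// host is not blocked and timings measured on the host can be misleading.
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1280 * 1024;
    unsigned char *hImg = 0, *dImg = 0;
    cudaMallocHost((void**)&hImg, bytes);   // pinned: required for true async
    cudaMalloc((void**)&dImg, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(dImg, hImg, bytes, cudaMemcpyHostToDevice, stream);
    // myKernel<<<grid, block, 0, stream>>>(dImg, ...);  // hypothetical kernel
    cudaStreamSynchronize(stream);          // wait for copy + kernel to finish

    cudaStreamDestroy(stream);
    cudaFree(dImg);
    cudaFreeHost(hImg);
    return 0;
}
```

If you time the kernel without synchronizing first, a pending transfer can get charged to the kernel's measured time.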

Using the --memory=pinned option, I get the same results (1.5 GB/s)…

By asynchronous transfers, do you mean using cudaMemcpyAsync? If so, I don’t.

Maybe the problem is with the host mainboard (slot operating in x8 or PCIe 1.1 mode).

What is your motherboard?
On both X48 and P45 chipsets I have seen roughly 5.2 GB/s pinned and 3 to 3.2 GB/s pageable. 1.4 GB/s is unbelievably slow.