In my program, after doing all the calculations, I have to copy the resultant image of size 2400 × 1800 from device to host using cudaMemcpy. But it takes 21 ms, which is very expensive for my program because it is about 30% of the overall execution time.
You could also try to overlap copying the current result with launching the next loop iteration's kernel (well, actually the other way round).
Since kernel launches are asynchronous, you could do something like this:
// Prepare output space 1.
// Run kernel iteration 1.
// Run kernel iteration 2 --> async launch.
// Copy kernel iteration 1 output to output space 1.
cudaDeviceSynchronize(); // --> very important (cudaThreadSynchronize() is deprecated).
// Run kernel iteration 3 --> async launch.
// Copy kernel iteration 2 output...
cudaDeviceSynchronize(); // --> very important.
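The steps above could be sketched as follows. This is only an illustration, not your actual code: the kernel `process`, the double-buffering scheme, and the stream names are all assumptions. Note that real copy/compute overlap requires cudaMemcpyAsync in a separate stream plus page-locked (pinned) host memory; a plain cudaMemcpy would serialize with the kernel.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for your per-iteration work.
__global__ void process(unsigned char *out, int n, int iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (unsigned char)(i + iter); // dummy computation
}

int main()
{
    const int n = 2400 * 1800; // image size from the question
    const int iters = 4;       // assumed number of loop iterations

    unsigned char *h_out;      // pinned host memory: required for async copies
    cudaMallocHost(&h_out, n);

    unsigned char *d_out[2];   // double-buffered device output
    cudaMalloc(&d_out[0], n);
    cudaMalloc(&d_out[1], n);

    cudaStream_t comp, copy;   // one stream for compute, one for copies
    cudaStreamCreate(&comp);
    cudaStreamCreate(&copy);

    dim3 block(256), grid((n + 255) / 256);

    for (int it = 0; it < iters; ++it) {
        int cur = it & 1;
        // Launch iteration `it` asynchronously in the compute stream.
        process<<<grid, block, 0, comp>>>(d_out[cur], n, it);
        if (it > 0) {
            // Meanwhile, copy the *previous* iteration's result back to the host.
            cudaMemcpyAsync(h_out, d_out[cur ^ 1], n,
                            cudaMemcpyDeviceToHost, copy);
        }
        cudaDeviceSynchronize(); // very important: both streams done before reuse
    }
    // Drain the last iteration's result.
    cudaMemcpyAsync(h_out, d_out[(iters - 1) & 1], n,
                    cudaMemcpyDeviceToHost, copy);
    cudaDeviceSynchronize();

    cudaStreamDestroy(comp);
    cudaStreamDestroy(copy);
    cudaFree(d_out[0]);
    cudaFree(d_out[1]);
    cudaFreeHost(h_out);
    return 0;
}
```

With this pattern the 21 ms transfer is hidden behind the next kernel's execution instead of adding to it, as long as each kernel iteration takes at least as long as the copy.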
Search for "overlapping data transfers" or look at the asyncAPI and simpleStreams examples in the CUDA SDK samples.