slow runtime caused by cudaMemcpy()

GadiK · November 17, 2009, 4:13pm

I’m using cudaMemcpy() to read data from the device.
This ruins my runtime.

My simplified code:
Sorry the system won’t let me insert the code. I’ll try again later.

runtime:

As you can see the runtime jumps to 3000 which ruins my original (time crucial) program.

Is there any other way to get the data from the device?

Thanks.

Jimmy_Pettersson · November 17, 2009, 4:55pm

You can possibly use streams to speed up your transfer. This entails pinning host memory and doing asynchronous memory transfers. Look in the programming guide; it has a really crappy example there (so maybe the SDK examples are better, but they also suck a bit).

Or maybe you can use some of that good ol' OpenGL interoperability to display your numbers directly on the screen :)
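
The streams suggestion can be sketched roughly as follows. This is a minimal illustration, not the original poster's code (which was never posted): `fillKernel` and `asyncReadback` are hypothetical names, and the kernel is only a stand-in for the real computation. The host buffer is pinned with `cudaMallocHost` so that `cudaMemcpyAsync` can actually run asynchronously with respect to the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function on any CUDA error.
// (A sketch: resources allocated earlier are not released on failure.)
#define CUDA_TRY(call)                                              \
    do {                                                            \
        cudaError_t e_ = (call);                                    \
        if (e_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s: %s\n", #call,                      \
                    cudaGetErrorString(e_));                        \
            return 1;                                               \
        }                                                           \
    } while (0)

// Hypothetical kernel standing in for the real computation.
__global__ void fillKernel(float *out, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches the kernel and reads the result back through a stream.
// Returns 0 on success and stores the first element in *firstElement.
int asyncReadback(int n, float value, float *firstElement)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *dBuf = nullptr;
    float *hBuf = nullptr;
    cudaStream_t stream;

    CUDA_TRY(cudaMalloc((void **)&dBuf, bytes));
    CUDA_TRY(cudaMallocHost((void **)&hBuf, bytes));  // pinned host memory
    CUDA_TRY(cudaStreamCreate(&stream));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads, 0, stream>>>(dBuf, n, value);
    CUDA_TRY(cudaGetLastError());

    // Queued behind the kernel in the same stream; returns immediately.
    CUDA_TRY(cudaMemcpyAsync(hBuf, dBuf, bytes,
                             cudaMemcpyDeviceToHost, stream));

    // ... the CPU is free to do unrelated work here ...

    // Block only at the point where the data is actually needed.
    CUDA_TRY(cudaStreamSynchronize(stream));
    *firstElement = hBuf[0];

    CUDA_TRY(cudaStreamDestroy(stream));
    CUDA_TRY(cudaFreeHost(hBuf));
    CUDA_TRY(cudaFree(dBuf));
    return 0;
}
```

With a single stream this hides the copy only behind CPU work. Overlapping the copy with another kernel would need a second stream and a card that supports concurrent copy and execution.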

Sarnath · November 18, 2009, 11:17am

Streams help only if your card supports overlapping kernel execution with memory copies… and of course, if your application can leverage such a trick…

Otherwise, one can consider using pinned memory or the zero-copy method…
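
One way to check whether pinned memory helps on a given machine is to time the same device-to-host copy into a pageable buffer and into a pinned one. The sketch below assumes nothing about the original program; `timeDeviceToHost` and `comparePageableAndPinned` are hypothetical helper names, and the actual numbers depend entirely on the card, the bus, and the transfer size.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Times one blocking device-to-host copy with CUDA events.
// Returns elapsed milliseconds, or -1.0f on error.
float timeDeviceToHost(void *hostDst, const void *devSrc, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = -1.0f;

    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(hostDst, devSrc, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Measures the same copy into pageable (malloc) and pinned
// (cudaMallocHost) host memory. Returns 0 on success.
int comparePageableAndPinned(size_t bytes, float *pageableMs, float *pinnedMs)
{
    void *dBuf = nullptr;
    void *pinned = nullptr;
    void *pageable = malloc(bytes);
    if (!pageable) return 1;

    if (cudaMalloc(&dBuf, bytes) != cudaSuccess) {
        free(pageable);
        return 1;
    }
    if (cudaMallocHost(&pinned, bytes) != cudaSuccess) {
        cudaFree(dBuf);
        free(pageable);
        return 1;
    }
    cudaMemset(dBuf, 0, bytes);

    // Warm-up copy so one-time setup cost is not charged to either run.
    cudaMemcpy(pinned, dBuf, bytes, cudaMemcpyDeviceToHost);

    *pageableMs = timeDeviceToHost(pageable, dBuf, bytes);
    *pinnedMs = timeDeviceToHost(pinned, dBuf, bytes);

    cudaFreeHost(pinned);
    cudaFree(dBuf);
    free(pageable);
    return (*pageableMs < 0.0f || *pinnedMs < 0.0f) ? 1 : 0;
}
```

Calling `comparePageableAndPinned` with the transfer size the real program uses gives a quick indication of whether switching the host buffer to pinned memory is worth it.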

Jimmy_Pettersson · November 19, 2009, 9:37am

Hey Sarnath,

Do you know the performance difference between using regular page-locked memory (cudaMallocHost) vs. using the “pinned” pageable memory (cudaHostAlloc)?

Sure there is a thread on it somewhere… :)

thanks,
Jim

Sarnath · November 19, 2009, 10:13am

Hey Jim,

As long as your memory is pinned, “cudaMemcpy” will use the card's DMA engine for the transfer (no matter whether you allocate using cudaMallocHost or cudaHostAlloc). The latter API, I think, was introduced for zero-copy, i.e. kernels accessing host memory directly. But I don't think we are talking about that here…

Let me know what you think…

Best Regards,
Sarnath
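
To illustrate the point that both allocators hand back pinned memory: `cudaHostAlloc` called with `cudaHostAllocDefault` is documented to behave like `cudaMallocHost`, and either buffer can be the destination of a `cudaMemcpy`. The sketch below uses a hypothetical helper name, `copyIntoBothPinnedBuffers`, and an arbitrary fill byte.

```cuda
#include <cuda_runtime.h>

// Fills a device buffer with a known byte, then copies it into two
// pinned host buffers, one from each allocation API.
// Returns 0 if both copies succeed and both buffers hold the data.
int copyIntoBothPinnedBuffers(size_t bytes)
{
    unsigned char *dBuf = nullptr;
    unsigned char *viaMallocHost = nullptr;
    unsigned char *viaHostAlloc = nullptr;
    int rc = 1;

    if (bytes == 0) return 1;
    if (cudaMalloc((void **)&dBuf, bytes) != cudaSuccess) return 1;

    // Older entry point: plain page-locked memory.
    bool okA = cudaMallocHost((void **)&viaMallocHost, bytes) == cudaSuccess;

    // Newer entry point: cudaHostAllocDefault asks for the same thing.
    // Other flags (e.g. cudaHostAllocMapped) add extra behaviour.
    bool okB = cudaHostAlloc((void **)&viaHostAlloc, bytes,
                             cudaHostAllocDefault) == cudaSuccess;

    if (okA && okB &&
        cudaMemset(dBuf, 0x5A, bytes) == cudaSuccess &&
        cudaMemcpy(viaMallocHost, dBuf, bytes,
                   cudaMemcpyDeviceToHost) == cudaSuccess &&
        cudaMemcpy(viaHostAlloc, dBuf, bytes,
                   cudaMemcpyDeviceToHost) == cudaSuccess &&
        viaMallocHost[0] == 0x5A &&
        viaHostAlloc[bytes - 1] == 0x5A) {
        rc = 0;
    }

    // Both kinds of pinned buffer are released with cudaFreeHost.
    if (okA) cudaFreeHost(viaMallocHost);
    if (okB) cudaFreeHost(viaHostAlloc);
    cudaFree(dBuf);
    return rc;
}
```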

avidday · November 19, 2009, 11:19am

Yes, cudaHostAlloc() is for zero-copy memory, which is something different again: the GPU can directly read from and write to host memory over the PCI-e bus. It is handy in some situations (like pushing small amounts of data back from a reduction kernel, for example), but not really in the same vein as pinned versus pageable memory.
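
A minimal sketch of the zero-copy case described above, where a kernel writes a small result straight into mapped host memory and no `cudaMemcpy` is issued. The names `sumKernel` and `zeroCopySum` are hypothetical, and the single-thread loop is only a stand-in for a real reduction. It assumes device 0 and a card that reports `canMapHostMemory`.

```cuda
#include <cuda_runtime.h>

// Stand-in for a reduction: one thread sums 1..n and writes the
// result directly into mapped host memory.
__global__ void sumKernel(int n, int *result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int s = 0;
        for (int i = 1; i <= n; ++i) s += i;
        *result = s;
    }
}

// Returns 0 on success and stores the sum in *sumOut.
int zeroCopySum(int n, int *sumOut)
{
    // On older toolkits this must run before the context is created.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return 1;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    if (!prop.canMapHostMemory) return 1;  // card cannot do zero-copy

    int *hResult = nullptr;  // host view of the mapped buffer
    int *dResult = nullptr;  // device view of the same memory
    if (cudaHostAlloc((void **)&hResult, sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return 1;
    *hResult = 0;

    int rc = 1;
    if (cudaHostGetDevicePointer((void **)&dResult, hResult, 0)
            == cudaSuccess) {
        sumKernel<<<1, 1>>>(n, dResult);
        // The write is only guaranteed visible after synchronising.
        if (cudaGetLastError() == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess) {
            *sumOut = *hResult;
            rc = 0;
        }
    }

    cudaFreeHost(hResult);
    return rc;
}
```

Every access from the kernel to `dResult` crosses the PCI-e bus, which is why this suits small results rather than bulk data.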
