Slow runtime caused by cudaMemcpy()

I’m using cudaMemcpy() to read data from the device.
This ruins my runtime.

My simplified code:
Sorry the system won’t let me insert the code. I’ll try again later.



As you can see, the runtime jumps to 3000, which ruins my original (time-critical) program.

Is there any other way to get the data from the device?


You can possibly use streams to hide the cost of your transfer. This entails pinning host memory and doing asynchronous memory transfers, so the copy can run while the host (or another kernel) keeps working. Look in the programming guide; the example there is pretty poor (so maybe the SDK examples are better, but they are not great either).
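Since the original code never made it into the thread, here is a minimal sketch of what "pinned memory plus asynchronous transfers in a stream" could look like. The kernel `scale` and the helper `run_async_copy` are made-up names for illustration, not the poster's code; only the CUDA runtime calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int run_async_copy(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h_data = NULL, *d_data = NULL;
    cudaStream_t stream = NULL;
    int ok = 0;

    // Page-locked host memory: needed for the copies to be truly asynchronous.
    if (cudaMallocHost((void **)&h_data, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_data, bytes) == cudaSuccess &&
        cudaStreamCreate(&stream) == cudaSuccess) {
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        // All three operations are queued in the same stream and return immediately.
        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.0f);
        cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

        // The host could do unrelated work here; block only when the data is needed.
        ok = (cudaStreamSynchronize(stream) == cudaSuccess);
        for (int i = 0; ok && i < n; ++i)
            if (h_data[i] != 2.0f) ok = 0;
    }

    if (stream) cudaStreamDestroy(stream);
    if (d_data) cudaFree(d_data);
    if (h_data) cudaFreeHost(h_data);
    return ok ? 0 : 1;
}

int main()
{
    int status = run_async_copy(1 << 20);
    printf("async copy %s\n", status == 0 ? "succeeded" : "failed");
    return status;
}
```

Note that a single stream like this does not make the copy itself faster; it only frees the host while the copy is in flight. To overlap copies with kernels you would split the data into chunks and use several streams.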

Or maybe you can use some of that good ol' OpenGL interoperability to display your numbers directly on the screen :)

Streams help only if your card supports overlapping kernel execution with memory copies… and of course, if your application can leverage such a trick…
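A quick way to check whether a card can overlap copies with kernel execution is to ask the runtime how many asynchronous copy engines it has. This is a small sketch; the helper name `async_engine_count` is made up here, while `cudaDeviceGetAttribute` and `cudaDevAttrAsyncEngineCount` are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of asynchronous copy engines on `device`, or -1 on error.
// 0 means copies cannot overlap with kernels; 1 means one copy direction can
// overlap; 2 means uploads and downloads can both overlap with kernels.
int async_engine_count(int device)
{
    int engines = 0;
    if (cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device) != cudaSuccess)
        return -1;
    return engines;
}

int main()
{
    int engines = async_engine_count(0);
    if (engines < 0) {
        printf("could not query device 0\n");
        return 1;
    }
    printf("device 0 has %d async copy engine(s)\n", engines);
    return 0;
}
```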

Otherwise, one can consider using pinned memory or the zero-copy method…

Hey Sarnath,

Do you know the performance difference between using regular page-locked memory (cudaMallocHost) vs. using the “pinned” pageable memory (cudaHostAlloc)?

I'm sure there is a thread on it somewhere… :)


Hey Jim,

As long as your memory is pinned, “cudaMemcpy” will make use of the card's DMA engine for the transfer (no matter whether you allocate using cudaMallocHost or cudaHostAlloc). The latter API, I think, was introduced for zero-copy, i.e. kernels accessing host memory directly. But I don't think we are talking about that here…
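If you want to see this on your own machine, a rough sketch like the one below times the same device-to-host cudaMemcpy into a pageable buffer, a cudaMallocHost buffer, and a cudaHostAlloc buffer with default flags. The helper names `time_d2h` and `compare_host_allocations` are made up for illustration, and the numbers will depend on your system, so no expected timings are given.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Times one device-to-host copy; returns milliseconds, or a negative value on error.
float time_d2h(void *h_dst, const void *d_src, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = -1.0f;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(h_dst, d_src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Returns 0 if all three copies ran, 1 otherwise.
int compare_host_allocations(size_t bytes)
{
    void *d_src = NULL, *h_pinned_a = NULL, *h_pinned_b = NULL;
    void *h_pageable = malloc(bytes);   // ordinary pageable memory
    int status = 1;

    if (h_pageable &&
        cudaMalloc(&d_src, bytes) == cudaSuccess &&
        cudaMemset(d_src, 0, bytes) == cudaSuccess &&
        cudaMallocHost(&h_pinned_a, bytes) == cudaSuccess &&                     // pinned
        cudaHostAlloc(&h_pinned_b, bytes, cudaHostAllocDefault) == cudaSuccess)  // also pinned
    {
        float t_pageable = time_d2h(h_pageable, d_src, bytes);
        float t_pinned_a = time_d2h(h_pinned_a, d_src, bytes);
        float t_pinned_b = time_d2h(h_pinned_b, d_src, bytes);
        printf("pageable (malloc):       %.3f ms\n", t_pageable);
        printf("pinned (cudaMallocHost): %.3f ms\n", t_pinned_a);
        printf("pinned (cudaHostAlloc):  %.3f ms\n", t_pinned_b);
        status = (t_pageable >= 0.0f && t_pinned_a >= 0.0f && t_pinned_b >= 0.0f) ? 0 : 1;
    }

    if (h_pinned_b) cudaFreeHost(h_pinned_b);
    if (h_pinned_a) cudaFreeHost(h_pinned_a);
    if (d_src) cudaFree(d_src);
    free(h_pageable);
    return status;
}

int main()
{
    return compare_host_allocations((size_t)64 << 20);  // 64 MiB
}
```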

Let me know what you think…

Best Regards,

Yes, cudaHostAlloc() (with the cudaHostAllocMapped flag) is for zero-copy memory, which is something different again: the GPU can read and write host memory directly over the PCI-e bus. It is handy in some situations (like pushing small amounts of data back from a reduction kernel, for example), but it is not really in the same vein as pinned versus pageable memory.
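For what it's worth, here is a minimal sketch of that reduction case: a single-block sum kernel that writes its one result straight into mapped host memory, so no cudaMemcpy is needed to read it back. The kernel `sum_to_host` and the helper `zero_copy_sum` are hypothetical names; it assumes a device that can map host memory and a block size of 256.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Single-block reduction (launch with exactly 256 threads).
// Thread 0 writes the total through `result`, which points at mapped host memory.
__global__ void sum_to_host(const int *in, int n, int *result)
{
    __shared__ int partial[256];
    int tid = threadIdx.x;
    int acc = 0;
    for (int i = tid; i < n; i += blockDim.x) acc += in[i];
    partial[tid] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) *result = partial[0];
}

// Sums n ones on the GPU; returns the sum read from host memory, or -1 on error.
int zero_copy_sum(int n)
{
    int can_map = 0, sum = -1;
    int *h_in = NULL, *d_in = NULL, *h_result = NULL, *d_result = NULL;
    size_t bytes = (size_t)n * sizeof(int);

    // Older toolkits need this flag set before the context exists; newer runtimes
    // enable mapping implicitly, so a failure here is not treated as fatal.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear any error left by the call above

    cudaDeviceGetAttribute(&can_map, cudaDevAttrCanMapHostMemory, 0);
    h_in = (int *)malloc(bytes);
    if (can_map && h_in &&
        cudaMalloc((void **)&d_in, bytes) == cudaSuccess &&
        cudaHostAlloc((void **)&h_result, sizeof(int), cudaHostAllocMapped) == cudaSuccess &&
        cudaHostGetDevicePointer((void **)&d_result, h_result, 0) == cudaSuccess)
    {
        for (int i = 0; i < n; ++i) h_in[i] = 1;
        *h_result = 0;
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
        sum_to_host<<<1, 256>>>(d_in, n, d_result);
        // The write is only guaranteed visible to the host after synchronizing.
        if (cudaDeviceSynchronize() == cudaSuccess) sum = *h_result;
    }

    if (h_result) cudaFreeHost(h_result);
    if (d_in) cudaFree(d_in);
    free(h_in);
    return sum;
}

int main()
{
    int n = 10000;
    int sum = zero_copy_sum(n);
    printf("sum = %d (expected %d)\n", sum, n);
    return sum == n ? 0 : 1;
}
```

This only pays off for small results like this one; every access to mapped memory goes over the bus, so large or repeatedly read buffers are usually better off in device memory.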