How to Optimize Data Transfers in CUDA C/C++

Originally published at: https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak…

Nice article. Do I need to be worried about kernel calls that operate on a mix of host and device variables? Would you recommend using only device variables within a kernel?

Kernels are by definition device code. Therefore they must operate only on device memory, or on __managed__ memory (see my post on Unified Memory for more on this: http://devblogs.nvidia.com/...).
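To make that concrete, here is a minimal sketch (not from the original thread; kernel and variable names are illustrative) showing the rule in practice: the kernel dereferences only a device pointer and a __managed__ variable, never a plain host pointer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Unified (managed) memory: accessible from both host and device code.
__managed__ int scale = 2;

// A kernel may only dereference device or managed pointers, never
// host pointers obtained from malloc/new.
__global__ void scaleArray(int *d_data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_data[i] *= scale;
}

int main() {
  const int n = 1024;
  int h_data[n];
  for (int i = 0; i < n; ++i) h_data[i] = i;

  int *d_data;
  cudaMalloc(&d_data, n * sizeof(int));  // device memory
  cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

  scaleArray<<<(n + 255) / 256, 256>>>(d_data, n);

  cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
  printf("h_data[10] = %d\n", h_data[10]);
  cudaFree(d_data);
  return 0;
}
```

If a host array must be read by a kernel, either copy it to device memory first (as above) or allocate it with cudaMallocManaged so it is managed memory.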

Hi, Mark
Using pinned memory I achieved an 11 GB/s H2D transfer on a Titan X (PCIe 3.0 x16).
Do you think this could be improved?
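For context, a common way to check numbers like this is to time a pinned-memory copy with CUDA events; 11 GB/s is already close to the practical ceiling of a PCIe 3.0 x16 link. Below is a minimal sketch (sizes and names chosen for illustration, not from the article's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 256 * 1024 * 1024;  // 256 MiB transfer

  float *h_pinned, *d_buf;
  cudaMallocHost(&h_pinned, bytes);  // pinned (page-locked) host memory
  cudaMalloc(&d_buf, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Time the host-to-device copy with events so we measure GPU-side time.
  cudaEventRecord(start);
  cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFreeHost(h_pinned);
  cudaFree(d_buf);
  return 0;
}
```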

Hi Mark,
Thank you for a very interesting and helpful article. I have a question: I transfer big data to the GPU once, then create a cut of it using the GPU, transfer the result back to the CPU, and then repeat: new cut, new transfer. It is like scrolling a plane cut through 3D data. The most time-consuming operation is transferring the result back (about 5 MB), which takes about 300 ms. Is that reasonable? Is there a way to improve it, or is this not a task that is suitable for the GPU?
Thanks
Hadar

wonderful post! Very helpful.

for future visitors, I found this link helpful https://www.cs.virginia.edu...

Hi,

Why not just use zero copy, instead of pinned memory and explicit data transfers?
Zero copy with cudaHostAlloc() allocates pinned memory and eliminates the need for explicit data transfers with cudaMemcpy().
Also, in some cases it allows the kernel to execute while data is being transferred (avoiding the need to use multiple streams for the same effect).

Chapter 9.1.3 Zero copy
https://docs.nvidia.com/cud...

Zero copy is a useful tool for your tool belt, but so are pinned-memory transfers. If you will access the data from the device multiple times, it usually makes sense to copy it to device memory and access it there. But if you only need to access it once, from a kernel, then zero copy makes total sense.
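The zero-copy path can be sketched as follows (names are illustrative, and this assumes the device supports mapped host memory): the host buffer is allocated with cudaHostAlloc(..., cudaHostAllocMapped), and the kernel reads it directly through a device pointer obtained with cudaHostGetDevicePointer(), with no cudaMemcpy of the input.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum(const float *in, float *out, int n) {
  // Each access to `in` travels over PCIe: fine for a single pass,
  // slow if the kernel touches the same data repeatedly.
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += in[i];
  *out = s;
}

int main() {
  const int n = 1 << 20;

  float *h_in, *d_in, *d_out;
  // Mapped, pinned host allocation: visible to the device.
  cudaHostAlloc(&h_in, n * sizeof(float), cudaHostAllocMapped);
  for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

  // Device-side alias of the host buffer; no explicit input copy needed.
  cudaHostGetDevicePointer(&d_in, h_in, 0);
  cudaMalloc(&d_out, sizeof(float));

  sum<<<1, 1>>>(d_in, d_out, n);

  float result;
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("sum = %f\n", result);

  cudaFree(d_out);
  cudaFreeHost(h_in);
  return 0;
}
```

Note the trade-off stated above: here the input is read exactly once, so zero copy is a good fit; if the kernel re-read the buffer, copying it to device memory first would usually be faster.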

ok,

thanks

Hi All!
I’ve read the article and tried the code provided.
However, the results I've got are not as expected. I can't see any significant bandwidth increase for the "pinned memory" case. In some test runs it's even worse than for "usual" (pageable) memory. Any idea why, please?

Device: NVIDIA GeForce GTX 1060 6GB
Transfer size (MB): 16

Pageable transfers
Host to Device bandwidth (GB/s): 0.388620
Device to Host bandwidth (GB/s): 0.417222
Pinned transfers
Host to Device bandwidth (GB/s): 0.359622
Device to Host bandwidth (GB/s): 0.419293

Pageable transfers
Host to Device bandwidth (GB/s): 0.387992
Device to Host bandwidth (GB/s): 0.417588
Pinned transfers
Host to Device bandwidth (GB/s): 0.390906
Device to Host bandwidth (GB/s): 0.418569

Pageable transfers
Host to Device bandwidth (GB/s): 0.387717
Device to Host bandwidth (GB/s): 0.416807
Pinned transfers
Host to Device bandwidth (GB/s): 0.390276
Device to Host bandwidth (GB/s): 0.419327

Pageable transfers
Host to Device bandwidth (GB/s): 0.387582
Device to Host bandwidth (GB/s): 0.417603
Pinned transfers
Host to Device bandwidth (GB/s): 0.389954
Device to Host bandwidth (GB/s): 0.419298

I can't be sure from the information provided. Perhaps there is a problem with PCIe on your system. What OS are you on? Which CUDA version? Can you include the output of running nvidia-smi?