Hi.
I’m new to CUDA, previously focused on C++ software development.
My learning path involves going through the CUDA C++ Programming Guide Release 12.5 line by line, and practicing coding with cuda-samples.
The guide mentions that memory allocated with cudaMallocHost is automatically portable and mapped, so thanks to the Unified Virtual Address space I can pass the pointer directly to a kernel.
However, while reviewing related code references, I still see many examples where memory allocated with cudaMallocHost is used with cudaMemcpy or cudaMemcpyAsync.
I’m curious which approach is considered best practice, or whether the two methods are suited to different scenarios.
This is an involved topic, and I don’t know if a forum dialog will be sufficient. However, pinned memory has at least 2 canonical uses.
It can serve as directly accessible memory from either host or device code. Using it directly from device code will incur transfer costs, so it can appear to be much slower than accessing data from device memory. Therefore I would suggest, in general, that accessing pinned memory from device code be used carefully and sparingly. It would be hard to call it a best practice unless a very specific case is presented or in view. You’ll find this methodology referred to sometimes as “zero-copy”. There will be some situations where using “zero-copy” makes perfect sense.
Pinned memory is often used for the host side allocation in “typical” H<->D transfer activity. The reason for this is two-fold: A. It generally results in faster transfers. B. It is necessary to achieve overlap of copy and compute. For this type of activity, I would certainly call using pinned memory for the host side allocations a “best practice”.
In latency-critical applications, zero-copy can be faster than block-wise copying (depending on block size).
[On the other hand, synchronization (if needed) can be more complicated if you want to sync not only at kernel launch but also during the run.]
If you only need some of the data from the host, and the exact data locations are only known at device runtime, then zero-copy can save bandwidth.
[It will probably be simpler and faster than generating a list of addresses on the device, transferring it to the host, gathering the data there, and transferring that data back to the device. And transferring all the data up front would take the full bandwidth hit.]