Hi.
I’m new to CUDA, previously focused on C++ software development.
My learning path involves going through the CUDA C++ Programming Guide Release 12.5 line by line, and practicing coding with cuda-samples.
The guide mentions that memory allocated with cudaMallocHost is automatically portable and mapped, so thanks to the Unified Virtual Address space I can pass the pointer directly to a kernel.
However, while reviewing related code references, I still see many examples where memory allocated with cudaMallocHost is used with cudaMemcpy or cudaMemcpyAsync.
I’m curious which approach is considered best practice, or whether the two methods are suited to different scenarios.
This is an involved topic, and I don’t know if a forum dialog will be sufficient. However, pinned memory has at least 2 canonical uses.
It can serve as directly accessible memory from either host or device code. Using it directly from device code will incur transfer costs, so it can appear to be much slower than accessing data from device memory. Therefore I would suggest, in general, that accessing pinned memory from device code be used carefully and sparingly. It would be hard to call it a best practice unless a very specific case is presented or in view. You’ll find this methodology referred to sometimes as “zero-copy”. There will be some situations where using “zero-copy” makes perfect sense.
Pinned memory is often used for the host side allocation in “typical” H<->D transfer activity. The reason for this is two-fold: A. It generally results in faster transfers. B. It is necessary to achieve overlap of copy and compute. For this type of activity, I would certainly call using pinned memory for the host side allocations a “best practice”.
In latency-critical applications, zero-copy can be faster than block-wise copying (depending on block size).
[On the other hand, synchronization (if needed) can be more complicated if you want to sync not only at kernel launch but also during the run.]
If you only need some of the data from the host, and the exact data locations are only known at device runtime, then zero-copy can save bandwidth.
[It will probably be simpler and faster than generating a list of addresses on the device, transferring it to the host, gathering the data there, and transferring that data back to the device. And transferring all the data up front would take the full bandwidth hit.]